PBsim Taxanomic Classification ID/OOD benchmark dataset

Gavin Hearne; Gail Rosen

doi:10.5281/zenodo.19071325

17 Mar 2026

DOI: https://doi.org/10.5281/zenodo.19071325

Abstract

This dataset was created to evaluate fine-tuned genomic language models on both in-distribution (ID) and out-of-distribution (OOD) taxonomic classification tasks. The Woltka pipeline was used to compile the full genome dataset of 4634 genomes, with each genus represented by a single genome. Taxa were then filtered into categories with different levels of representation in the training set, producing several levels of distribution shift. Reads were generated with PBsim to simulate 6 kbp PacBio reads from the full genomes.

There are 5 files available:

train.csv - full training dataset of bacterial reads.
id_novel_genus.csv - ID test set when classifying at the family level. Genera are novel, but their families are shared with the training set.
ood_novel_family.csv - OOD test set. Families are novel, but their orders are shared with the training set.
ood_nonbacterial.csv - OOD test set. No taxonomic lineage is shared with the training set.
full_basic_lineage.csv - lineage metadata file.
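
The split logic described above can be illustrated with a minimal Python sketch. It is not the authors' filtering code. The rank keys (genus, family, order), the function name, and the example lineages are assumptions chosen for illustration; the actual column names in full_basic_lineage.csv may differ, and the example taxa are not claimed to appear in the dataset.

```python
from typing import Iterable, Mapping, Optional

# Ranks checked from most to least specific. These key names are assumptions;
# the real columns in full_basic_lineage.csv may be named differently.
RANKS = ("genus", "family", "order")


def deepest_shared_rank(
    lineage: Mapping[str, str],
    training_lineages: Iterable[Mapping[str, str]],
) -> Optional[str]:
    """Return the most specific rank at which `lineage` matches any training lineage."""
    training = list(training_lineages)
    for rank in RANKS:
        seen = {t[rank] for t in training if rank in t}
        if lineage.get(rank) in seen:
            return rank
    return None  # nothing shared at genus, family, or order level


# Hypothetical mapping from the deepest shared rank to the matching file.
# None is only analogous to ood_nonbacterial.csv: that file is defined by
# sharing no lineage with training, not by this three-rank check.
SPLIT_BY_SHARED_RANK = {
    "genus": "train.csv",
    "family": "id_novel_genus.csv",
    "order": "ood_novel_family.csv",
    None: "ood_nonbacterial.csv",
}

# Illustrative lineages only.
training = [
    {"genus": "Escherichia", "family": "Enterobacteriaceae", "order": "Enterobacterales"},
]
salmonella = {"genus": "Salmonella", "family": "Enterobacteriaceae", "order": "Enterobacterales"}
yersinia = {"genus": "Yersinia", "family": "Yersiniaceae", "order": "Enterobacterales"}
yeast = {"genus": "Saccharomyces", "family": "Saccharomycetaceae", "order": "Saccharomycetales"}

for name, lin in [("salmonella", salmonella), ("yersinia", yersinia), ("yeast", yeast)]:
    print(name, "->", SPLIT_BY_SHARED_RANK[deepest_shared_rank(lin, training)])
```

Under these assumptions, a genus absent from training whose family is present counts as ID for family-level classification, while a genus whose family is absent counts as OOD even when its order is shared.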

Details

Title: PBsim Taxanomic Classification ID/OOD benchmark dataset
Creators: Gavin Hearne (Drexel University); Gail Rosen
Publisher: Zenodo
Resource Type: Dataset
Language: English
Academic Unit: Electrical and Computer Engineering
Other Identifier: 991022171586304721
