Dataset
PBsim Taxonomic Classification ID/OOD benchmark dataset
17 Mar 2026
Abstract
This dataset was created to evaluate fine-tuned genomic language models on both in-distribution (ID) and out-of-distribution (OOD) taxonomic classification tasks.
The Woltka pipeline was used to compile the full genome dataset of 4634 genomes, with each genus represented by a single genome. Taxa were then filtered into categories with different levels of representation in the training set, producing varying degrees of distribution shift. Finally, PBsim was used to simulate 6 kbp PacBio reads from the full genomes.
There are 5 files available:
train.csv - full training dataset of bacterial reads.
id_novel_genus.csv - ID test set when classifying at the family level. Novel genera whose families are shared with the training set.
ood_novel_family.csv - OOD test set. Novel families whose orders are shared with the training set.
ood_nonbacterial.csv - OOD test set. No shared taxonomic lineage with the training set.
full_basic_lineage.csv - metadata lineage file
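The split logic described above can be illustrated with a minimal sketch. It is not the authors' filtering code: the `Lineage` type, the `split_category` function, and the returned labels are hypothetical names chosen for illustration, and the actual column layout of full_basic_lineage.csv is not specified here. The sketch only shows how a taxon's relationship to the training taxa (shared genus, shared family, shared order, or none) would map onto the test sets listed above.

```python
from typing import Iterable, NamedTuple


class Lineage(NamedTuple):
    """Hypothetical minimal lineage record (assumed fields, not the CSV schema)."""
    order: str
    family: str
    genus: str


def split_category(taxon: Lineage, training: Iterable[Lineage]) -> str:
    """Return which split a taxon would fall into relative to the training taxa."""
    training = list(training)
    train_genera = {t.genus for t in training}
    train_families = {t.family for t in training}
    train_orders = {t.order for t in training}

    if taxon.genus in train_genera:
        return "train"             # genus already represented in training
    if taxon.family in train_families:
        return "id_novel_genus"    # novel genus, family shared with training
    if taxon.order in train_orders:
        return "ood_novel_family"  # novel family, order shared with training
    # Nothing shared at order level or below; non-bacterial taxa would land
    # here, though this sketch does not check the domain itself.
    return "ood_no_shared_order"


# Illustrative training set containing a single bacterial genus.
train = [Lineage("Enterobacterales", "Enterobacteriaceae", "Escherichia")]

salmonella = Lineage("Enterobacterales", "Enterobacteriaceae", "Salmonella")
yersinia = Lineage("Enterobacterales", "Yersiniaceae", "Yersinia")
yeast = Lineage("Saccharomycetales", "Saccharomycetaceae", "Saccharomyces")

labels = [split_category(t, train) for t in (salmonella, yersinia, yeast)]
```

With this toy training set, Salmonella is in-distribution at the family level, Yersinia is a novel family within a shared order, and Saccharomyces shares no order with the training taxa.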
Details
- Title
- PBsim Taxonomic Classification ID/OOD benchmark dataset
- Creators
- Gavin Hearne - Drexel University
- Gail Rosen
- Publisher
- Zenodo
- Resource Type
- Dataset
- Language
- English
- Academic Unit
- Electrical and Computer Engineering
- Other Identifier
- 991022171586304721