PBsim Taxanomic Classification ID/OOD benchmark dataset
Dataset

Gavin Hearne and Gail Rosen
17 Mar 2026
URL: https://doi.org/10.5281/zenodo.19071325

Abstract

This dataset was created to evaluate the performance of fine-tuned genomic language models on both in-distribution (ID) and out-of-distribution (OOD) taxonomic classification tasks. The Woltka pipeline was used to compile the full genome dataset, which consists of 4634 genomes, with each genus represented by a single genome. Taxa were then filtered into categories with different levels of representation in the training set to produce varying degrees of distribution shift. Reads were generated with PBsim to simulate 6 kbp PacBio reads from the full genomes.

There are 5 files available:

train.csv - full training dataset of bacterial reads.
id_novel_genus.csv - ID test set when classifying at the family level. Novel genus, but shared family with training.
ood_novel_family.csv - OOD test set. Novel family, but shared order with training.
ood_nonbacterial.csv - OOD test set. No shared taxonomic lineage with training.
full_basic_lineage.csv - metadata lineage file.
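The relationship between the test sets and the training set can be illustrated with a small sketch. It assigns a lineage to one of the shift categories described in the abstract by finding the most specific rank it shares with any training lineage. The rank names, the dictionary layout, the helper names, the "seen_genus" and "other" labels, and the toy lineages are all assumptions made for illustration; they are not taken from full_basic_lineage.csv, whose actual columns may differ.

```python
# Hypothetical rank order; the real lineage file may use different names or ranks.
RANKS = ["domain", "phylum", "class", "order", "family", "genus"]


def deepest_shared_rank(train_lineages, lineage):
    """Return the most specific rank at which `lineage` agrees with at least
    one training lineage on every rank down to that level, or None."""
    shared = None
    for depth, rank in enumerate(RANKS, start=1):
        prefix = tuple(lineage[r] for r in RANKS[:depth])
        if any(tuple(t[r] for r in RANKS[:depth]) == prefix for t in train_lineages):
            shared = rank
        else:
            break
    return shared


def shift_category(train_lineages, lineage):
    """Map the deepest shared rank onto the categories named in the abstract."""
    shared = deepest_shared_rank(train_lineages, lineage)
    if shared == "genus":
        return "seen_genus"        # genus already present in training
    if shared == "family":
        return "id_novel_genus"    # novel genus, shared family
    if shared == "order":
        return "ood_novel_family"  # novel family, shared order
    if shared is None:
        return "ood_nonbacterial"  # no shared lineage at any rank
    return "other"                 # shared only above the order level


def make_lineage(*names):
    """Build a lineage dict from names given in RANKS order."""
    return dict(zip(RANKS, names))


# Toy lineages for illustration only; not drawn from the dataset.
TRAIN = [make_lineage("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
                      "Enterobacterales", "Enterobacteriaceae", "Escherichia")]
SAME_FAMILY = make_lineage("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
                           "Enterobacterales", "Enterobacteriaceae", "Salmonella")
SAME_ORDER = make_lineage("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
                          "Enterobacterales", "Yersiniaceae", "Yersinia")
NON_BACTERIAL = make_lineage("Archaea", "Euryarchaeota", "Methanobacteria",
                             "Methanobacteriales", "Methanobacteriaceae",
                             "Methanobacterium")

if __name__ == "__main__":
    for lin in (SAME_FAMILY, SAME_ORDER, NON_BACTERIAL):
        print(lin["genus"], shift_category(TRAIN, lin))
```

A check of this kind could be run over the lineage metadata to confirm that each test file contains only taxa at the intended distance from the training set, once the helper is adapted to the real column names.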
