A Dataset for Drug Resistance Classification from Antimicrobial DNA Sequences
Dataset   Open access

Hyunwoo Yoo, Bahrad Sokhansanj, James Brown and Gail Rosen
01 Jan 2024
URL: https://doi.org/10.5281/zenodo.15213478

Abstract

This dataset provides a curated and standardized collection of antimicrobial resistance (AMR) gene sequences and annotations for drug resistance classification tasks. It integrates entries from the Comprehensive Antibiotic Resistance Database (CARD) and MEGARes v3.0, and unifies resistance labels using the Antibiotic Resistance Ontology (ARO). To enhance reliability, classes with fewer than 15 samples were excluded. Each data sample includes a full-length nucleotide sequence, along with harmonized annotations for Drug Class, Resistance Mechanism, and Gene Family.

The dataset covers 9 major antimicrobial Drug Classes: Beta-lactams, Aminoglycosides, Glycopeptides, Tetracyclines, Fluoroquinolones, MLS (Macrolide-Lincosamide-Streptogramin), Sulfonamides, Phenicol, and Multi-drug resistance. Resistance mechanisms include categories such as antibiotic inactivation, target alteration, efflux, target protection, target replacement, and reduced permeability to antibiotics. Gene family annotations show a long-tailed distribution, with frequently observed families including beta-lactamases, aminoglycoside-modifying enzymes, major facilitator superfamily (MFS) efflux pumps, ribosomal protection proteins, and rRNA methyltransferases.

This dataset has been used in studies involving sequence-based classification models such as Nucleotide Transformer. For model training, input sequences were truncated to 1000 base pairs, although the dataset itself provides full-length sequences. It is suitable for AMR prediction tasks and supports research in computational biology, genomic analysis, and biomedical natural language processing.
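The two preprocessing steps mentioned in the abstract, dropping classes with fewer than 15 samples and truncating model inputs to 1000 base pairs, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record does not describe the file layout, so the field names `sequence` and `drug_class` are hypothetical, and truncation is assumed to keep the first 1000 bases, which the abstract does not specify.

```python
from collections import Counter

MIN_SAMPLES_PER_CLASS = 15  # threshold stated in the abstract
MAX_INPUT_LENGTH = 1000     # truncation length (base pairs) stated in the abstract


def drop_rare_classes(records, label_key="drug_class",
                      min_count=MIN_SAMPLES_PER_CLASS):
    """Keep only records whose label occurs at least min_count times.

    `records` is a list of dicts; `label_key` is a hypothetical field name.
    """
    counts = Counter(record[label_key] for record in records)
    return [r for r in records if counts[r[label_key]] >= min_count]


def truncate_sequence(sequence, max_len=MAX_INPUT_LENGTH):
    """Return the first max_len bases (assumed direction of truncation).

    Sequences shorter than max_len are returned unchanged.
    """
    return sequence[:max_len]


# Toy records (made up for illustration, not taken from the dataset).
toy_records = (
    [{"sequence": "ATGC" * 300, "drug_class": "Beta-lactams"}] * 15
    + [{"sequence": "ATGC", "drug_class": "Rare class"}] * 3
)

kept = drop_rare_classes(toy_records)
model_inputs = [truncate_sequence(r["sequence"]) for r in kept]
```

Because the dataset ships full-length sequences, truncation like this would be applied only when preparing model inputs, leaving the stored records untouched.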

