Conference proceeding
Normalized Compression Distance for DNA Classification
Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, v 93
22 Nov 2024
Abstract
The increasingly common use of next-generation sequencing has enabled greater access to large-scale (meta-)genomic datasets than ever before. The resulting deluge of data has made the search for efficient DNA sequence classification methods an urgent challenge for downstream analyses. Traditional alignment-based methods for DNA sequence classification struggle with increasingly large volumes of sequence data because of the computational complexity of alignment. Consequently, there is a need for methods capable of identifying sequences without alignment. Normalized compression distance (NCD) has proven effective in text classification as a low-resource alternative to deep neural networks, using compression algorithms to approximate the Kolmogorov information distance. To apply this technique to genomics tasks handled by tools such as Many-against-Many sequence searching (MMseqs) and Kraken2, we have explored the use of a gzip-based NCD for both gene labeling of open reading frames (ORFs) and taxonomic classification of short reads. This work demonstrates the efficacy of NCD in diverse multitask classification, and we further explore the capacity of NCD to classify larger libraries of metagenomic reads.
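The gzip-based NCD mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard NCD definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, and it assumes a simple nearest-neighbor rule for assigning labels. The function names `compressed_size`, `ncd`, and `classify` are hypothetical and do not come from the paper.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input (the C(.) term in NCD)."""
    return len(gzip.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share very little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, references: list[tuple[str, str]]) -> str:
    """Return the label of the reference sequence closest to the query.

    `references` is a list of (sequence, label) pairs. Nearest-neighbor
    assignment is an assumption made for this sketch.
    """
    q = query.encode()
    _, best_label = min(references, key=lambda ref: ncd(q, ref[0].encode()))
    return best_label
```

For very short inputs the fixed gzip header and the block overhead dominate the compressed length, so distances computed on sequences of only a few dozen bases are unreliable. A sketch like this is more meaningful on reads or ORFs of a few hundred bases or more.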
Details
- Title
- Normalized Compression Distance for DNA Classification
- Creators
- Gavin L.A. Hearne - Drexel University, Philadelphia, PA, USA
- Mohammad S. Refahi - Drexel University
- Haozhe Neil Duan - Drexel University
- James R. Brown - Drexel University
- Gail L. Rosen - Drexel University
- Publication Details
- Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, v 93
- Conference
- BCB '24: 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
- Series
- ACM Conferences
- Publisher
- ACM
- Number of pages
- 1
- Grant note
- National Science Foundation: 1936791, 1919691, 2107108
This work is supported by the National Science Foundation under Grant Numbers 1936791, 1919691, and 2107108. We thank the University Research Computing Facility for their paid services. We thank Dr. Bahrad Sokhansanj for discussions about related methods.
- Resource Type
- Conference proceeding
- Language
- English
- Academic Unit
- Electrical and Computer Engineering
- Web of Science ID
- WOS:001430744700092
- Scopus ID
- 2-s2.0-85216413637
- Other Identifier
- 991022005488004721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Web of Science research areas
- Computer Science, Artificial Intelligence
- Mathematical & Computational Biology
- Medical Informatics