Journal article
Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization
Communications biology, v 8(1), 517
29 Mar 2025
PMID: 40155693
Featured in Collection: UN Sustainable Development Goals @ Drexel
Abstract
Analysis of genomic and metagenomic sequences is inherently more challenging than that of amino acid sequences due to the higher divergence among evolutionarily related nucleotide sequences, variable k-mer and codon usage within and among genomes of diverse species, and poorly understood selective constraints. We introduce Scorpio (Sequence Contrastive Optimization for Representation and Predictive Inference on DNA), a versatile framework designed for nucleotide sequences that employs contrastive learning to improve embeddings. By leveraging pre-trained genomic language models and k-mer frequency embeddings, Scorpio demonstrates competitive performance in diverse applications, including taxonomic and gene classification, antimicrobial resistance (AMR) gene identification, and promoter detection. A key strength of Scorpio is its ability to generalize to novel DNA sequences and taxa, addressing a significant limitation of alignment-based methods. Scorpio has been tested on multiple datasets with DNA sequences of varying lengths (long and short) and shows robust inference capabilities. Additionally, we provide an analysis of the biological information underlying this representation, including correlations between codon adaptation index as a gene expression factor, sequence similarity, and taxonomy, as well as the functional and structural information of genes.
Scorpio enhances genomic language models with contrastive learning and hierarchical sampling to improve classification, generalization, and biological representation, enabling transferable embeddings to other domains.
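As a rough illustration of two ideas named in the abstract, the sketch below builds a k-mer frequency vector for a DNA sequence and evaluates a triplet margin loss on such vectors. It is not Scorpio's implementation: the record does not state which contrastive objective or encoder the framework uses, so the triplet loss, the cosine distance, the margin value, and the function names `kmer_frequencies`, `cosine_distance`, and `triplet_loss` are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k=3):
    """Normalized counts of every A/C/G/T k-mer in seq, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made purely of A/C/G/T contribute; windows with N etc. are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss (illustrative choice): zero once the positive is closer
    to the anchor than the negative by at least `margin`."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)


# Hypothetical toy sequences: anchor and positive share composition, negative does not.
anchor = kmer_frequencies("ATGGCGGCGATGGCG")
positive = kmer_frequencies("ATGGCGGCGATGGCC")
negative = kmer_frequencies("TTTTAAAATTTTAAA")
loss = triplet_loss(anchor, positive, negative)
```

In a trained model the distances would be computed on learned embeddings and the loss minimized by gradient descent; here the raw frequency vectors stand in for embeddings only to show how the loss behaves.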
Details
- Title
- Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization
- Creators
- Mohammadsaleh Refahi - Electrical and Computer Engineering, Drexel University; Bahrad A. Sokhansanj - Electrical and Computer Engineering, Drexel University; Joshua C. Mell - Drexel University; James R. Brown - Drexel University; Hyunwoo Yoo - Electrical and Computer Engineering, Drexel University; Gavin Hearne - Electrical and Computer Engineering, Drexel University; Gail L. Rosen - Electrical and Computer Engineering, Drexel University
- Publication Details
- Communications biology, v 8(1), 517
- Publisher
- Nature Publishing Group
- Number of pages
- 18
- Grant note
- 1936791; 1919691; 2107108 / National Science Foundation (NSF) (100000001)
- Resource Type
- Journal article
- Language
- English
- Academic Unit
- Microbiology and Immunology; Electrical and Computer Engineering
- Web of Science ID
- WOS:001454825400001
- Scopus ID
- 2-s2.0-105001402577
- Other Identifier
- 991022041939804721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Web of Science research areas
- Biology