Conference proceeding
Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN
Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 21
11 Oct 2025
Abstract
The exponential growth of DNA sequencing has produced vast amounts of genetic data, yet alignment-based similarity search tools such as BLAST and MMseqs2 struggle to scale to these large, noisy datasets. Embedding-based approximate nearest-neighbor search promises both efficiency and biological fidelity by learning dense vector representations of genes and employing high-performance vector indices. This highlight summarizes recent work that systematically benchmarks two state-of-the-art vector search libraries, Facebook's FAISS and Google's ScaNN, on biologically meaningful gene embeddings derived from a transformer-based language model. The goal is to assess their ability to detect novel sequences and to support gene annotation and taxonomic classification, and to illuminate the trade-offs between accuracy, runtime and memory usage for metagenomic retrieval.
The study uses the Scorpio-Gene-Taxa dataset of 400 bp DNA fragments representing 497 housekeeping genes across 2,046 bacterial and archaeal genera. A database of 165,615 sequences is indexed, and two query sets (7,000 in-domain and 7,000 out-of-domain fragments) are encoded into 1,024-dimensional vectors with the MetaBERTA-BigBird model before being searched. FAISS is explored under multiple indexing strategies: exact "Flat" search; inverted files (IVF); and product quantization (PQ), with optional pre-processing through principal component analysis (PCA) and optimized PQ (OPQ). ScaNN is tuned with combinations of vector partitioning, asymmetric hashing and optional re-scoring. These configurations illuminate how hyperparameters affect speed and recall.
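The difference between exact "Flat" search and inverted-file (IVF) search can be illustrated with a minimal, standard-library Python sketch. This is not the study's code and does not call FAISS: the toy dimensions, the random vectors, the crude centroid selection (sampling database vectors rather than training k-means) and the names `flat_search`, `build_ivf`, `ivf_search` and `nprobe` are all assumptions made for illustration. In FAISS itself, comparable indexes are typically requested through factory strings such as "Flat" or "IVF4096,Flat".

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(database, query):
    """Exact search: compare the query against every database vector."""
    return min(range(len(database)), key=lambda i: sq_dist(database[i], query))


def build_ivf(database, centroids):
    """Assign each database vector to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, vec in enumerate(database):
        nearest = min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], vec))
        lists[nearest].append(i)
    return lists


def ivf_search(database, centroids, lists, query, nprobe):
    """Approximate search: scan only the nprobe inverted lists closest to the query."""
    order = sorted(range(len(centroids)), key=lambda j: sq_dist(centroids[j], query))
    candidates = [i for j in order[:nprobe] for i in lists[j]]
    return min(candidates, key=lambda i: sq_dist(database[i], query))


# Toy data (assumption): random 16-dimensional vectors standing in for gene embeddings.
rng = random.Random(0)
dim, n_db, n_lists = 16, 500, 10
database = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_db)]
centroids = rng.sample(database, n_lists)  # crude stand-in for k-means training
lists = build_ivf(database, centroids)
query = [rng.gauss(0, 1) for _ in range(dim)]

exact = flat_search(database, query)
approx = ivf_search(database, centroids, lists, query, nprobe=3)
full = ivf_search(database, centroids, lists, query, nprobe=n_lists)
```

Probing every list (`nprobe=n_lists`) scans the whole database and so recovers the exact result, while a smaller `nprobe` scans fewer candidates and may miss the true nearest neighbor. This is the kind of speed-recall knob that partitioned indexes expose.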
Results show that embedding-based search markedly outperforms alignment-based baselines. In FAISS, PCA-enhanced Flat indexes yield the highest top-1 accuracy (≈36%) while avoiding quantization losses; PCAWR64,IVF4096,Flat offers a strong trade-off between accuracy (≈33%) and fast query time (~0.3 s). Heavily compressed PQ/OPQ configurations accelerate queries but incur substantial accuracy drops. ScaNN's brute-force mode attains ≈33% accuracy but is slower, whereas asymmetric hashing combined with partitioning and re-ordering reduces query time to ~1.8 s at modest accuracy (≈25%). When the best configurations are compared, FAISS consistently achieves superior accuracy and lower latency than ScaNN, and both embedding methods dramatically outperform MMseqs2, which achieves only ~1.8% accuracy on short fragments.
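For readers unfamiliar with the metric, top-1 accuracy is commonly computed as the fraction of queries whose single best hit carries the same label as the query. The sketch below assumes that definition; the abstract does not state exactly how labels were compared, and the function name `top1_accuracy` and the example gene labels are hypothetical rather than taken from the dataset.

```python
def top1_accuracy(retrieved_labels, true_labels):
    """Fraction of queries whose top-ranked hit has the correct label.

    retrieved_labels[i] is the label of the nearest database entry for query i;
    true_labels[i] is the known label of query i.
    """
    if len(retrieved_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        return 0.0
    hits = sum(r == t for r, t in zip(retrieved_labels, true_labels))
    return hits / len(true_labels)


# Hypothetical labels for four queries (illustrative only).
retrieved = ["rpoB", "gyrA", "recA", "rpoB"]
expected = ["rpoB", "gyrB", "recA", "dnaK"]
accuracy = top1_accuracy(retrieved, expected)  # 2 of 4 hits match -> 0.5
```

The same function applies whatever index produced the hits, so accuracy figures from different index configurations can be compared on equal terms.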
Embedding-based approximate nearest-neighbor (ANN) search provides a scalable, biologically meaningful framework for novelty detection, gene annotation and metagenomic taxonomic classification. Its tunable indexing parameters allow practitioners to adjust the speed-accuracy trade-off for specific applications, delivering orders-of-magnitude speedups over alignment tools while maintaining or improving recall.
Details
- Title
- Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN
- Creators
- Mohammadsaleh Refahi - Drexel University
- Gavin Hearne - Drexel University, Philadelphia, PA, USA
- Harrison Muller - Drexel University
- Kieran Lynch - Drexel University
- Bahrad A. Sokhansanj - Drexel University
- James R. Brown - Drexel University
- Gail Rosen - Drexel University
- Publication Details
- Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 21
- Conference
- BCB Companion '25: Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
- Series
- ACM Conferences
- Publisher
- Association for Computing Machinery
- Number of pages
- 1
- Grant note
- National Science Foundation: 1936791, 1919691, 2107108
This work was supported by the National Science Foundation under Grant Numbers #1936791, #1919691, and #2107108. We acknowledge the University Research Computing Facility for providing paid computational services.
- Resource Type
- Conference proceeding
- Language
- English
- Academic Unit
- Electrical and Computer Engineering
- Web of Science ID
- WOS:001661442600021
- Other Identifier
- 991022138673504721