Characterizing the Empirical Distribution of Prokaryotic Genome n-mers in the Presence of Nullomers
Journal article

Loni Philip Tabb, Wei Zhao, Jingyu Huang and Gail L Rosen
Journal of computational biology, v 21(10), pp 732-740
01 Oct 2014
PMID: 25075627

Abstract

Characterizing the empirical distribution of n-mer frequencies is a vital step in understanding the entire genome. It allows researchers to examine how complex the genome really is and to move beyond simple, traditional modeling frameworks that are often biased in the presence of abundant and/or extremely rare words. We hypothesize that models based on the negative binomial distribution and its zero-inflated counterpart will characterize the n-mer distributions of genomes better than the Poisson. Our study examined the empirical distribution of the frequency of n-mers (6 ≤ n ≤ 11) in 2,199 genomes. We considered four distributions: Poisson, negative binomial, zero-inflated Poisson, and zero-inflated negative binomial (ZINB). The number of genomes with nullomers among the 6-, 7-, and 8-mers was 150, 602, and 2,012, respectively, whereas all of the genomes had nullomers among the 9-, 10-, and 11-mers. For each n-mer length considered, the negative binomial model performed best for at least 93% of the 2,199 genomes; however, a small percentage (<7%) of the genomes favored the ZINB. The negative binomial and zero-inflated distributions extend the traditional Poisson setting and are more flexible in handling overdispersion, which can be caused by an increase in nullomers. To characterize the distribution of n-mer frequencies, researchers should also consider other discrete distributions that are more flexible and adjust for possible overdispersion.
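
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names are hypothetical, the negative binomial parameters are estimated by the method of moments rather than maximum likelihood, AIC is assumed as the comparison criterion (the abstract does not name one), and the zero-inflated models are left out for brevity. The sketch counts every possible n-mer over the alphabet ACGT, including absent words (nullomers), and computes the variance-to-mean ratio along with the Poisson and negative binomial log-likelihoods.

```python
from collections import Counter
from itertools import product
from math import lgamma, log


def count_nmers(seq, n, alphabet="ACGT"):
    """Count every possible n-mer; words that never occur (nullomers) get 0."""
    observed = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return [observed["".join(word)] for word in product(alphabet, repeat=n)]


def mean_and_variance(counts):
    """Population mean and variance of the n-mer counts."""
    m = sum(counts) / len(counts)
    v = sum((k - m) ** 2 for k in counts) / len(counts)
    return m, v


def poisson_loglik(counts):
    """Poisson log-likelihood at the MLE (rate = sample mean, assumed > 0)."""
    lam, _ = mean_and_variance(counts)
    return sum(k * log(lam) - lam - lgamma(k + 1) for k in counts)


def negbin_loglik(counts):
    """Negative binomial log-likelihood at method-of-moments estimates.

    Assumption: moment estimates stand in for maximum likelihood, so the
    result is only an approximation to the best achievable fit.
    """
    m, v = mean_and_variance(counts)
    if v <= m:
        raise ValueError("no overdispersion: moment estimate is undefined")
    r = m * m / (v - m)   # size (dispersion) parameter
    p = r / (r + m)       # success probability
    return sum(
        lgamma(k + r) - lgamma(r) - lgamma(k + 1) + r * log(p) + k * log(1 - p)
        for k in counts
    )


def aic(loglik, n_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik


# Toy input for illustration only; real genomes are far longer.
counts = count_nmers("ACGTACGTAA", 2)
nullomers = counts.count(0)
mean, var = mean_and_variance(counts)
dispersion = var / mean   # values above 1 indicate overdispersion
aic_poisson = aic(poisson_loglik(counts), 1)
aic_negbin = aic(negbin_loglik(counts), 2)
```

A variance-to-mean ratio above 1 indicates overdispersion, the situation in which the abstract expects the negative binomial to outperform the Poisson. On an input as short as this toy sequence, the AIC comparison is not informative.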

Details

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types
Domestic collaboration
Web of Science research areas
Biochemical Research Methods
Biotechnology & Applied Microbiology
Computer Science, Interdisciplinary Applications
Mathematical & Computational Biology
Statistics & Probability