Dataset
Data from: Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus
30 Jan 2019
Abstract
Aligning sequences for phylogenetic analysis (multiple sequence alignment;
MSA) is an important but increasingly computationally expensive step given
the recent surge in DNA sequence data. Much of this sequence data is
publicly available, but can be extremely fragmentary (i.e., a combination
of full genomes and genomic fragments), which can compound the
computational issues related to MSA. Traditionally, alignments are
produced with automated algorithms and then checked and/or corrected “by
eye” prior to phylogenetic inference. However, this manual curation is
inefficient at the data scales required of modern phylogenetics and
results in alignments that are not reproducible. Recently, methods have
been developed for fully automating alignments of large data sets, but it
is unclear if these methods produce alignments that result in compatible
phylogenies when compared to more traditional alignment approaches that
combine automated and manual methods. Here we use approximately 33,000
publicly available sequences from the hepatitis B virus (HBV), a globally
distributed and rapidly evolving virus, to compare different alignment
approaches. Using one data set composed exclusively of whole genomes and
a second that also included sequence fragments, we compared three MSA
methods: (1) a purely automated approach using traditional software, (2)
an automated approach including manual editing by eye, and (3) more recent
fully automated approaches. To understand how these methods affect
phylogenetic results, we compared resulting tree topologies based on these
different alignment methods using multiple metrics. We further determined
if the monophyly of existing HBV genotypes was supported in phylogenies
estimated from each alignment type and under different statistical support
thresholds. Traditional and fully automated alignments produced similar
HBV phylogenies. Although there was variability between branch support
thresholds, allowing lower support thresholds tended to result in more
differences among trees. Therefore, differences between the trees could be
best explained by phylogenetic uncertainty unrelated to the MSA method
used. Nevertheless, automated alignment approaches did not require human
intervention and were therefore considerably less time-intensive than
traditional approaches. Because of this, we conclude that fully automated
algorithms for MSA are fully compatible with older methods even in
extremely difficult-to-align data sets. Additionally, we found that most
HBV diagnostic genotypes did not correspond to evolutionarily sound
groups, regardless of alignment type and support threshold. This suggests
there may be errors in genotype classification in the database or that HBV
genotypes may need a revision.
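The two comparisons described in the abstract, topology differences between trees and monophyly of genotype groups, can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the metrics used, so the clade-based symmetric difference below (a rooted variant of the Robinson-Foulds distance) is an assumed stand-in, branch support thresholds are not modeled, and the trees, taxon labels, and function names are hypothetical.

```python
def clades(tree):
    """Return (leaf set of tree, set of every clade in tree as frozensets).

    A tree is a nested tuple; a leaf is a string label.
    """
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, {leaf}
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade rooted at this internal node
    return leaves, found


def clade_distance(tree_a, tree_b):
    """Count clades present in exactly one of the two rooted trees."""
    _, clades_a = clades(tree_a)
    _, clades_b = clades(tree_b)
    return len(clades_a ^ clades_b)


def is_monophyletic(tree, group):
    """True if the taxa in `group` form exactly one clade of `tree`."""
    _, all_clades = clades(tree)
    return frozenset(group) in all_clades


# Hypothetical toy trees standing in for phylogenies from two alignments.
# Labels A1, A2 mimic sequences assigned to genotype "A", and so on.
tree_manual = ((("A1", "A2"), "B1"), ("C1", "C2"))
tree_auto = ((("A1", "B1"), "A2"), ("C1", "C2"))

distance = clade_distance(tree_manual, tree_auto)
genotype_a_manual = is_monophyletic(tree_manual, {"A1", "A2"})
genotype_a_auto = is_monophyletic(tree_auto, {"A1", "A2"})
genotype_c_auto = is_monophyletic(tree_auto, {"C1", "C2"})
```

In this toy case the trees differ only in how A1, A2, and B1 are grouped, so genotype "A" is monophyletic in one tree but not the other, while genotype "C" is recovered in both. A real analysis on tens of thousands of sequences would rely on an established phylogenetics library rather than hand-rolled code like this.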
Details
- Title
- Data from: Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus
- Creators
- Therese A. Catanach - Drexel University
- Andrew D. Sweet - University of Illinois Urbana-Champaign
- Nam-Phuong D. Nguyen - University of San Diego
- Rhiannon M. Peery - University of Alberta
- Andrew H. Debevec - University of Illinois Urbana-Champaign
- Andrea K. Thomer - University of Michigan
- Amanda C. Owings - University of Illinois Urbana-Champaign
- Bret M. Boyd - University of Illinois Urbana-Champaign
- Aron D. Katz - University of Illinois Urbana-Champaign
- Felipe N. Soto-Adames - Florida Department of Agriculture and Consumer Services
- Julie M. Allen - University of Nevada, Reno
- Publisher
- Dryad
- Resource Type
- Dataset
- Language
- English
- Academic Unit
- Ornithology
- Other Identifier
- 991022048367404721