Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization
Journal article   Open access

Zhengqiao Zhao, Bahrad A Sokhansanj, Charvi Malhotra, Kitty Zheng and Gail L Rosen
PLoS computational biology, v 16(9), pp e1008269-e1008269
Sep 2020
PMID: 32941419
url
https://doi.org/10.1371/journal.pcbi.1008269
Published, Version of Record (VoR); CC BY 4.0; Open access

Abstract

Subject terms: Betacoronavirus - classification; Betacoronavirus - genetics; Coronavirus Infections - epidemiology; Coronavirus Infections - transmission; Coronavirus Infections - virology; COVID-19; Evolution, Molecular; Genetic Markers - genetics; Genome, Viral - genetics; Genomics - methods; Humans; Mutation - genetics; Pandemics; Phylogeny; Pneumonia, Viral - epidemiology; Pneumonia, Viral - transmission; Pneumonia, Viral - virology; RNA, Viral - genetics; SARS-CoV-2; Sequence Alignment; Sequence Analysis, RNA; Spatio-Temporal Analysis
We propose an efficient framework for genetic subtyping of SARS-CoV-2, the novel coronavirus that causes the COVID-19 pandemic. Efficient viral subtyping enables visualization and modeling of the geographic distribution and temporal dynamics of disease spread. Subtyping thereby advances the development of effective containment strategies and, potentially, therapeutic and vaccine strategies. However, identifying viral subtypes in real-time is challenging: SARS-CoV-2 is a novel virus, and the pandemic is rapidly expanding. Viral subtypes may be difficult to detect due to rapid evolution; founder effects are more significant than selection pressure; and the clustering threshold for subtyping is not standardized. We propose to identify mutational signatures of available SARS-CoV-2 sequences using a population-based approach: an entropy measure followed by frequency analysis. These signatures, Informative Subtype Markers (ISMs), define a compact set of nucleotide sites that characterize the most variable (and thus most informative) positions in the viral genomes sequenced from different individuals. Through ISM compression, we find that certain distant nucleotide variants covary, including non-coding and ORF1ab sites covarying with the D614G spike protein mutation which has become increasingly prevalent as the pandemic has spread. ISMs are also useful for downstream analyses, such as spatiotemporal visualization of viral dynamics. By analyzing sequence data available in the GISAID database, we validate the utility of ISM-based subtyping by comparing spatiotemporal analyses using ISMs to epidemiological studies of viral transmission in Asia, Europe, and the United States. In addition, we show the relationship of ISMs to phylogenetic reconstructions of SARS-CoV-2 evolution, and therefore, ISMs can play an important complementary role to phylogenetic tree-based analysis, such as is done in the Nextstrain project. 
The developed pipeline dynamically generates ISMs for newly added SARS-CoV-2 sequences and updates the visualization of pandemic spatiotemporal dynamics. It is available on GitHub at https://github.com/EESI/ISM (Jupyter notebook) and https://github.com/EESI/ncov_ism (command line tool), and via an interactive website at https://covid19-ism.coe.drexel.edu/.
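
The "entropy measure followed by frequency analysis" described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`site_entropy`, `pick_informative_sites`, `ism_frequencies`), the fixed `top_k` cutoff, and the toy alignment are all assumptions made for illustration. The published method may handle ambiguous bases, gaps, and thresholds differently. The sketch scores each alignment column by Shannon entropy, keeps the most variable columns, and counts the resulting marker strings.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the bases observed at one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def pick_informative_sites(aligned, top_k):
    """Return the indices of the top_k highest-entropy columns, in genome order.

    The fixed top_k cutoff is an assumption of this sketch, not the paper's rule.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    entropies = [site_entropy([seq[i] for seq in aligned]) for i in range(length)]
    ranked = sorted(range(length), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:top_k])


def ism_frequencies(aligned, sites):
    """Build one marker string per sequence from the chosen sites and count them."""
    return Counter("".join(seq[i] for i in sites) for seq in aligned)


# Toy alignment (made-up data, not real SARS-CoV-2 sequences).
toy_alignment = [
    "ACGTAC",
    "ACGTAC",
    "ATGTGC",
    "ATGTGC",
    "ATGTAC",
]
sites = pick_informative_sites(toy_alignment, top_k=2)
subtypes = ism_frequencies(toy_alignment, sites)
print(sites)
print(subtypes.most_common())
```

In this toy case only two columns vary, so the marker string for each sequence is two bases long. Sequences that share a marker string fall into the same subtype, and the counts give the relative abundance that a spatiotemporal plot would track.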

Metrics

7 Record Views
32 citations in Scopus

Details

UN Sustainable Development Goals (SDGs)

This publication has contributed to the advancement of the following goals:

#3 Good Health and Well-Being

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Web of Science research areas
Biochemical Research Methods
Mathematical & Computational Biology