Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning

Bahrad A. Sokhansanj; Gail L. Rosen

doi:10.1016/j.compbiomed.2022.105969

Journal article

Open access

Peer reviewed

Computers in biology and medicine, v 149, 105969

Oct 2022

DOI: https://doi.org/10.1016/j.compbiomed.2022.105969

Featured in Collection : UN Sustainable Development Goals @ Drexel

Files and links (1)

https://doi.org/10.1016/j.compbiomed.2022.105969

Published, Version of Record (VoR); CC BY V4.0; Open

Abstract

Bioinformatics

COVID-19

Machine learning

SARS-CoV-2

Viral genomics

Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but they typically lack sequence-level information for the virus. Studies which do obtain viral genome sequences are generally limited in time, location, and population scope. Retrospective meta-analyses require time-consuming data extraction from heterogeneous formats and are limited to publicly available reports. Fortuitously, a subset of GISAID, the global SARS-CoV-2 sequence repository, includes “patient status” metadata that can indicate whether a sequence record is associated with mild or severe disease. While GISAID lacks data on comorbidities relevant to severity, such as obesity and chronic disease, it does include metadata for age and sex to use as additional attributes in modeling. With these caveats, previous efforts have demonstrated that genotype-patient status models can be fit to GISAID data, particularly when country-of-origin is used as an additional feature. But are these models robust and biologically meaningful? This paper shows that, in fact, temporal and geographic biases in sequences submitted to GISAID, as well as the evolving pandemic response, particularly reduction in severe disease due to vaccination, create complex issues for model development and interpretation. This paper poses a potential solution: efficient mixed effects machine learning using GPBoost, treating country as a random effect group. Training and validation using temporally split GISAID data and emerging Omicron variants demonstrates that GPBoost models are more predictive of the impact of spike protein mutations on patient outcomes than fixed effect XGBoost, LightGBM, random forests, and elastic net logistic regression models. 
• We propose to use global SARS-CoV-2 sequence data to predict severe COVID-19 disease.
• Global sequence and patient outcome data vary greatly over time and between regions.
• Mixed effects machine learning with GPBoost outperforms fixed effect methods.
• Trained models can provide early warnings for risks of emerging viral variants.
• GPBoost modeling can also identify key mutations of potential clinical interest.
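
The abstract describes two data-preparation ideas that a small example may make more concrete: splitting records by time rather than at random, and keeping country as a separate grouping variable instead of an ordinary feature. The following is a minimal sketch under stated assumptions, not the authors' pipeline. The `Record` class, the helper names, the cutoff date, and the four toy records are all invented for illustration and are not GISAID data. The `groups` list is the kind of vector a mixed effects learner such as GPBoost would take as its random effect grouping; no GPBoost calls are shown here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """One hypothetical sequence record with patient-status metadata."""
    collected: date            # sample collection date
    country: str               # country of origin (random effect group)
    mutations: frozenset[str]  # spike protein mutations relative to a reference
    severe: bool               # True if patient status indicates severe disease


def temporal_split(records, cutoff):
    """Train on records collected before the cutoff, test on the rest."""
    train = [r for r in records if r.collected < cutoff]
    test = [r for r in records if r.collected >= cutoff]
    return train, test


def encode(records, vocabulary):
    """Return binary mutation features, labels, and country groups."""
    X = [[1 if m in r.mutations else 0 for m in vocabulary] for r in records]
    y = [int(r.severe) for r in records]
    groups = [r.country for r in records]
    return X, y, groups


# Toy records, invented for illustration only.
records = [
    Record(date(2021, 3, 1), "Country A", frozenset({"D614G"}), False),
    Record(date(2021, 6, 15), "Country B", frozenset({"D614G", "L452R"}), True),
    Record(date(2021, 9, 30), "Country A", frozenset({"D614G", "N501Y"}), False),
    Record(date(2021, 12, 20), "Country B",
           frozenset({"D614G", "N501Y", "G339D"}), False),
]

cutoff = date(2021, 11, 1)  # hypothetical cutoff, not taken from the paper
train, test = temporal_split(records, cutoff)

# Build the feature vocabulary from training data only, so that mutations
# first seen after the cutoff cannot leak into the training features.
vocabulary = sorted({m for r in train for m in r.mutations})

X_train, y_train, g_train = encode(train, vocabulary)
X_test, y_test, g_test = encode(test, vocabulary)
```

In this sketch, a mutation that appears only after the cutoff (here `G339D`) has no column in the test features, which mirrors the difficulty of predicting outcomes for emerging variants from earlier data.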

Metrics

6 Record Views

12 citations in Web of Science

12 citations in Scopus

Details

Title: Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning
Creators: Bahrad A. Sokhansanj (Drexel University); Gail L. Rosen
Publication Details: Computers in biology and medicine, v 149, 105969
Publisher: Elsevier
Resource Type: Journal article
Language: English
Academic Unit: Electrical and Computer Engineering
Web of Science ID: WOS:000877199500001
Scopus ID: 2-s2.0-85136539879
Other Identifier: 991019173852404721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Web of Science research areas: Biology; Computer Science, Interdisciplinary Applications; Engineering, Biomedical; Mathematical & Computational Biology
