Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning
Journal article   Open access   Peer reviewed

Bahrad A. Sokhansanj and Gail L. Rosen
Computers in Biology and Medicine, vol. 149, 105969
Oct 2022
url
https://doi.org/10.1016/j.compbiomed.2022.105969
Published, Version of Record (VoR), CC BY 4.0, Open access

Abstract

Keywords: Bioinformatics; COVID-19; Machine learning; SARS-CoV-2; Viral genomics
Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but they typically lack sequence-level information for the virus. Studies that do obtain viral genome sequences are generally limited in time, location, and population scope. Retrospective meta-analyses require time-consuming data extraction from heterogeneous formats and are limited to publicly available reports. Fortuitously, a subset of GISAID, the global SARS-CoV-2 sequence repository, includes “patient status” metadata that can indicate whether a sequence record is associated with mild or severe disease. While GISAID lacks data on comorbidities relevant to severity, such as obesity and chronic disease, it does include metadata for age and sex to use as additional attributes in modeling. With these caveats, previous efforts have demonstrated that genotype-patient status models can be fit to GISAID data, particularly when country-of-origin is used as an additional feature. But are these models robust and biologically meaningful? This paper shows that, in fact, temporal and geographic biases in sequences submitted to GISAID, as well as the evolving pandemic response, particularly reduction in severe disease due to vaccination, create complex issues for model development and interpretation. This paper proposes a potential solution: efficient mixed effects machine learning using GPBoost, treating country as a random effect group. Training and validation using temporally split GISAID data and emerging Omicron variants demonstrate that GPBoost models are more predictive of the impact of spike protein mutations on patient outcomes than fixed effect XGBoost, LightGBM, random forests, and elastic net logistic regression models.
• We propose to use global SARS-CoV-2 sequence data to predict severe COVID-19 disease.
• Global sequence and patient outcome data vary greatly over time and between regions.
• Mixed effects machine learning with GPBoost outperforms fixed effect methods.
• Trained models can provide early warnings for risks of emerging viral variants.
• GPBoost modeling can also identify key mutations of potential clinical interest.
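
To make "treating country as a random effect group" concrete, the model structure can be sketched as follows. This is an illustrative formulation only: the binary severe/mild outcome follows the abstract, but the logit link, the symbols, and the single-variance Gaussian random intercept are assumptions and are not details given in this record.

```latex
\operatorname{logit}\, P(y_{ij} = 1 \mid x_{ij}, b_j) = F(x_{ij}) + b_j,
\qquad b_j \sim \mathcal{N}(0, \sigma^2)
```

Here y_ij = 1 marks severe disease for sequence record i from country j, x_ij collects the spike protein mutation features together with age and sex, F is the boosted tree ensemble, and b_j is a country-level random intercept. A fixed effect model would instead pass country to F as an ordinary input feature.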

Metrics

6 Record Views
12 citations in Scopus

Details

UN Sustainable Development Goals (SDGs)

This publication has contributed to the advancement of the following goals:

#3 Good Health and Well-Being

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Web of Science research areas
Biology
Computer Science, Interdisciplinary Applications
Engineering, Biomedical
Mathematical & Computational Biology