Rigorous curation in practice: high-fidelity protein side-chain modeling, benchmarking, and cross-assay affinity normalization
Xiyao Long
Doctor of Philosophy (Ph.D.), Drexel University
Aug 2025
DOI:
https://doi.org/10.17918/00011160
Files and links (1)
pdf
Long_Xiyao_2025 (10.55 MB)
PDF Embargoed Access, Embargo ends: 30 Sep 2027
Abstract
Data filtering; Protein structure prediction; Sidechain modeling; Chemical Engineering; Machine Learning
Modeling of protein side-chain conformations is a long-standing subproblem in protein structure prediction. Studies of side chains date back to the 1980s, ranging from statistical analyses of side-chain conformations and energy functions to algorithms that decompose the side-chain interaction graph (e.g., SCWRL4). While overall protein-structure prediction has advanced, side-chain prediction still requires further clarification: reported metrics vary and benchmark data quality is not always controlled, obscuring true method performance. In this work we curate and quality-control Protein Data Bank (PDB) structures to build a training and test corpus for high-fidelity side-chain modeling. We resolve atom-naming ambiguities and mis-rotations, filter residues lacking electron-density support, reconstruct crystal symmetries for feature extraction, and mask overly flexible regions. Critically, we partition the corpus by ECOD evolutionary groups so that training and evaluation share no hidden homology, providing a more realistic generalization test than sequence-identity filters alone. We then develop a compact geometric deep-learning model (Relation-Shape Convolution, RSConv) that represents each residue as a node with identity and local backbone geometry. Using only backbone coordinates in a per-residue N-Cα-C frame plus neighborhood context, the network predicts full side-chain conformations without rotamer sampling. On a held-out test set, we obtain χ₁ accuracy of 89% and χ₂ accuracy of 83% given correct χ₁ (both measured within 40°). On a low-noise subset (B < 30 Ų, ~53% of residues), χ₁ accuracy rises to 93%. This rate is comparable to the χ₁ accuracy reported for AlphaFold2 when the backbone is highly accurate (lDDT-Cα of 100). Further, errors correlate strongly with B-factors and EDIA, indicating that current plateaus reflect input uncertainty and single-conformer labels more than model size.
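The accuracy figures quoted above (χ₁ within 40°, and χ₂ within 40° counted only where χ₁ is already correct) can be illustrated with a minimal sketch. It is not the dissertation's evaluation code: the function names, the input layout, and the choice of an inclusive 40° tolerance are assumptions made for illustration, and the sketch ignores the 180° symmetry of some χ₂ angles (e.g., Asp, Phe, Tyr).

```python
import math


def angular_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def chi_accuracy(pred, true, tol=40.0):
    """Return (chi1 accuracy, chi2 accuracy given correct chi1).

    pred, true: sequences of (chi1, chi2) tuples in degrees, one per residue.
    chi2 is None for residues that have no chi2 angle.
    tol: tolerance in degrees; treating it as inclusive is an assumption.
    """
    chi1_ok = chi1_n = chi2_ok = chi2_n = 0
    for (p1, p2), (t1, t2) in zip(pred, true):
        chi1_n += 1
        if angular_diff(p1, t1) > tol:
            continue  # chi1 wrong: residue is excluded from the chi2 count
        chi1_ok += 1
        if p2 is None or t2 is None:
            continue  # no chi2 angle to score
        chi2_n += 1
        if angular_diff(p2, t2) <= tol:
            chi2_ok += 1
    acc1 = chi1_ok / chi1_n if chi1_n else math.nan
    acc2 = chi2_ok / chi2_n if chi2_n else math.nan
    return acc1, acc2
```

Angles are compared on the circle, so a prediction of 180° against a reference of -170° counts as a 10° error, not 350°.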
Under matched evaluation conditions, performance approaches that of recent end-to-end predictors on high-confidence backbones, though direct cross-paper numbers are not strictly comparable. Finally, we normalize fragmented immunoglobulin-peptide affinity measurements by linearly calibrating log fold-changes from diverse assays to a single reference, yielding a cross-assay dataset suitable for mutational-effect modeling. Together, careful data stewardship, evolution-aware benchmarking, and judicious structural priors can deliver near-experimental all-atom accuracy without ever-larger models; principled cross-assay normalization likewise unlocks sparse biochemical measurements for learning-based predictors.
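The cross-assay normalization described above, linearly calibrating log fold-changes from one assay to a single reference, might be sketched as follows. The abstract does not specify the fitting procedure, so this sketch assumes ordinary least squares on the variants measured in both assays. The function names and the variant labels in the usage note are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("shared measurements have no spread; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate_to_reference(assay, reference):
    """Map one assay's log fold-changes onto the reference assay's scale.

    assay, reference: dicts mapping a variant label to a log fold-change.
    The line is fitted only on variants present in both dicts (an assumption),
    then applied to every variant in `assay`.
    """
    shared = sorted(set(assay) & set(reference))
    if len(shared) < 2:
        raise ValueError("need at least two shared variants to fit a line")
    slope, intercept = fit_line(
        [assay[v] for v in shared], [reference[v] for v in shared]
    )
    return {v: slope * x + intercept for v, x in assay.items()}
```

For example, with a hypothetical reference `{"A1G": 0.0, "L2P": 1.0, "K3E": 2.0}` and assay `{"A1G": 1.0, "L2P": 3.0, "K3E": 5.0, "S4T": 7.0}`, the fitted line has slope 0.5 and intercept -0.5, so the assay-only variant `"S4T"` is placed at 3.0 on the reference scale.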
Details
Title
Rigorous curation in practice
Creators
Xiyao Long
Contributors
Roland L. Dunbrack Jr. (Advisor)
Xiaohua Hu (Advisor)
Awarding Institution
Drexel University
Degree Awarded
Doctor of Philosophy (Ph.D.)
Publisher
Drexel University; Philadelphia, Pennsylvania
Number of pages
87 pages
Resource Type
Dissertation
Language
English
Academic Unit
Information Science (Informatics) [Historical]; College of Computing and Informatics (2013-2026); Drexel University