Rigorous curation in practice: high-fidelity protein side-chain modeling, benchmarking, and cross-assay affinity normalization

Xiyao Long

doi:10.17918/00011160

Modeling protein side-chain conformations is a long-standing subproblem in protein structure prediction. Studies of side chains date back to the 1980s and range from statistical analyses of side-chain conformations and energy functions to algorithms that decompose the side-chain interaction graph (e.g., SCWRL4). While overall protein-structure prediction has advanced, the state of side-chain prediction remains unclear: reported metrics vary and benchmark data quality is not always controlled, obscuring true method performance. In this work we curate and quality-control Protein Data Bank (PDB) structures to build a training and test corpus for high-fidelity side-chain modeling. We resolve atom-naming ambiguities and mis-rotations, filter residues lacking electron-density support, reconstruct crystal symmetries for feature extraction, and mask overly flexible regions. Critically, we partition the corpus by ECOD evolutionary groups so that training and evaluation share no hidden homology, providing a more realistic generalization test than sequence-identity filters alone. We then develop a compact geometric deep-learning model (Relation-Shape Convolution, RSConv) that represents each residue as a node with identity and local backbone geometry. Using only backbone coordinates in a per-residue N-Cα-C frame plus neighborhood context, the network predicts full side-chain conformations without rotamer sampling. On a held-out test set, we obtain χ₁ accuracy of 89% and χ₂ accuracy of 83% given correct χ₁ (both measured with a 40° tolerance). On a low-noise subset (B-factor < 30 Å², ~53% of residues), χ₁ accuracy rises to 93%. This rate is comparable to the χ₁ accuracy reported for AlphaFold2 when the backbone is highly accurate (lDDT-Cα of 100). Further, errors correlate strongly with B-factors and EDIA, indicating that current plateaus reflect input uncertainty and single-conformer labels more than model size.
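
The χ accuracy figures above depend on how angular error is counted. The following is a minimal sketch of one common way to compute such a metric, not the implementation used in this work. It assumes angles in degrees, an inclusive 40° tolerance, and that χ₂ is scored only on residues whose χ₁ is already correct. The names `angular_diff` and `chi_accuracy` are hypothetical, and the 180° symmetry of some terminal dihedrals (e.g., Asp χ₂, Phe χ₂) is deliberately ignored here.

```python
import math


def angular_diff(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = (pred_deg - true_deg) % 360.0
    return min(d, 360.0 - d)


def chi_accuracy(pred, true, tol=40.0):
    """Return (chi1 accuracy, chi2 accuracy given correct chi1).

    pred, true: sequences of (chi1, chi2) tuples in degrees, one per residue.
    chi2 is None for residues that have no chi2 angle.
    tol: tolerance in degrees (assumed inclusive in this sketch).
    """
    chi1_ok = chi1_total = chi2_ok = chi2_total = 0
    for (p1, p2), (t1, t2) in zip(pred, true):
        chi1_total += 1
        ok1 = angular_diff(p1, t1) <= tol
        chi1_ok += ok1
        # chi2 is only evaluated when chi1 is correct and chi2 exists.
        if ok1 and p2 is not None and t2 is not None:
            chi2_total += 1
            chi2_ok += angular_diff(p2, t2) <= tol
    chi1_acc = chi1_ok / chi1_total if chi1_total else math.nan
    chi2_acc = chi2_ok / chi2_total if chi2_total else math.nan
    return chi1_acc, chi2_acc


# Toy data (made up for illustration, not from the benchmark).
example_pred = [(-60.0, 180.0), (175.0, 60.0), (60.0, None), (-60.0, 90.0)]
example_true = [(-70.0, -175.0), (-170.0, -60.0), (180.0, None), (-65.0, 85.0)]
example_chi1, example_chi2 = chi_accuracy(example_pred, example_true)
```

The wrap-around in `angular_diff` matters because 180° and -175° are only 5° apart; a naive subtraction would count that pair as an error.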
Under matched evaluation conditions, performance approaches that of recent end-to-end predictors on high-confidence backbones, though direct cross-paper numbers are not strictly comparable. Finally, we normalize fragmented immunoglobulin-peptide affinity measurements by linearly calibrating log fold-changes from diverse assays to a single reference, yielding a cross-assay dataset suitable for mutational-effect modeling. Together, careful data stewardship, evolution-aware benchmarking, and judicious structural priors can deliver near-experimental all-atom accuracy without ever-larger models; principled cross-assay normalization likewise unlocks sparse biochemical measurements for learning-based predictors.
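
The linear calibration of log fold-changes mentioned above can be illustrated with a small sketch. It assumes, as an illustration only, that each assay shares at least two variants with the reference assay and that an ordinary least-squares line fitted on those shared variants maps the assay onto the reference scale. The function names and the variant labels are hypothetical, and the actual procedure in this work may differ.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("shared variants have identical values; cannot fit a line")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate_to_reference(assay, reference):
    """Map an assay's log fold-changes onto the reference assay's scale.

    assay, reference: dicts of variant label -> log fold-change.
    The line is fitted on variants measured in both assays (an assumption
    of this sketch) and then applied to every variant in `assay`.
    """
    shared = sorted(set(assay) & set(reference))
    if len(shared) < 2:
        raise ValueError("need at least two shared variants to calibrate")
    slope, intercept = fit_line(
        [assay[v] for v in shared], [reference[v] for v in shared]
    )
    return {v: slope * value + intercept for v, value in assay.items()}


# Toy data (made up): the assay reads 2 * reference + 1.
example_reference = {"A1G": 0.0, "A2G": 1.0, "A3G": 2.0}
example_assay = {"A1G": 1.0, "A2G": 3.0, "A3G": 5.0, "A4G": 7.0}
example_calibrated = calibrate_to_reference(example_assay, example_reference)
```

In this toy case the variant measured only in the second assay (`A4G`) is placed on the reference scale by the fitted line, which is the sense in which calibration lets sparse measurements from different assays be pooled.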

Drexel University