Improved methods for detection and prioritization of structural variants from long-read sequencing data

Jonathan Elliot Perdomo

doi:10.17918/00011173

Dissertation

Open access

Doctor of Philosophy (Ph.D.), Drexel University

May 2025

DOI: https://doi.org/10.17918/00011173

Files and links (1)

Perdomo_Jonathan_2025 (PDF, 3.72 MB), Open Access (License Unspecified)

Abstract

Keywords: Human genomics; Long read sequencing; Quality control; Structural variation; Variant calling

Structural variants (SVs) are the largest source of variation in the human genome and are frequently associated with disease phenotypes. Identifying and characterizing SVs is therefore essential for understanding human genome structure and function. Long-read sequencing technologies such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) provide greater sensitivity than short reads and can resolve complex SVs at base-pair resolution. Widely used long-read SV callers, such as Sniffles2, cuteSV, and PBSV, are limited in the size and complexity of the SVs they can detect with high confidence, largely because they use limited alignment information. SVs identified with these tools are generally <50 kb in length, so large, potentially disease-causing SVs may be overlooked. This dissertation presents ContextSV, a novel computational method that overcomes these limitations and complements existing tools by combining long-read alignments with copy number predictions from a Hidden Markov Model (HMM). The method enables the simultaneous analysis of SVs and single-nucleotide variants (SNVs) to provide a more comprehensive understanding of genomic variation. HMM copy number predictions are based on coverage and expected SNV allele frequencies, using ethnicity-specific variant allele frequency information from human population databases such as gnomAD. We demonstrate that ContextSV achieves performance comparable to that of major long-read SV callers, and we further highlight its unique advantages in identifying and classifying large copy number variants (CNVs) and in genotyping with high accuracy.
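To make the idea of HMM-based copy number prediction concrete, the following is a minimal sketch of Viterbi decoding over copy-number states, using a coverage log2 ratio and a B-allele frequency at each heterozygous SNV site. It is not the ContextSV implementation: the three-state layout, the Gaussian emission model, the names `log_emission` and `viterbi`, and every numeric parameter are illustrative assumptions.

```python
import math

# Hypothetical toy model: states and parameters are illustrative assumptions,
# not values taken from ContextSV.
STATES = [1, 2, 3]  # copy number: deletion, normal diploid, duplication
EXPECTED_BAF = {1: [0.0, 1.0], 2: [0.5], 3: [1 / 3, 2 / 3]}
STAY_PROB = 0.9              # assumed probability of staying in the same state
SD_LRR, SD_BAF = 0.2, 0.05   # assumed noise in log2 ratio and allele frequency


def log_gauss(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def log_emission(cn, log2_ratio, baf):
    """Log-likelihood of one heterozygous SNV site under copy number `cn`."""
    cov = log_gauss(log2_ratio, math.log2(cn / 2), SD_LRR)
    # Allele frequency: equal-weight mixture around the expected allele fractions.
    peaks = EXPECTED_BAF[cn]
    mix = sum(math.exp(log_gauss(baf, p, SD_BAF)) for p in peaks) / len(peaks)
    return cov + math.log(mix + 1e-300)  # small constant guards against log(0)


def viterbi(observations):
    """Most likely copy-number path for a list of (log2_ratio, baf) pairs."""
    n = len(STATES)
    log_stay = math.log(STAY_PROB)
    log_move = math.log((1 - STAY_PROB) / (n - 1))
    # Uniform prior over the starting state.
    score = [math.log(1 / n) + log_emission(s, *observations[0]) for s in STATES]
    back = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for j, s in enumerate(STATES):
            best_i = max(
                range(n),
                key=lambda i: score[i] + (log_stay if i == j else log_move),
            )
            trans = log_stay if best_i == j else log_move
            new_score.append(score[best_i] + trans + log_emission(s, *obs))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return [STATES[i] for i in reversed(path)]


# Toy input: three diploid sites, three duplicated sites, two diploid sites.
sites = [(0.0, 0.5)] * 3 + [(0.58, 0.33), (0.58, 0.67), (0.58, 0.33)] + [(0.0, 0.5)] * 2
print(viterbi(sites))
```

In this sketch the coverage signal and the allele-frequency signal both contribute to each site's likelihood, so a duplication is supported by elevated coverage together with allele fractions near 1/3 or 2/3. Population-specific allele frequencies, as described in the abstract, are not modeled here.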
Additionally, this dissertation addresses the low precision that is common in long-read SV callsets. SV callers typically aim to maximize the true positive rate to avoid missing important SVs that may be rare or clinically relevant, but this comes at the cost of an increased false positive rate (decreased precision). To address this, I develop a novel machine learning-based method that assigns SV confidence scores based on genomic context and alignment features. These scores can be used to filter false positives and increase precision in the final long-read SV callset.
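The trade-off described above can be illustrated with a short sketch of score-based filtering. The calls, scores, and the helper name `precision_recall` are hypothetical, and the scores are made up rather than produced by the dissertation's model. The example only shows how raising a confidence threshold removes false positives at some cost in recall.

```python
def precision_recall(calls, threshold, total_true):
    """Precision and recall after keeping calls with score >= threshold.

    calls: list of (confidence_score, is_true_positive) pairs.
    total_true: number of true SVs in the benchmark (assumed known).
    """
    kept = [is_tp for score, is_tp in calls if score >= threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / total_true if total_true else 0.0
    return precision, recall


# Hypothetical callset: four true SVs and four false positives.
toy_calls = [
    (0.95, True), (0.90, True), (0.80, True), (0.60, False),
    (0.55, True), (0.30, False), (0.20, False), (0.10, False),
]

for threshold in (0.0, 0.5, 0.7):
    p, r = precision_recall(toy_calls, threshold, total_true=4)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, a moderate threshold removes most false positives without losing true calls, while a stricter threshold starts to discard true SVs as well.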

Details

Title: Improved methods for detection and prioritization of structural variants from long-read sequencing data
Creators: Jonathan Elliot Perdomo
Contributors: Kai Wang (Advisor) - University of Pennsylvania; Ming Xiao (Advisor)
Awarding Institution: Drexel University
Degree Awarded: Doctor of Philosophy (Ph.D.)
Publisher: Drexel University; Philadelphia, Pennsylvania
Number of pages: viii, 82 pages
Resource Type: Dissertation
Language: English
Academic Unit: School of Biomedical Engineering, Science, and Health Systems (1997-2026); Drexel University
Other Identifier: 991022093052504721
