Human genomics; Long read sequencing; Quality control; Structural variation; Variant calling
Structural variants (SVs) are the largest source of variation in the human genome and are frequently associated with disease phenotypes. Thus, the identification and characterization of SVs are essential for understanding human genome structure and function. Long-read sequencing technologies such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences provide increased sensitivity over short reads and can resolve complex SVs at base-pair resolution. Widely used long-read SV callers, such as Sniffles2, cuteSV and PBSV, are limited in the size and complexity of SVs they can detect with high confidence, largely because they use limited alignment information. SVs identified with these tools are generally <50 kb in length, and therefore large, potentially disease-causing SVs may be overlooked. This dissertation presents the development of ContextSV, a novel computational method that overcomes these limitations and complements existing tools by combining long-read alignments with copy number predictions from a Hidden Markov Model (HMM). Our method enables the simultaneous analysis of SVs and single-nucleotide variants (SNVs) to provide a more comprehensive understanding of genomic variation. HMM copy number predictions are based on coverage and expected SNV allele frequencies, using ethnicity-specific variant allele frequency information from human population databases such as gnomAD. We demonstrate that ContextSV achieves performance comparable to that of major long-read SV callers, and we further highlight its unique advantages in identifying and classifying large copy number variants (CNVs) and in genotyping with high accuracy.
Additionally, in this dissertation I address the low precision that is common in long-read SV callsets: SV callers typically aim to maximize the true positive rate to avoid missing important SVs that may be rare or clinically relevant, but this comes at the cost of an increased false positive rate (decreased precision). To address this, I develop a novel machine learning-based method that assigns SV confidence scores based on important genomic context and alignment features. These scores can be used to filter false positives and increase precision in the final long-read SV callset.
Details
Title
Improved methods for detection and prioritization of structural variants from long-read sequencing data
Creators
Jonathan Elliot Perdomo
Contributors
Kai Wang (Advisor) - University of Pennsylvania
Ming Xiao (Advisor)
Awarding Institution
Drexel University
Degree Awarded
Doctor of Philosophy (Ph.D.)
Publisher
Drexel University; Philadelphia, Pennsylvania
Number of pages
viii, 82 pages
Resource Type
Dissertation
Language
English
Academic Unit
School of Biomedical Engineering, Science, and Health Systems (1997-2026); Drexel University