Dissertation
Semantic data quality assessment: an investigation of fitness for use in large clinical datasets
Doctor of Philosophy (Ph.D.), Drexel University
Sep 2023
DOI: https://doi.org/10.17918/00001863
Abstract
Introduction: One of the foundations of research is that the underlying data are of sufficient quality to yield valid results. This is particularly important where research makes secondary use of data derived from electronic health records (EHR) or other clinical data, which are highly complex and heterogeneous. Current data quality assessment (DQA) standards are siloed and focus on structural evaluations of data quality that provide little or no information about data fitness for an intended use.
Objective: The research presented in this dissertation aims to codify and apply a DQA framework that is flexible, reproducible, and can be applied to assess fitness for use in large clinical datasets.
Methods: Qualitative methods were used to identify important features for a DQA framework and to elucidate barriers to implementation. Methods included literature review, content and gap analysis of current practice, semi-structured interviews, and an expert review panel to iterate versions of the framework. The framework connects theoretical constructs, specific check types, and quantitative methods for consistent interpretation of results. To evaluate the framework following its development, test cases were applied to real-world study designs. All outputs were evaluated for data quality utility and reproducibility.
Results: A framework was produced that incorporates the following strategies: (1) standardization and reproducibility of checks; (2) specificity of check terms with paired methods and output; (3) user-driven check specification; (4) accessible methods for both exploratory visualizations and outlier detection; (5) data quality utility that addresses bias or confounding in studies. To create interpretable results, a variety of quantitative analyses such as summary statistics, median absolute deviation, time series analyses, and inferential tests were selected, as well as visualizations such as bar graphs, run charts, treemaps, heat maps, and dot plots. The DQA modules selected for evaluation addressed different aspects of completeness of patient cohorts: (1) patient fact density; (2) medical complexity representation in cohort; (3) impacts of alternative eligibility definitions. All three identified anomalies in real-world data that could introduce significant bias into the study and that might otherwise have gone undetected or unaddressed.
Discussion: The results suggest that a fitness-for-use DQA framework is feasible for adoption and application across a range of studies. The features of the framework include both user input and an algorithmic backend that pairs broad data quality concepts (e.g., completeness) with guided specifications, analytic methods, and helpful visualizations. Application of the framework suggests that tools can be developed in the future to aid in reusability across a wide range of clinical studies.
Details
- Title: Semantic data quality assessment
- Creators: Hanieh Razzaghi
- Contributors: Jane Greenberg (Advisor)
- Awarding Institution: Drexel University
- Degree Awarded: Doctor of Philosophy (Ph.D.)
- Publisher: Drexel University; Philadelphia, Pennsylvania
- Number of pages: viii, 139 pages
- Resource Type: Dissertation
- Language: English
- Academic Unit: Information Science (Informatics) (2013-2026); College of Computing and Informatics (2013-2026); Drexel University
- Other Identifier: 991021229614204721