Trade-offs between privacy and utility in machine learning: methodological frameworks and real-world challenges in healthcare data

Yusi Wei

doi:10.17918/00011353

Data sharing plays a critical role in advancing innovation across domains, particularly in healthcare, where large-scale data enables improved analytics and machine learning (ML) applications. However, sharing data that contains personally identifiable information raises significant privacy concerns. Regulatory frameworks such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) require data de-identification prior to sharing, yet prior research shows that regulatory compliance alone does not fully prevent re-identification attacks. Additional privacy-preserving techniques, including anonymization and differential privacy, can strengthen protection but often introduce information loss that may degrade ML performance. Consequently, balancing privacy protection and data utility remains a critical challenge. This study investigates the trade-offs between privacy preservation and data utility in ML applications and proposes optimization-based frameworks for privacy-preserving data sharing. First, we develop a multi-objective optimization-based anonymization model that improves privacy protection while maintaining the analytical usefulness of the data. The model is designed to handle both numeric and categorical attributes and evaluates privacy risks under multiple attack scenarios. Building on this model, we further propose a closed-loop optimization framework that incorporates feedback from ML performance to iteratively refine anonymization strategies. This framework reflects real-world data-sharing processes in which data owners and data users interact, enabling adaptive adjustments that improve both privacy protection and ML effectiveness relative to the open-loop framework. The proposed methods are evaluated using healthcare data to examine how anonymization strategies affect both privacy risk and ML performance.
The analysis also identifies subpopulations that are particularly vulnerable to privacy attacks, including groups characterized by specific demographic and healthcare utilization patterns such as older adults, frequently hospitalized patients, and minority populations. These vulnerable groups may also play an important role in ML model performance, highlighting the need to carefully balance privacy protection and analytical value. Overall, this research provides methodological contributions to privacy-preserving data sharing by integrating optimization techniques with ML evaluation in a unified framework. The findings offer practical insights for healthcare organizations seeking to responsibly share data while maintaining both strong privacy protection and meaningful analytical utility. In addition, numerical results on datasets from other domains suggest that the proposed models have broader applicability and can potentially be extended to financial and governmental data-sharing scenarios in future research.
