Privacy-Preserving Data Sharing Using Generative Models

Author/Creator

Author/Creator ORCID

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.

Abstract

The modern era heavily relies on data for decision-making in areas like cybersecurity, healthcare, and social sciences. However, this abundance of data raises significant privacy issues, leading to regulations such as GDPR and COPPA that are designed to protect individual privacy but also create barriers to data access for researchers. Organizations, both private and governmental, collect vast amounts of data, but sharing this data is often restricted, especially in specialized fields like cybersecurity and healthcare. These privacy concerns hinder collaboration, limit data sharing, and ultimately impede research progress, making it challenging for researchers to leverage the full potential of available data. Consequently, there is a critical need to make privately collected data available for public use without compromising individual privacy. Existing methods for balancing privacy and data utility, such as cryptographic techniques, noise addition, and distributed modeling, often fall short, either failing to provide strong privacy guarantees or sacrificing data usability. Synthetic data generation offers a promising alternative: it creates artificial datasets that mirror the statistical properties of the original data without exposing sensitive information. Generative models such as generative adversarial networks (GANs) can produce such synthetic data while preserving underlying patterns and protecting privacy. However, generating realistic tabular data in specialized domains like cybersecurity and healthcare remains challenging due to the complexity of the data and the scarcity of diverse training samples. This work advances privacy-preserving data generation using GANs that are bound by privacy constraints to produce shareable datasets. Our technique demonstrates that data generated in this manner can effectively replace original datasets for training machine learning models, with minimal accuracy loss when the trained models are tested on the original data.
We have also developed a novel model for data generation enhanced with domain-specific knowledge to improve the realism and accuracy of synthetic data. Furthermore, our prior research has shown how organizational policies constrain data sharing and how policy ambiguity complicates automatic enforcement. We developed synthetic data generation models that enforce privacy policies during the data generation process, ensuring compliance with regulations and reducing the risk of privacy breaches. By applying and validating these models across various domains, including cybersecurity and healthcare, we demonstrate their effectiveness in addressing privacy concerns while maintaining data utility. This research offers practical solutions for secure data sharing across different fields, advancing privacy-preserving data practices and supporting collaborative, data-driven research.
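The abstract does not specify the exact privacy mechanism that bounds the generator, but a widely used way to train a generative model under a formal privacy constraint is DP-SGD-style gradient sanitization: each training example's gradient is clipped to a fixed L2 norm, the clipped gradients are averaged, and calibrated Gaussian noise is added before the update. The sketch below (names such as `dp_sanitize` and the parameter values are illustrative assumptions, not the dissertation's implementation) shows that core step in plain Python:

```python
import math
import random

def l2_norm(v):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in v))

def dp_sanitize(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1, seed=0):
    """DP-SGD-style sanitization of one minibatch of per-example gradients.

    Illustrative sketch, not the dissertation's actual mechanism:
    1. Clip each example's gradient so its L2 norm is at most `clip_norm`,
       bounding any single record's influence on the update.
    2. Average the clipped gradients.
    3. Add Gaussian noise scaled by `noise_multiplier * clip_norm / batch_size`.
    """
    rng = random.Random(seed)
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])

    clipped = []
    for g in per_sample_grads:
        scale = min(1.0, clip_norm / max(l2_norm(g), 1e-12))
        clipped.append([x * scale for x in g])

    sigma = noise_multiplier * clip_norm / n
    return [
        sum(c[j] for c in clipped) / n + rng.gauss(0.0, sigma)
        for j in range(dim)
    ]

# With noise disabled, the effect of clipping alone is easy to check:
# a gradient of norm 5 is scaled down to norm 1, while a small one passes through.
grads = [[3.0, 4.0], [0.3, 0.4]]
update = dp_sanitize(grads, clip_norm=1.0, noise_multiplier=0.0)
# → [0.45, 0.6]  (mean of [0.6, 0.8] and [0.3, 0.4])
```

In a privacy-preserving GAN, a step like this would replace the discriminator's ordinary gradient update, since the discriminator is the only component that touches real records; the generator then inherits the privacy guarantee through post-processing.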