PriveTAB : Secure and Privacy-Preserving sharing of Tabular Data

dc.contributor.author: Kotal, Anantaa
dc.contributor.author: Piplai, Aritran
dc.contributor.author: Chukkapalli, Sai Sree Laya
dc.contributor.author: Joshi, Anupam
dc.date.accessioned: 2022-03-29T16:38:39Z
dc.date.available: 2022-03-29T16:38:39Z
dc.date.issued: 2022-04-24
dc.description: ACM International Workshop on Security and Privacy Analytics, April 24–27, 2022, Baltimore, MD, USA [en_US]
dc.description.abstract: Machine learning has increased our ability to model large quantities of data efficiently in a short time. Machine learning approaches in many application domains require collecting large volumes of data from distributed sources and combining them. However, sharing data from multiple sources raises privacy concerns. Privacy regulations like the European Union's General Data Protection Regulation (GDPR) have specific requirements on when and how such data can be shared. Even when there are no specific regulations, organizations may have concerns about revealing their data. For example, in cybersecurity, organizations are reluctant to share their network-related data to permit machine learning-based intrusion detectors to be built. This has, in particular, hampered academic research. We need an approach to make confidential data widely available for accurate data analysis without violating the privacy of the data subjects. Privacy in shared data has been discussed in prior work focusing on anonymization and encryption of data. An alternative approach to making data available for analysis without sharing sensitive information is to replace sensitive information with synthetic data that behave like the original data for all analytical purposes. Generative Adversarial Networks (GANs) are among the well-known models for generating synthetic samples that have the same distributional characteristics as the original data. However, modeling tabular data using a GAN is a non-trivial task. Tabular data contain a mix of categorical and continuous variables and require specialized constraints, as described in the CTGAN model. In this paper, we propose a framework to generate privacy-preserving synthetic data suitable for release for analytical purposes. The data is generated using the CTGAN approach, and so is analytically similar to the original dataset. To ensure that the generated data meet the privacy requirements, we use the principle of t-closeness. We ensure that the distribution of attributes in the released dataset is within a certain threshold distance from the real dataset. We also encrypt sensitive values in the final released version of the dataset to minimize information leakage. We show that in a variety of cases, models trained on this synthetic data instead of the real data perform nearly as well when tested on the real data. Specifically, we show that the machine learning models used for network event/attack recognition tasks do not have a significant loss in accuracy when trained on data generated from our framework in place of the real dataset. [en_US]
dc.description.uri: https://dl.acm.org/doi/10.1145/3510548.3519377 [en_US]
dc.format.extent: 11 pages [en_US]
dc.genre: conference papers and proceedings [en_US]
dc.genre: preprints [en_US]
dc.identifier: doi:10.13016/m2xuey-qhp9
dc.identifier.citation: Anantaa Kotal, Aritran Piplai, Sai Sree Laya Chukkapalli, and Anupam Joshi. 2022. PriveTAB : Secure and Privacy-Preserving sharing of Tabular Data. In Proceedings of the 2022 ACM International Workshop on Security and Privacy Analytics (IWSPA ’22), April 24–27, 2022, Baltimore, MD, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3510548.3519377 [en_US]
dc.identifier.uri: https://doi.org/10.1145/3510548.3519377
dc.identifier.uri: http://hdl.handle.net/11603/24451
dc.language.iso: en_US [en_US]
dc.publisher: ACM [en_US]
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Student Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. [en_US]
dc.subject: UMBC Ebiquity Research Group
dc.title: PriveTAB : Secure and Privacy-Preserving sharing of Tabular Data [en_US]
dc.type: Text [en_US]
dcterms.creator: https://orcid.org/0000-0002-8641-3193 [en_US]
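
The abstract states that the distribution of attributes in the released dataset is kept within a threshold distance of the real dataset. A minimal sketch of such a check for a single categorical attribute follows. It is an illustration only, not the authors' implementation: the function names, the sample values, and the choice of variational distance (half the L1 distance between the two frequency distributions) are assumptions, and the record does not say which distance measure the paper uses.

```python
from collections import Counter


def category_distribution(values):
    """Return the relative frequency of each category in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def variational_distance(p, q):
    """Half the L1 distance between two discrete distributions (dicts)."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def within_threshold(real_values, synthetic_values, t):
    """Check whether the synthetic attribute stays within distance t of the real one."""
    p = category_distribution(real_values)
    q = category_distribution(synthetic_values)
    return variational_distance(p, q) <= t


# Hypothetical attribute values, invented for illustration.
real_protocols = ["tcp", "tcp", "udp", "icmp"]
synthetic_protocols = ["tcp", "udp", "udp", "icmp"]

distance = variational_distance(
    category_distribution(real_protocols),
    category_distribution(synthetic_protocols),
)
print(f"distance = {distance:.2f}")
print("within t = 0.3:", within_threshold(real_protocols, synthetic_protocols, 0.3))
```

For continuous attributes, a different measure such as the Earth Mover's Distance would be a more natural fit; the threshold t is a release-policy parameter that the data owner would choose.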

Files

Original bundle

Name: 1135.pdf
Size: 1.03 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon to submission