ENABLING DATA PRIVACY THROUGH ANONYMIZATION IN CENTRALIZED AND DISTRIBUTED ENVIRONMENTS TO SECURELY SHARE NETWORK TRACE & HEALTHCARE DATA

Author/Creator

Author/Creator ORCID

Date

2024-01-01

Department

Information Systems

Program

Information Systems

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

Privacy is the right to control sensitive information and protect it from unauthorized access or disclosure. Anonymization protects privacy by removing or altering sensitive data so that the original information is difficult to recover. This dissertation investigates anonymization techniques in both centralized and distributed environments, with an emphasis on preserving privacy in two application domains: network trace data and healthcare data. Organizations collect vast amounts of network trace data for purposes such as network optimization and user behavior analysis, but are often hesitant to share it because of privacy concerns and the proprietary information it contains. Existing anonymization tools have significant shortcomings: they lack provable protection and rely heavily on parameter settings without offering adequate guidance on how to choose them. This dissertation proposes a self-adaptive and secure approach for sharing network trace data that maintains privacy by removing or obfuscating sensitive information. Additionally, we investigate network trace data anonymization in distributed environments. Organizations often rely on integrating data from multiple sites, which makes anonymization challenging because of the communication required between sites. This dissertation introduces two new methods for cluster-based distributed anonymization: one based on distributed coordinated anonymization and the other on top-down distributed anonymization. These methods enable each site to anonymize its data in a coordinated manner, so that the merged anonymized data can be analyzed centrally. Finally, this dissertation examines the anonymization and integration of healthcare data. Anonymizing healthcare data is essential for protecting privacy; it requires removing personal identifiers while still allowing distributed patient information to be integrated and aligned accurately.
To address the anonymization and integration challenge for distributed healthcare data, we introduce a novel approach that anonymizes distributed data with limited communication, followed by an integration process for subsequent analysis. This approach ensures consistency across sources so that anonymized data can be integrated directly without expensive procedures. A hash-function generator creates consistent noise from a locally generated seed, which also serves as a unique identifier for data integration. Our approaches overcome the limitations of existing anonymization tools by providing provable protection and automatically optimizing parameter settings. The proposed solutions support differential privacy, k-anonymity, and randomized response. Experimental evaluations demonstrate that the proposed techniques ensure privacy through anonymization, maintain data utility, and enable efficient integration of distributed anonymized data with minimal computational overhead.
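
The idea of hash-derived, consistent noise can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not say how the seed is generated, which hash function is used, or what noise distribution is applied. The names seed_to_uniform, consistent_laplace_noise, anonymize_record, and the join_key field are hypothetical, SHA-256 and Laplace-shaped noise are assumptions chosen only for illustration, and the sketch by itself makes no differential privacy claim.

```python
import hashlib
import math


def seed_to_uniform(seed: str, label: str) -> float:
    """Deterministically map (seed, label) to a float strictly inside (0, 1)."""
    digest = hashlib.sha256(f"{seed}|{label}".encode("utf-8")).digest()
    # Keep 52 bits so that value + 0.5 is exactly representable as a float.
    value = int.from_bytes(digest[:8], "big") >> 12
    return (value + 0.5) / 2**52


def consistent_laplace_noise(seed: str, field: str, scale: float) -> float:
    """Laplace-shaped noise that depends only on (seed, field, scale).

    Assumption: Laplace noise via inverse-CDF sampling; the source does not
    name the distribution.
    """
    centered = seed_to_uniform(seed, field) - 0.5
    sign = math.copysign(1.0, centered)
    return -scale * sign * math.log(1.0 - 2.0 * abs(centered))


def anonymize_record(seed: str, record: dict, scale: float) -> dict:
    """Perturb numeric fields and attach a seed-derived key for integration.

    Assumption: the join key is a hash of the seed, so sites never exchange
    the seed itself.
    """
    join_key = hashlib.sha256(f"{seed}|join".encode("utf-8")).hexdigest()
    noisy = {
        field: value + consistent_laplace_noise(seed, field, scale)
        for field, value in record.items()
    }
    return {"join_key": join_key, **noisy}


# Two hypothetical sites holding the same seed for the same patient.
site_a = anonymize_record("seed-123", {"age": 40.0, "systolic": 120.0}, scale=1.0)
site_b = anonymize_record("seed-123", {"age": 40.0}, scale=1.0)
print(site_a["join_key"] == site_b["join_key"], site_a["age"] == site_b["age"])
```

Because the noise is a pure function of the seed and the field name, both sites perturb the shared attribute identically without communicating, and the seed-derived key lets a central party align the anonymized records.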