A Cluster-based Approach for Distributed Anonymization of Vertically Partitioned Data
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Abstract
In modern organizations, data is often spread across different sites, which complicates effective analysis. A typical solution is to transfer the data to a centralized server, but doing so may jeopardize privacy and leak sensitive or proprietary information. As a result, many organizations hesitate to adopt this solution, even though it would let them fully utilize and analyze the data and gain valuable insights for problem-solving and decision-making. Current approaches to this issue mainly concentrate on distributed privacy-preserving data analysis, in which the analysis is carried out across the sites so that data never has to leave any of them. However, such methods often incur substantial computational and communication overhead every time an analysis is applied to the dataset. This paper focuses on distributed anonymization, where distributed data is anonymized at each site in a coordinated way. The anonymized data is then merged and sent to a centralized server, where it is analyzed. The paper introduces two new approaches to cluster-based distributed anonymization of vertically partitioned data (that is, each site holds a subset of the features): one based on distributed coordinated anonymization and the other based on top-down distributed anonymization. The benefit of these approaches is that the anonymization overhead is incurred only once at each site, and subsequent analyses incur no extra anonymization cost. Experimental results show that the proposed methods preserve the privacy of distributed data with very minor loss of utility in the anonymized data and impose little computational overhead.
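To make the setting concrete, the following is a minimal illustrative sketch of cluster-based anonymization over vertically partitioned data. It is not the paper's algorithm, which the abstract does not specify. It assumes two hypothetical sites that hold different features for the same record IDs, a naive clustering rule (sort by one local attribute and chunk into groups of at least k records), and generalization of each numeric value to its cluster's (min, max) range. All function names, attribute names, and data values are invented for illustration.

```python
from collections import defaultdict


def assign_clusters(site_records, key, k):
    """Sort record IDs by one local attribute and chunk them into clusters
    of at least k records (the last cluster absorbs any remainder).
    Assumption: a stand-in for whatever coordinated clustering is really used."""
    ids = sorted(site_records, key=lambda rid: site_records[rid][key])
    if len(ids) < k:
        raise ValueError("need at least k records")
    n_clusters = len(ids) // k
    return {rid: min(pos // k, n_clusters - 1) for pos, rid in enumerate(ids)}


def generalize_site(site_records, clusters):
    """Run locally at one site: replace each numeric value with the
    (min, max) range of that attribute within the record's cluster."""
    members = defaultdict(list)
    for rid, cid in clusters.items():
        members[cid].append(rid)
    anonymized = {}
    for rids in members.values():
        attrs = site_records[rids[0]].keys()
        ranges = {
            a: (min(site_records[r][a] for r in rids),
                max(site_records[r][a] for r in rids))
            for a in attrs
        }
        for r in rids:
            anonymized[r] = dict(ranges)
    return anonymized


def merge_sites(*anonymized_sites):
    """Join the already-anonymized vertical partitions on record ID."""
    merged = defaultdict(dict)
    for site in anonymized_sites:
        for rid, row in site.items():
            merged[rid].update(row)
    return dict(merged)


def is_k_anonymous(merged, k):
    """Check that every combination of generalized values occurs >= k times."""
    counts = defaultdict(int)
    for row in merged.values():
        counts[tuple(sorted(row.items()))] += 1
    return all(c >= k for c in counts.values())


# Hypothetical vertical partitions: same record IDs, different features.
site_a = {1: {"age": 23}, 2: {"age": 25}, 3: {"age": 31},
          4: {"age": 37}, 5: {"age": 52}}
site_b = {1: {"zip": 21201}, 2: {"zip": 21250}, 3: {"zip": 21044},
          4: {"zip": 21042}, 5: {"zip": 21075}}

k = 2
clusters = assign_clusters(site_a, "age", k)   # only IDs and labels are shared
merged = merge_sites(generalize_site(site_a, clusters),
                     generalize_site(site_b, clusters))
print(is_k_anonymous(merged, k))
```

In this sketch only record IDs and cluster labels cross site boundaries, and each site generalizes its own columns once before the merge, which mirrors the one-time anonymization cost described in the abstract. Deriving the clusters from a single site's attribute is a simplification made here for brevity; sharing those labels can itself reveal ordering information, which a real protocol would need to address.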