Bayesian Analysis of Synthetic Data under Multiple Linear Regression, Multivariate Normal and Multivariate Regression Models

Author/Creator

Author/Creator ORCID

Date

2020-01-01

Department

Mathematics and Statistics

Program

Statistics

Citation of Original Publication

Rights

Distribution Rights granted to UMBC by the author.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu

Abstract

Statistical Disclosure Control (SDC) methods preserve the confidentiality of publicly released microdata without compromising their fundamental structure, so that the data still support adequate and accurate statistical analysis. The synthetic data approach is a popular form of SDC methodology in which all or part of the real data are not released, but are instead used to create synthetic data that are released. In this dissertation, we develop Bayesian inference based on singly or multiply imputed synthetic data when the original data are derived from one of the following models: multiple linear regression, multivariate normal, and multivariate regression. We assume that the synthetic data are generated by one of two methods: plug-in sampling, where the unknown parameters in the data model are set equal to the observed values of their point estimators based on the original data, and synthetic data are drawn from this estimated version of the model; and posterior predictive sampling, where an imputed posterior distribution of the unknown parameters is used to generate a posterior draw, which in turn is plugged into the original model to produce synthetic data. In the single imputation case, the procedures developed here fill a gap in the existing literature, where inferential methods are available only for multiple imputation; because they are based on exact distributions, they may be applied even when the sample size is small. Simulation results are presented to demonstrate how the proposed methodology performs compared to the theoretical predictions. We also outline some ways to extend the proposed methodology to certain scenarios where the required conditions do not hold.
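
The two generation schemes described in the abstract can be illustrated with a minimal sketch for the multiple linear regression case. This is not the dissertation's code. It assumes NumPy, the usual unbiased variance estimator RSS/(n - p) as the plug-in point estimator, and the standard noninformative prior p(beta, sigma^2) proportional to 1/sigma^2 for the posterior draw. The abstract specifies neither the estimators nor the prior, and the function names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit: returns beta_hat, residual sum of squares, (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    return beta_hat, rss, XtX_inv

def plug_in_sample(X, y, rng):
    """Plug-in sampling: fix the parameters at their point estimates,
    then draw a synthetic response from the estimated model."""
    n, p = X.shape
    beta_hat, rss, _ = fit_ols(X, y)
    sigma2_hat = rss / (n - p)  # assumed point estimator of sigma^2
    return X @ beta_hat + rng.normal(0.0, np.sqrt(sigma2_hat), size=n)

def posterior_predictive_sample(X, y, rng):
    """Posterior predictive sampling: draw (beta, sigma^2) from a posterior,
    then draw a synthetic response from the model at that draw.
    Assumed prior: p(beta, sigma^2) proportional to 1/sigma^2."""
    n, p = X.shape
    beta_hat, rss, XtX_inv = fit_ols(X, y)
    # Under the assumed prior: sigma^2 | data ~ RSS / chi-square(n - p)
    sigma2_star = rss / rng.chisquare(n - p)
    # beta | sigma^2, data ~ N(beta_hat, sigma^2 (X'X)^{-1})
    beta_star = rng.multivariate_normal(beta_hat, sigma2_star * XtX_inv)
    return X @ beta_star + rng.normal(0.0, np.sqrt(sigma2_star), size=n)

# Illustrative usage on simulated "original" data (values are made up).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=50)

v_plug_in = plug_in_sample(X, y, rng)                # one singly imputed data set
w_ppd = [posterior_predictive_sample(X, y, rng)      # m = 3 multiply imputed sets
         for _ in range(3)]
```

In both schemes only the synthetic responses would be released in place of y. Calling either function m times with independent draws gives multiply imputed synthetic data, while a single call corresponds to the single imputation case.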