Formal Privacy and Synthetic Data for the American Community Survey

This paper assesses an empirical measure of the disclosure risk of synthetic demographic data generated using classification and regression trees. We synthesized a dataset with 50 implicates and attempted to infer the maximum income in the original dataset from the synthetic data. When synthetic values were drawn without noise from a leaf of the regression tree, the maximum value across implicates was a very good estimate of the maximum value in the original dataset. When synthetic values were drawn from the leaf with noise, skewness in the incomes within the leaves led to substantial bias in the mean wage for the synthetic dataset. Furthermore, the maximum income could still be estimated with unreasonable accuracy, either from the median of the maxima of the implicates or, in some cases, by rescaling the maximum across all of the implicates. We conclude that this method of generating synthetic data does not adequately protect continuous variables such as income from reconstruction, at least not when many implicates are created.
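The noise-free case can be illustrated with a minimal sketch. It is not the paper's implementation: the incomes are simulated from a hypothetical lognormal distribution, the regression-tree leaves are stood in for by four fixed groups of sorted records rather than a fitted tree, and the names `synthesize_implicate` and `max_across_implicates` are invented for this example. Only the count of 50 implicates comes from the abstract.

```python
import random


def synthesize_implicate(leaves, rng):
    """Build one implicate: resample every leaf with replacement, adding no noise."""
    return [rng.choice(leaf) for leaf in leaves for _ in leaf]


def max_across_implicates(leaves, n_implicates, rng):
    """Attacker's estimate: the largest synthetic value seen in any implicate."""
    return max(
        max(synthesize_implicate(leaves, rng)) for _ in range(n_implicates)
    )


rng = random.Random(0)

# Assumption: 100 skewed incomes, simulated only for illustration.
incomes = sorted(rng.lognormvariate(10.5, 0.8) for _ in range(100))

# Assumption: four equal-sized groups play the role of regression-tree leaves.
leaves = [incomes[i:i + 25] for i in range(0, 100, 25)]

true_max = max(incomes)
recovered = max_across_implicates(leaves, n_implicates=50, rng=rng)

print(f"true maximum:      {true_max:,.0f}")
print(f"recovered maximum: {recovered:,.0f}")
```

Because values are copied from the leaf without noise, the recovered maximum can never exceed the true one. In this setup a single implicate misses the top record with probability (24/25)^25, roughly 0.36, so the chance that all 50 implicates miss it is negligible.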

Page Last Revised - October 28, 2021