Formal Privacy and Synthetic Data for the American Community Survey

2018

Written by: Michael H. Freiman, Amy D. Lauger, and Jerome P. Reiter

2017-09-FreimanLaugerReiter-ACSSynthesisPerturbation

Download Data Synthesis and Perturbation for the American Community Survey at the U.S. Census Bureau [PDF - 1.0MB]

This paper assesses an empirical measure of disclosure risk of synthetic demographic data generated using classification and regression trees. We synthesized a dataset with 50 implicates and tried to infer from the synthetic data the maximum income in the original dataset. If synthetic values were determined by drawing without noise from a leaf of the regression tree, then the maximum value across implicates was a very good estimate of the maximum value in the original dataset. If synthetic values were determined by drawing from the leaf with noise, then skewness in the incomes within the leaves led to substantial bias in the mean wage for the synthetic dataset. Furthermore, the maximum income could still be determined with unreasonable accuracy, estimable by the median of the maxima of the implicates, or in some cases by rescaling the maximum across all of the implicates. We conclude that this method of generating synthetic data does not adequately protect continuous variables such as income from reconstruction, at least not when many implicates are created.
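The no-noise case described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's procedure: it treats a single regression-tree leaf in isolation, assumes synthetic values are drawn with replacement from the original values in that leaf, and uses made-up log-normal incomes. The function names and the leaf size of 200 are hypothetical.

```python
import random


def synthesize_leaf(leaf_values, rng):
    """Draw one synthetic value per original record, with replacement,
    from the original values in the leaf (assumed no-noise synthesis)."""
    return [rng.choice(leaf_values) for _ in leaf_values]


def max_across_implicates(leaf_values, n_implicates, rng):
    """Largest synthetic value seen in any of the implicates."""
    return max(
        max(synthesize_leaf(leaf_values, rng)) for _ in range(n_implicates)
    )


rng = random.Random(0)

# Hypothetical skewed incomes for the records in one leaf (not real ACS data).
leaf = [round(rng.lognormvariate(11, 1)) for _ in range(200)]
true_max = max(leaf)

# A single implicate misses the original maximum fairly often, but the
# maximum over 50 implicates recovers it with near certainty.
est_one = max(synthesize_leaf(leaf, rng))
est_50 = max_across_implicates(leaf, 50, rng)

print("original max:          ", true_max)
print("max in one implicate:  ", est_one)
print("max over 50 implicates:", est_50)
```

Under these assumptions, one implicate of n draws omits the original maximum with probability (1 - 1/n)^n, roughly 0.37, so 50 independent implicates omit it with probability near e^-50. This matches the abstract's observation that the maximum across implicates is a very good estimate of the original maximum when no noise is added.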

Page Last Revised - October 28, 2021
