
What to Expect: Disclosure Avoidance and the 2020 Census Demographic and Housing Characteristics File


On May 25, the U.S. Census Bureau is set to release the next wave of data from the 2020 Census providing new details about our nation and communities.

This next wave of data is called the Demographic and Housing Characteristics File (DHC). The characteristics include sex and age, race, Hispanic origin, relationship, group quarters, household and family type, tenure (whether the home is owned or rented) and vacancy. At the same time, we will also release the Demographic Profile, which provides an overview of the characteristics in one easy-to-reference table.

The granular detail of these characteristics, how they intersect with each other and their availability for small geographic areas (many down to the census block) require the Census Bureau to carefully protect the underlying response data. We don’t want anyone to be able to identify an individual by looking at or manipulating the data tables for a community.

To protect the confidentiality of respondent information, the Census Bureau has historically used a variety of statistical methods collectively known as disclosure avoidance. These methods have included not publishing certain tables or statistics if the underlying populations were too small, aggregating data to higher geographic levels and swapping entire households’ data across geographies. 

To keep pace with growing disclosure risks, we are now using a more sophisticated, mathematical approach that inserts a small amount of “noise” into the data to protect the confidentiality of individuals’ information. To learn more about this approach and why previous methods aren’t sufficient protection against new threats, view the blog, Modernizing Privacy Protections for the 2020 Census: Next Steps.

In this blog, we explain how the new approach will affect the next wave of 2020 Census results.

Statistical Noise in the Demographic Profile and DHC

We’re providing some metrics to help data users understand how much statistical noise is in the new data products. For now, we have metrics based on 2010 Census data. With the DHC release, we’ll have metrics for the new 2020 data.

Metrics based on 2010: For now, we can use 2010 Census data to help you anticipate how close the DHC results will likely be to the enumerated count following disclosure avoidance. We applied the same disclosure avoidance mechanism and settings we’re using for DHC on 2010 Census data.

From this effort, we released a suite of files in early April 2023 showing the anticipated impact of the 2020 Census disclosure avoidance system.

For example:

  • The published count of owner-occupied housing units would likely be within about two houses on average of the actual enumerated count at the census tract level. That means, for example, a tract with 725 owner-occupied houses may instead show 723, 724, 726 or 727 houses. Two is the average margin, not a limit: some tracts will have published counts within one of the collected value or matching it exactly, and other tracts will have published counts more than two away from it.

  • Renter-occupied housing units would likely be within the same amount and follow a similar pattern: If a tract had 530 renter-occupied units, it would show a count between 528 and 532 on average – that is, it could show 528, 529, 530, 531 or 532 renter-occupied units.

  • The count of 4-year-olds would likely be within about six people on average at the county level. The average county had about 1,923 4-year-olds; with this average amount of statistical noise, that county would show a count somewhere between 1,917 and 1,929.

Metrics for 2020: With the DHC release or shortly thereafter, we plan to provide similar metrics comparing the published DHC tables to the 2020 Census enumerated counts. However, given the sensitivity of comparing these 2020 numbers, we are only releasing two primary statistics:

  • Mean absolute error, which gives insight into accuracy. Like the examples above, it shows how close the published data point is to the enumerated count on average.

  • Mean error, which gives insight into statistical bias. It shows whether the published data point tends to go higher or lower than the enumerated count and by how much. We don’t allow negative numbers in the published tables, so small populations may have positive bias (that is, the published count tends to be slightly higher than the enumerated count).
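
To make the two statistics concrete, here is a minimal Python sketch of how they could be computed. The tract counts are invented for illustration and the function names are hypothetical; this is not the Census Bureau's code or data.

```python
def mean_absolute_error(published, enumerated):
    """Average size of the gap between published and enumerated counts."""
    if len(published) != len(enumerated):
        raise ValueError("both lists must cover the same geographies")
    return sum(abs(p - e) for p, e in zip(published, enumerated)) / len(enumerated)


def mean_error(published, enumerated):
    """Average signed gap; a positive value means published counts run high."""
    if len(published) != len(enumerated):
        raise ValueError("both lists must cover the same geographies")
    return sum(p - e for p, e in zip(published, enumerated)) / len(enumerated)


# Hypothetical owner-occupied unit counts for five tracts (not real data).
enumerated = [725, 530, 12, 0, 318]
published = [723, 531, 14, 1, 318]

mae = mean_absolute_error(published, enumerated)
me = mean_error(published, enumerated)
print(f"MAE = {mae:.1f}, ME = {me:+.1f}")
```

In this made-up example the published counts are off by 1.2 units on average, and the mean error of +0.4 indicates that they run slightly high on net.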

We plan to produce these metrics for many of the same characteristics and geographic levels as we did for each of the DHC demonstration data products we’ve released over the past few years. The metrics will give data users a measure of the noise added for disclosure avoidance and the best picture so far of the disclosure avoidance-related variability in the DHC. Users can also still rely on the existing 2010 metrics to get a fuller suite of metrics like percent error or counts of outliers as we work out the right way to calculate and protect the 2020 versions of those numbers.

One note: our disclosure avoidance methods add statistical noise to the person and housing unit counts separately. This may lead to implausible or improbable results, especially for lower-level geographies. For example, the data tables may show more homeowners of a particular race for a geography than people of that race.

Because of the way that noise was added and because larger geographies are less prone to these implausible results, we recommend the following:

  • Aggregate small geographies. For example, you can add census blocks together in the DHC to reduce noise. Or, instead of using census blocks, use higher-level geographies whenever possible. The disclosure avoidance system was tuned to support tracts, places, school districts and other higher-level geographic areas. These areas will have less noise than blocks.

  • Aggregate small populations. For example, you could add single-year age groups together to reduce noise.
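
One way to see why aggregation helps is a small simulation. The sketch below assumes, purely for illustration, that each block receives independent rounded Gaussian noise; the noise scale, block sizes and function names are hypothetical, and the actual disclosure avoidance system is more involved. Under that assumption, the noise in a sum grows more slowly than the sum itself, so the error relative to the count shrinks.

```python
import random


def add_noise(count, sigma, rng):
    """Return a non-negative count: the true count plus rounded Gaussian noise."""
    return max(0, count + round(rng.gauss(0, sigma)))


def relative_errors(block_counts, sigma, trials, seed=0):
    """Compare average relative error per block with that of the blocks' sum.

    Assumes every entry of block_counts is positive so the ratio is defined.
    """
    rng = random.Random(seed)
    total = sum(block_counts)
    block_err = 0.0
    agg_err = 0.0
    for _ in range(trials):
        noisy = [add_noise(c, sigma, rng) for c in block_counts]
        # Mean relative error across the individual blocks in this trial.
        block_err += sum(
            abs(n - c) / c for n, c in zip(noisy, block_counts)
        ) / len(block_counts)
        # Relative error of the aggregated (summed) count in this trial.
        agg_err += abs(sum(noisy) - total) / total
    return block_err / trials, agg_err / trials


# Fifty hypothetical blocks of 40 people each, with an assumed noise scale of 3.
blocks = [40] * 50
block_rel, agg_rel = relative_errors(blocks, sigma=3.0, trials=200)
print(f"block-level: {block_rel:.1%}, aggregated: {agg_rel:.1%}")
```

The same reasoning applies to combining single-year age groups: the combined count is larger, so a similar amount of noise matters less in relative terms.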

Also keep in mind that if you come across something unexpected in the DHC and Demographic Profile results, it may not always be a result of disclosure avoidance. Other possible reasons are explained on the What to Consider if You Find an Unexpected Census Result webpage.

More Technical Information to Come

More guidance on using the data, as well as subject definitions and information on 2020 Census data collection and confidentiality protections, is available in our 2020 Census Demographic and Housing Characteristics File Summary File (DHC) Technical Documentation.

If you would like to explore further the technical aspects of how disclosure avoidance affects the DHC data, stay tuned. We’ll have additional technical information soon about:

  • Noisy Measurement Files. In June, we plan to release the 2020 Redistricting (P.L. 94-171) Noisy Measurement File and the 2010 DHC Noisy Measurement File.

    o   A noisy measurement file is the intermediate output of the disclosure avoidance system. It’s what is created after the system adds statistical noise to a series of tabulations from the confidential data.

    o   The noise can result in internal and hierarchical inconsistencies (such as negative numbers or totals for counties that don’t add up to the state total). To correct these inconsistencies before publishing the tables, we complete a final step called “post-processing.” Researchers might want to explore alternative ways to post-process the data.

    o   By releasing the noisy measurements, we give researchers and data scientists the opportunity to independently process the files, complete analysis and assess the confidentiality protections. An example of noisy measurements is available in the 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement File (umich.edu).

  • Confidence Intervals. Another technical resource we are developing is confidence intervals for the published tables. Data users could use these intervals to adapt their own analyses, as needed.

    o   To create the intervals, we will simulate running the data through disclosure avoidance and post-processing many times. This will reveal a range for what any given query (such as the number of renter-occupied housing units) would likely be. Data users could have confidence that 90% of the time, the enumerated count falls within that range.

    o   Our goal is to make the confidence intervals available for every query at every level of geography in the DHC. As we continue our research, we’ll know more about the scope of the confidence intervals we can publish.

    o   We are currently testing this approach for creating the confidence intervals using 2010 Census data and plan to extend the research to 2020 Census data soon. We’re still working on the timeline for this research project and will keep the public informed on its progress.
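
The ideas in this list can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the rounded Gaussian noise, the naive post-processing rule (clip negatives, then rescale to a fixed parent total), the parameter values and the function names. The production system uses a far more sophisticated procedure, and the sketch only shows how published values spread around a known count under repeated simulation.

```python
import random


def noisy_measurement(true_counts, sigma, rng):
    """Add independent rounded Gaussian noise to each count (may go negative)."""
    return [c + round(rng.gauss(0, sigma)) for c in true_counts]


def post_process(noisy_counts, parent_total):
    """Naive post-processing: clip negatives, then rescale to the parent total.

    Integer arithmetic with largest-remainder rounding keeps every result a
    non-negative whole number and makes the results sum to parent_total.
    """
    clipped = [max(0, n) for n in noisy_counts]
    scale = sum(clipped)
    if scale == 0:
        # Nothing to rescale; fall back to an even split (an arbitrary choice).
        clipped = [1] * len(clipped)
        scale = len(clipped)
    result = [c * parent_total // scale for c in clipped]
    remainders = [c * parent_total % scale for c in clipped]
    shortfall = parent_total - sum(result)
    # Hand the leftover units to the cells with the largest remainders.
    order = sorted(range(len(result)), key=lambda i: remainders[i], reverse=True)
    for i in order[:shortfall]:
        result[i] += 1
    return result


def simulated_interval(true_counts, parent_total, index, sigma,
                       runs=2000, level=0.90, seed=0):
    """Re-run noise and post-processing many times; return a central range."""
    rng = random.Random(seed)
    values = sorted(
        post_process(noisy_measurement(true_counts, sigma, rng), parent_total)[index]
        for _ in range(runs)
    )
    lo = values[int((1 - level) / 2 * runs)]
    hi = values[int((1 + level) / 2 * runs) - 1]
    return lo, hi


# Hypothetical counts for four areas that should add up to a parent total.
true_counts = [1200, 45, 3, 0]
parent_total = sum(true_counts)
rng = random.Random(42)
noisy = noisy_measurement(true_counts, sigma=3.0, rng=rng)
published = post_process(noisy, parent_total)
lo, hi = simulated_interval(true_counts, parent_total, index=0, sigma=3.0)
print("noisy:", noisy, "published:", published, "interval for first area:", (lo, hi))
```

A researcher working with a noisy measurement file could swap in a different `post_process` rule and compare the resulting intervals, which is the kind of independent analysis the release is meant to support.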

We invite you to sign up for our 2020 Census data products newsletters to receive the updates.

Conclusion

We have worked closely with the data user community to develop the confidentiality protections for the DHC and Demographic Profile. As a result, we are disseminating data fit for a wide variety of uses. Our approach also fulfills our commitment and legal obligation to protect the confidentiality of individuals’ responses to the 2020 Census.

For more information about disclosure avoidance for the 2020 Census, visit the Disclosure Avoidance Modernization webpage and browse our series of briefs.

For more information about the DHC and Demographic Profile, including the content that will be available, visit the press kit and join us for a webinar May 16 at 1 p.m. EDT.

Page Last Revised - August 21, 2023