Synthetic SIPP Data

ANNOUNCEMENT

We are currently in the process of determining the best home for the SIPP Synthetic Beta, taking into account various technical, funding, and process-related issues.  We will post updates as they become available.

Background on the SIPP Synthetic Beta

The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA, and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census synthesizes, or models, all the variables in a way that changes the record of each individual while preserving the underlying covariate relationships between the variables. In version 7.0, the underlying missing data pattern is also synthesized.

Nine SIPP panels (1984, 1990, 1991, 1992, 1993, 1996, 2001, 2004, and 2008) form the basis for the SSB. A subset of variables available across all the panels was selected for inclusion, and variable definitions were harmonized across the years covered by the panels. Administrative data are added and some editing is done to correct for logical inconsistencies in the IRS and Social Security earnings and benefits data. Thus, the SSB is a particularly appealing data set for new SIPP users because little data preparation is needed. A complete list of variables included in SSB version 7.0, as well as details about the harmonization and editing, is available in our Codebook.

The synthesis process involves estimating the joint distribution of all the variables in the data and taking random draws from this modeled distribution. These draws are then used to replace actual data values. This process is repeated multiple times to create a set of four files, which are called implicates. For more information on the statistical methods used to create the SSB and formulae for combining results across implicates, please see The Creation and Use of the SIPP Synthetic Beta.
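
As a rough illustration of what combining results across implicates can look like, the sketch below averages the per-implicate estimates and pools their variances. It assumes the standard combining rule for partially synthetic data (total variance = average within-implicate variance + between-implicate variance / m). The formulae in The Creation and Use of the SIPP Synthetic Beta may differ and should take precedence. The function name and the input numbers are hypothetical.

```python
from statistics import mean, variance


def combine_implicates(estimates, variances):
    """Pool one estimate and one variance per implicate into a single result.

    Assumption: the partially synthetic combining rule T = u_bar + b / m.
    This is an illustrative rule, not necessarily the one used for the SSB.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from at least two implicates")
    q_bar = mean(estimates)      # combined point estimate
    b = variance(estimates)      # between-implicate variance (divisor m - 1)
    u_bar = mean(variances)      # average within-implicate variance
    return q_bar, u_bar + b / m


# Hypothetical regression coefficient and squared standard error
# estimated separately on each of the four implicates.
estimate, total_variance = combine_implicates(
    [0.52, 0.48, 0.50, 0.54],
    [0.0004, 0.0005, 0.0004, 0.0005],
)
print(estimate, total_variance ** 0.5)
```

The same analysis is run once per implicate, and only the resulting estimates and variances are pooled; the implicates themselves are never stacked into one file.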

Before releasing these data, Census staff performed extensive testing for disclosure risk and determined that the probability of linking a record from the SSB to an actual person was negligible. Based on the conclusions of this testing, the Census Disclosure Review Board and their counterparts at IRS and SSA have approved these data for use by researchers working outside secure Census facilities.

Announcing Release of Version 7.0 December 2018

In Version 7.0, several minor changes have been made to the methodology for creating the SIPP Synthetic Beta (SSB) in the hope of making the file more flexible for users, easier to update, and even more protected from a disclosure perspective. The underlying concept is still the same: to use Sequential Regression Multivariate Imputation (SRMI) to replace all values of sensitive variables with draws from an estimated probability distribution to create multiple synthetic files from which proper variance measures can be obtained. In version 7.0, the incomplete data are synthesized, as is the missing data pattern. Where there are missing data, users can implement their preferred approach to handling missing data in their analysis. Tools and code will be provided to the users to help implement a variety of statistically defensible methods for dealing with missing data.

With this update, the administrative records coverage in the data now extends through 2014.  Several variables have also been added to version 7.0 of the SSB.  Information on the date of filing for the most recently accepted and most recently denied SSDI applications has been added.  Synthesized linkages between mothers and children under 18 are also now part of the SSB.  Stratum and half-sample variables have also been added to allow the user to construct replicate weights.  Further, several variables available in past versions of the SSB have been updated or expanded.  For example, version 7.0 features more categories for industry and occupation than were available in past versions of the SSB, and the SIPP state of residence variable is now disaggregated for all years.
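
The page does not say which replication method the stratum and half-sample variables are meant for. One common choice is balanced repeated replication with Fay's adjustment, sketched below under several assumptions: strata are numbered from 1, half-samples are coded 1 or 2, and the Fay factor k is 0.5. All function names, codings, and numbers here are hypothetical and should be checked against the Codebook.

```python
def hadamard(order):
    """Build a Sylvester Hadamard matrix; order must be a power of two."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def replicate_factors(stratum, half_sample, n_replicates, k=0.5):
    """Return one weight multiplier per replicate for a single record.

    Assumes strata numbered 1..n_replicates-1 and half-samples coded 1 or 2.
    Column 0 of the matrix is all ones, so it is not assigned to a stratum.
    """
    if not 1 <= stratum < n_replicates:
        raise ValueError("stratum must be between 1 and n_replicates - 1")
    if half_sample not in (1, 2):
        raise ValueError("half_sample must be 1 or 2")
    h = hadamard(n_replicates)
    factors = []
    for r in range(n_replicates):
        # +1 selects half-sample 1 in this replicate, -1 selects half-sample 2.
        selected = (h[r][stratum] == 1) == (half_sample == 1)
        factors.append(2 - k if selected else k)
    return factors


def fay_variance(full_estimate, replicate_estimates, k=0.5):
    """Fay's BRR variance: squared deviations scaled by 1 / (R * (1 - k)^2)."""
    r = len(replicate_estimates)
    return sum((e - full_estimate) ** 2 for e in replicate_estimates) / (r * (1 - k) ** 2)


# Hypothetical record: base weight 1000 in stratum 3, half-sample 2, 8 replicates.
base_weight = 1000.0
weights = [base_weight * f for f in replicate_factors(3, 2, 8)]
print(weights)
```

Each replicate weight is the base weight multiplied by its factor. The statistic of interest is then recomputed once per replicate, and the spread of those replicate estimates around the full-sample estimate gives the variance.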

Announcing Release of Version 6.0 February 2015

Version 6.0 of the SSB allows for a longer time series of analysis in three ways. First, the SSB 6.0 includes two additional SIPP panels (1984 and 2008); second, the administrative W-2 earnings records now extend through 2011; and third, OASDI and SSI benefit information is available through 2012. Several new SIPP time series variables were added to version 6.0 of the SSB, including monthly indicators of receipt and amounts of AFDC or TANF benefits, workers’ compensation benefits, veterans’ compensation or pension benefits, and food stamp or SNAP benefits. Two additional SIPP point-in-time variables, a sample base weight and a life insurance policy indicator, are now available on the file. Also, the fertility history variables (e.g., first_admin_birthdate and last_admin_birthdate) have been improved by reconciling children’s administrative birth dates with the birth dates reported by the children’s mothers. Finally, we have added many new variables on SSDI and SSI applications that include when applications were submitted, when and whether they were approved or denied, and a diagnosis group classifying the type of disability.

Version 5.1 Release May 2013

Version 5.1 of the SSB incorporates modeling improvements and new SIPP variables that expand the scope of analyses that can be performed relative to version 5.0 (released in 2010). In particular, we have added the following SIPP monthly time-series variables: weeks with a job, weeks with pay, usual hours worked, survey-reported earnings, total personal income, any health insurance coverage, and employer-provided health insurance coverage. We have also added two often-requested variables: first, a categorical variable for state of residence at the beginning of the SIPP panel; second, an indicator for whether the individual was linked to administrative records via SSN or whether these records were imputed because no SSN was available. Finally, we have edited the administrative earnings variables prior to the data completion and synthesis process in order to modify some values that we determined to be clerical data errors.

How to Access the SSB

Researchers must submit an application to use the Synthetic Data Server. The application requires contact information, a brief description of the project, and a list of variables to be used. File access will be approved or denied based only on the feasibility of the proposal, which is determined by evaluating whether the data necessary to conduct the analysis are included on the file. Census generally expects to be able to approve applications within five business days. To apply, please submit "Application to use the SIPP Synthetic Beta File" to sehsd.synthetic.data.use.list@census.gov.

The SSB is housed on the Synthetic Data Server (SDS) at the Virtual Research Data Center at Cornell University. A free account on the SDS is created for each approved user. Using this account, researchers may run SAS and Stata programs using SSB data. Census staff will email program and log files to researchers upon request and without disclosure review since these data are public-use. Users will receive instructions about accessing their account for the first time once their application is approved.

Analytic Validity of the SSB: Disclaimer

The data synthesis process employed by Census to protect the linked data from the risk of disclosing the identity of individuals is relatively new and substantially changes both the survey and administrative data. The intent of the modeling done as part of the synthesis is to preserve relationships among variables that are of interest to researchers while ensuring that personally identifiable information is not revealed to the data user. It is not feasible to ensure accuracy by comparing every relationship among SSB variables with the corresponding relationship in the underlying confidential micro-data. Hence, we strongly urge researchers not to publish results produced from the SSB without first requesting that Census validate these results with confidential data housed in a secure environment at the Census Bureau. Census will perform this validation free of charge to researchers, as resources permit and according to the protocol established by the three agencies involved and outlined below.

Without validation of results, Census, SSA, and IRS make no guarantee of the validity of the SSB for any research purpose.

Protocol for Validation of Results

Census will validate results obtained from the SSB on the internal, confidential version of these data (Completed Gold Standard Files). Users who wish to obtain validated results should follow the protocol outlined here.

  • The restricted access site will provide SAS and Stata analysis software and a computing environment similar to the one used to analyze the confidential Completed Gold Standard data on Census Bureau internal computers. Researchers should follow the Census Bureau programming requirements described in SSB Validation Request Guidelines to ensure that the programs will successfully transfer to internal Census computers for validation. Researchers should plan to share their results and programs from the synthetic data analysis with the Census Bureau.
  • After programs have successfully run without error on the synthetic data, researchers may request that Census run these programs on the Completed Gold Standard Files. Only programs successfully run without error on the SDS will be eligible to be run on the confidential data by Census staff. Any programs that produce errors on the Completed Gold Standard Files will be returned to users for correction.  Researchers also need to ensure that their output follows the Census disclosure review rounding rules (see DRB Rounding Rules Memo).
  • Once an analysis has been repeated on the Completed Gold Standard File, the results will be reviewed by Census staff for disclosure concerns. Researchers should familiarize themselves with standard Census disclosure rules for outside projects (see the FSRDC Disclosure Avoidance Methods Handbook) and should fill out the appropriate memo documenting the requested output (see FSRDC Clearance Request Memo). Data products and output approved by Census staff will be released to the users, ORES/SSA, and SOI/IRS.
  • The validation process can be accomplished in as little as one week for simple results that are generated by clean code and have no disclosure issues. However, if the code does not run properly, the sample sizes are too small, or the researcher does not accurately fill out the disclosure memo, the process can take much longer. Census makes no guarantee on the length of time between submission of programs and the release of results from the confidential data.
  • For more information about the validation process, including advice on how to make the process go smoothly and quickly, please see SSB Validation Request Guidelines.

The Future of the SSB and Feedback from Researchers

We are always interested in hearing from users about which variables they would like to see added to the file. Similarly, unexpected data patterns or variable values, from either SSB or Gold Standard results, should be reported to Census using our SSB email address, sehsd.synthetic.data.use.list@census.gov, in order to help us continually improve the file.

We request that researchers who publish results from analyses done using these data cite the SSB as their data source and acknowledge the use of the SDS server at Cornell and the support of Census staff in running any validation programs. These citations will help ensure continued funding for the SDS server and the creation of the Gold Standard File and the SSB.

Suggested acknowledgement:

"This analysis was first performed using the SIPP Synthetic Beta (SSB) on the Synthetic Data Server housed at Cornell University, which is funded by NSF Grant #SES-1042181. These data are public use and may be accessed by researchers outside secure Census facilities. For more information, visit //www.census.gov/programs-surveys/sipp/methodology/sipp-synthetic-beta-data-product.html. Final results for this paper were obtained from a validation analysis conducted by Census Bureau staff using the SIPP Completed Gold Standard Files and the programs written by this author and originally run on the SSB. The validation analysis does not imply endorsement by the Census Bureau of any methods, results, opinions, or views presented in this paper."

Data Citation:

U.S. Census Bureau. SIPP Synthetic Beta: Version 6.0 [Computer file]. Washington DC; Cornell University, Synthetic Data Server [distributor], Ithaca, NY, 2015.

Further Questions

For questions about a specific project using the SIPP Synthetic Beta, please email sehsd.synthetic.data.use.list@census.gov.

Related Materials

Disclosure Review Board Memo: Second Request for Release of SIPP Synthetic Beta Version 6.0 (PDF, <1.0 MB)
We are requesting the approval of the Census Disclosure Review Board (DRB) for the release of the SIPP Synthetic Beta (SSB) v6.0, produced by the Survey Improvement Research Branch (SIRB) of the Census Bureau’s Social, Economic, and Housing Statistics Division. This data product is an update to the previously released SSB v5.1. In this memo we provide a brief review of the creation of the SSB and then describe our disclosure-risk analysis. From the results of this analysis, we conclude that the release of SSB 6.0 would not risk disclosing the identity of any SIPP respondent.
Page Last Revised - February 6, 2024