Synthetic SIPP Data

ANNOUNCEMENT

Update 07/08/2024: We now have a new process in place! Please read the description and instructions below for accessing the SSB data and submitting validation requests. Please note that it might initially take longer than usual to address requests, as we work on setting up both previous and new SSB users in the new system.

Background on the SIPP Synthetic Beta

The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census synthesizes, or models, all the variables in a way that changes the record of each individual while preserving the underlying covariate relationships between the variables. In version 7.0, the underlying missing data pattern is also synthesized.

Nine SIPP panels (1984, 1990, 1991, 1992, 1993, 1996, 2001, 2004, and 2008) form the basis for the SSB, with a subset of variables available across all the panels selected for inclusion and harmonization of variable definitions across the years covered by the panels. Administrative data are added and some editing is done to correct for logical inconsistencies in the IRS and Social Security earnings and benefits data. Thus, the SSB is a particularly appealing data set for new SIPP users because little data preparation is needed. A complete list of variables included in SSB version 7.0 as well as details about the harmonization and editing are available in our Codebook.

The synthesis process involves estimating the joint distribution of all the variables in the data and taking random draws from this modeled distribution. These draws are then used to replace actual data values. This process is repeated multiple times to create a set of 4 files, which are called implicates. For more information on the statistical methods used to create the SSB and formulae for combining results across implicates, please see The Creation and Use of the SIPP Synthetic Beta.
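As a rough illustration of what combining results across implicates can look like, the sketch below applies one commonly used rule for partially synthetic data (Reiter, 2003): the point estimate is the average of the per-implicate estimates, and the total variance is the average within-implicate variance plus the between-implicate variance divided by the number of implicates. This is an assumption made for illustration only. The formulae that actually apply to the SSB are the ones given in The Creation and Use of the SIPP Synthetic Beta, and they may differ, for example in how missing data are handled. The function name and the input numbers are hypothetical.

```python
from statistics import mean, variance


def combine_implicates(estimates, variances):
    """Combine one estimate per implicate into a point estimate and total variance.

    Assumed rule (partially synthetic data, Reiter 2003), not necessarily
    the rule documented for the SSB:
        q_bar = mean of the per-implicate estimates
        b     = between-implicate sample variance of the estimates
        u_bar = mean of the per-implicate variance estimates
        T     = u_bar + b / m
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need one estimate and one variance per implicate, m >= 2")
    q_bar = mean(estimates)
    b = variance(estimates)  # sample variance, divisor m - 1
    u_bar = mean(variances)
    total_var = u_bar + b / m
    return q_bar, total_var


# Hypothetical regression coefficient estimated once on each of the four implicates.
point, total_var = combine_implicates(
    [2.0, 2.2, 1.9, 2.1],      # per-implicate estimates (made-up numbers)
    [0.04, 0.05, 0.04, 0.05],  # per-implicate variance estimates (made-up numbers)
)
std_error = total_var ** 0.5
```

The same analysis program is run unchanged on each implicate, and only the final estimates are combined.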

Before releasing these data, Census staff performed extensive testing for disclosure risk and determined that the probability of linking a record from the SSB to an actual person was negligible. Based on the conclusions of this testing, the Census Disclosure Review Board and their counterparts at IRS and SSA have approved these data for use by researchers working outside secure Census facilities.

Announcing Release of Version 7.0 December 2018

In Version 7.0, several minor changes have been made to the methodology for creating the SIPP Synthetic Beta (SSB) in the hope of making the file more flexible for users, easier to update, and even more protected from a disclosure perspective. The underlying concept is still the same: to use Sequential Regression Multivariate Imputation (SRMI) to replace all values of sensitive variables with draws from an estimated probability distribution to create multiple synthetic files from which proper variance measures can be obtained. In version 7.0, the incomplete data are synthesized, as is the missing data pattern. Where there are missing data, users can implement their preferred approach to handling missing data in their analysis. Tools and code will be provided to users to help implement a variety of statistically defensible methods for dealing with missing data.
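The idea of replacing values with draws from an estimated distribution can be shown with a deliberately small toy, sketched below. It is not the Census Bureau's SRMI implementation. It models a single hypothetical sensitive variable (`earnings`) with a normal linear regression on one covariate (`age`) that is held fixed for simplicity, then draws replacement values four times to produce four implicates. A real SRMI procedure cycles through many variables, each modeled conditional on the others, and a proper synthesis would also draw the model parameters from their posterior distribution; both steps are omitted here. All names and numbers are made up.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, plus residual sd."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = (sum(r * r for r in residuals) / (len(x) - 2)) ** 0.5
    return intercept, slope, sigma


def make_implicates(x, y, n_implicates=4, seed=12345):
    """Replace every y value with a draw from the fitted model, n_implicates times."""
    rng = random.Random(seed)
    intercept, slope, sigma = fit_line(x, y)
    return [
        [intercept + slope * xi + rng.gauss(0.0, sigma) for xi in x]
        for _ in range(n_implicates)
    ]


# Made-up data: age in years, earnings in thousands of dollars.
age = [25, 32, 41, 48, 56, 63]
earnings = [31.0, 38.5, 47.0, 52.5, 55.0, 58.0]
implicates = make_implicates(age, earnings)
```

Each implicate preserves the fitted relationship between the two variables on average while containing no original `earnings` value, which is the property the synthesis is meant to deliver.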

With this update, the administrative records coverage in the data now extends through 2014. Several variables have also been added to version 7.0 of the SSB. Information on the date of filing for the most recently accepted and most recently denied SSDI applications has been added. Synthesized linkages between mothers and children under 18 are also now part of the SSB. Stratum and half-sample variables have also been added to allow the user to construct replicate weights. Further, several variables available in past versions of the SSB have been updated or expanded. For example, version 7.0 features more categories for industry and occupation than were available in past versions of the SSB, and the SIPP state of residence variable is now disaggregated for all years.
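For readers unfamiliar with how stratum and half-sample variables can be turned into replicate weights, the sketch below uses Fay's variant of balanced repeated replication, which is one common choice for designs with two half-samples per stratum. It is an assumption for illustration, not the procedure prescribed for the SSB; the Codebook is the authority on the actual variable names and the recommended method. The record layout `(stratum, half_sample, base_weight)`, the function names, and the Fay factor of 0.5 are all hypothetical.

```python
def hadamard(order):
    """Sylvester construction of a Hadamard matrix; order must be a power of two."""
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def replicate_weights(records, fay_k=0.5):
    """Build Fay-BRR replicate weights.

    records: list of (stratum, half_sample, base_weight), half_sample in {1, 2}.
    Returns one list of weights per replicate, aligned with records.
    """
    strata = sorted({stratum for stratum, _, _ in records})
    order = 1
    while order < len(strata) + 1:  # need more columns than strata
        order *= 2
    h = hadamard(order)
    column = {s: i + 1 for i, s in enumerate(strata)}  # skip the constant column 0
    replicates = []
    for r in range(order):
        weights = []
        for stratum, half, base_weight in records:
            sign = h[r][column[stratum]]
            # +1 favors half-sample 1, -1 favors half-sample 2.
            favored = (sign == 1) == (half == 1)
            factor = (2 - fay_k) if favored else fay_k
            weights.append(base_weight * factor)
        replicates.append(weights)
    return replicates


# Two made-up strata, each with one record per half-sample.
records = [(1, 1, 10.0), (1, 2, 10.0), (2, 1, 20.0), (2, 2, 20.0)]
reps = replicate_weights(records)
```

Under this scheme, a variance estimate is typically formed from the squared deviations of the replicate estimates around the full-sample estimate, scaled by 1 / (R (1 - k)^2) for R replicates and Fay factor k.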

Announcing Release of Version 6.0 February 2015

Version 6.0 of the SSB allows for a longer time series of analysis in three ways. First, the SSB 6.0 includes two additional SIPP panels (1984 and 2008); second, the administrative W-2 earnings records now extend through 2011; and third, OASDI and SSI benefit information are available through 2012. Several new SIPP time series variables were added to version 6.0 of the SSB, including monthly indicators of receipt and amounts of AFDC or TANF benefits, workers’ compensation benefits, veterans’ compensation or pension benefits, and food stamp or SNAP benefits. Two additional SIPP point-in-time variables, a sample base weight and a life insurance policy indicator, are now available on the file. Also, the fertility history variables (e.g., first_admin_birthdate and last_admin_birthdate) have been improved by reconciling children’s administrative birth dates with those birth dates reported by the children’s mother. Finally, we have added many new variables on SSDI and SSI applications that include when applications were submitted, when and whether they were approved or denied, and a diagnosis group classifying the type of disability.

Announcing Release of Version 5.1 May 2013

Version 5.1 of the SSB incorporates modeling improvements and new SIPP variables that expand the scope of analyses that can be performed relative to version 5.0 (released in 2010). In particular, we have added the following SIPP monthly time-series variables: weeks with a job, weeks with pay, usual hours worked, survey-reported earnings, total personal income, any health insurance coverage, and employer-provided health insurance coverage. We have also added two often-requested variables: first, a categorical variable for state of residence at the beginning of the SIPP panel; second, an indicator for whether the individual was linked to administrative records via SSN or whether these records were imputed because no SSN was available. Finally, we have edited the administrative earnings variables prior to the data completion and synthesis process in order to modify some values that we determined to be clerical data errors.

How to Access the SSB

Researchers must submit an application to access the SSB data. The application requires contact information, a brief description of the project, and a list of variables to be used. File access will be approved or denied based only on the feasibility of the proposal, which is determined by evaluating whether the data necessary to conduct the analysis are included on the file. Census generally expects to be able to approve applications within five business days. To apply, please download the Application to use the SIPP Synthetic Beta File and email the completed form to [email protected]. Once the application is approved, Census staff will notify the user by email and provide account credentials to access the SSB via SecureFTP. The user will then correspond with Census staff to arrange a date to download the SSB data files. SecureFTP requests are processed twice a month, at the beginning and middle of each month. Download links are valid for only 48 hours and are not sent on Fridays or over the weekend.

When the user is ready to download the SSB data, they will click on the 48-hour link provided via email and log in to the SecureFTP account with the credentials previously provided. The user will then be able to download the SSB version 7.0 data files. There are four data files, each about 3-4 gigabytes in size, available for download in either SAS or Stata format.

If the user has difficulty accessing the SecureFTP credentials or the 48-hour link, they should email [email protected]. However, once the SSB files are downloaded to the user’s computing environment, Census Bureau staff cannot assist with troubleshooting or provide technical support for using the SSB files.

Analytic Validity of the SSB: Disclaimer

The data synthesis process employed by Census to protect the linked data from the risk of disclosing the identity of individuals is relatively new and substantially changes both the survey and administrative data. The intent of the modeling done as part of the synthesis is to preserve relationships among variables that are of interest to researchers while ensuring that personally identifiable information is not revealed to the data user. It is not feasible to ensure accuracy by comparing every relationship among SSB variables with the corresponding relationship in the underlying confidential micro-data. Hence, we strongly urge researchers not to publish results produced from the SSB without first requesting that Census validate these results with confidential data housed in a secure environment at the Census Bureau. Census will perform this validation free of charge for researchers with an approved application, as resources permit and according to the protocol established by the three agencies involved and outlined below. There is a limit on how much output can be validated per researcher or project, so validation requests are most appropriate for final results and outputs.

Without validation of results, Census, SSA, and IRS make no guarantee of the validity of the SSB for any research purpose.

Protocol for Validation of Results

Census will validate results obtained from the SSB on the internal, confidential version of these data (Completed Gold Standard Files). This service is available only to users who have an approved application and have received access to the SSB data. Users who wish to obtain validated results should follow the protocol outlined below.

Only SAS and Stata programs compatible with the program versions available on the Census Bureau internal computers (SAS 9.4 M8 and Stata/SE 18.0) will be accepted and analyzed with the confidential Completed Gold Standard data on Census Bureau internal computers. Researchers should follow the Census Bureau programming requirements described in SSB Validation Request Guidelines to ensure that the programs will successfully transfer to internal Census computers for validation. The code should follow Linux file conventions (i.e. folders separated by / not \). Researchers should plan to share and submit their results and programs from the synthetic data analysis with the Census Bureau.

After programs have successfully run without error on the synthetic data in the user’s local environment, researchers may request that the Census Bureau run these programs on the Completed Gold Standard File. Only programs that successfully run without error will be eligible to be run on the confidential data by Census staff. Any programs that produce errors on the Completed Gold Standard Files will be returned to users for correction. Researchers also need to ensure that their output follows the Census disclosure guidelines and standards (see the FSRDC Disclosure Avoidance Methods Handbook) and that estimates adhere to the rounding rules (see the DRB Rounding Rules Memo).

Once the program and output are set as specified in the instructions above, the user should complete the Clearance Request Memo documenting the requested output.

To initiate the request for validation on the confidential data, the user sends an email to [email protected] with the completed memo (Clearance Request Memo) attached.

Census staff will review the validation request. If it is feasible and the memo is complete (i.e., no missing information), Census staff will reply to the user with a new set of credentials to access the send-a-file link, where users can upload their programs and outputs.

Once an analysis has been repeated on the Completed Gold Standard File, the results will be reviewed by Census staff for disclosure concerns. Data products and output approved by Census staff will be released to the user via email. As part of our data usage agreement, we report the released data products to ORES/SSA and SOI/IRS to document the justification and need for the SSB data.

User send-a-file uploads for validation are processed twice a month, at the beginning and middle of each month.

The validation process can be accomplished in as little as 7 business days for simple results that are generated by clean code and have no disclosure issues. However, if the program/code does not run properly, the sample sizes are too small, or the researcher does not accurately fill out the disclosure memo, the process can take much longer. Census makes no guarantee on the length of time between submission of programs and the release of results from the confidential data.

For more information about the validation process, including advice on how to make the process go smoothly and quickly, please see SSB Validation Request Guidelines.
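The protocol above asks that submitted code follow Linux file conventions, with folders separated by / rather than \. A quick pre-submission check along the following lines may help catch Windows-style paths in SAS or Stata programs before upload. It is a crude heuristic written in Python for illustration, not a Census Bureau tool or requirement: it only looks at quoted strings, and it can raise false positives where a backslash is used for another purpose (for example, macro escaping in Stata). The function name and the sample program text are hypothetical.

```python
import re

# A quoted string (single or double quotes) that contains at least one backslash.
QUOTED_WITH_BACKSLASH = re.compile(r'''(["'])([^"'\n]*\\[^"'\n]*)\1''')


def find_backslash_paths(program_text):
    """Return (line_number, quoted_text) for quoted strings containing a backslash."""
    hits = []
    for line_no, line in enumerate(program_text.splitlines(), start=1):
        for match in QUOTED_WITH_BACKSLASH.finditer(line):
            hits.append((line_no, match.group(2)))
    return hits


# Made-up Stata-style program text: line 1 uses Windows separators, line 2 does not.
sample_program = (
    'use "C:\\projects\\ssb\\rep1.dta", clear\n'
    'use "projects/ssb/rep1.dta", clear\n'
)
flagged = find_backslash_paths(sample_program)
```

To scan a real program, read the file contents into a string and pass them to `find_backslash_paths`; any line it reports is worth checking by hand.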

The Future of the SSB and Feedback from Researchers

We are always interested in hearing from users about which variables they would like to see added to the file. Similarly, unexpected data patterns or variable values, from either SSB or Gold Standard results, should be reported to the Census SSB email address, [email protected], to help us continually improve the file.

We also welcome suggestions and advice to improve the process (e.g., the accessibility and validation) of the SSB program.

We request that researchers who publish results from analyses done using these data cite the SSB as their data source and acknowledge the support of Census staff in running the validation programs. These citations will help ensure continued funding for the environment and the creation of the Gold Standard File and the SSB.

Suggested acknowledgement:

"This analysis was first performed using the SIPP Synthetic Beta (SSB) available at the Census Bureau. These data are public use and may be accessed by researchers outside secure Census facilities. For more information, visit https://www.census.gov/programs-surveys/sipp/guidance/sipp-synthetic-beta-data-product.html. Final results for this paper were obtained from a validation analysis conducted by Census Bureau staff using the SIPP Completed Gold Standard Files and the programs written by this author and originally run on the SSB. The validation analysis does not imply endorsement by the Census Bureau of any methods, results, opinions, or views presented in this paper."

Data Citation:

U.S. Census Bureau. SIPP Synthetic Beta: Version 7.0 [Computer file]. Washington DC, 2018.

Further Questions

For questions about a specific project using the SIPP Synthetic Beta, please email [email protected].

The Census Bureau has reviewed this data product to ensure appropriate access, use, and disclosure avoidance protection of the confidential source data used to produce this product (Data Management System (DMS) number: P-6000562, Disclosure Review Board (DRB) approval number: CBDRB-FY18-479).

Related Materials

DRB Rounding Rules Memo [<1.0 MB]
The Creation and Use of the SIPP Synthetic Beta v7.0 (2018) [<1.0 MB]
The Creation and Use of the SIPP Synthetic Beta (2013) [<1.0 MB]
Disclosure Review Board Memo: Second Request for Release of SIPP Synthetic Beta Version 6.0 [<1.0 MB]

We are requesting the approval of the Census Disclosure Review Board (DRB) for the release of the SIPP Synthetic Beta (SSB) v6.0, produced by the Survey Improvement Research Branch (SIRB) of the Census Bureau's Social, Economic, and Housing Statistics Division. This data product is an update to the previously released SSB v5.1. In this memo we provide a brief review of the creation of the SSB and then describe our disclosure-risk analysis. From the results of this analysis, we conclude that the release of SSB 6.0 would not risk disclosing the identity of any SIPP respondent.
SSB Request Guidelines_May 2024 [<1.0 MB]
FSRDC Disclosure Avoidance Methods Handbook v.4 [<1.0 MB]
Clearance Request Memo [<1.0 MB]
SSB Codebook V7 [<1.0 MB]

Page Last Revised - July 15, 2025
