The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA, and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census synthesizes, or models, all the variables in a way that changes each individual's record while preserving the underlying covariate relationships between the variables. Only gender and a link to the first reported marital partner are left unaltered by the synthesis process and retain their original values.
Nine SIPP panels (1984, 1990, 1991, 1992, 1993, 1996, 2001, 2004, and 2008) form the basis for the SSB. A subset of variables available across all the panels was selected for inclusion, and variable definitions were harmonized across the years covered by the panels. Administrative data are added, and some editing is done to correct logical inconsistencies in the IRS and Social Security earnings and benefits data. Thus, the SSB is a particularly appealing data set for new SIPP users because little data preparation is needed. A complete list of variables included in SSB version 6.0, along with details about the harmonization and editing, is available in our Codebook.
As part of the synthesis process, missing survey data and missing administrative data were multiply imputed. The resulting data sets are called the Completed Gold Standard Files and contain all original, non-missing, confidential values and imputed values in place of originally missing data. These files form the basis for evaluating results from the synthetic data. The goal of the SSB is to produce results that are qualitatively the same as results from the Completed Gold Standard Files.
The synthesis process itself involves estimating the joint distribution of all the variables in the data and taking random draws from this modeled distribution. These draws are then used to replace actual data values. This process is repeated multiple times to create a set of 16 files, which are called implicates. For more information on the statistical methods used to create the SSB and formulae for combining results across implicates, please see The Creation and Use of the SIPP Synthetic Beta.
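As a rough illustration of what combining results across implicates can look like, the Python sketch below averages a point estimate over all implicates and computes one commonly used nested combining rule for the total variance. It rests on assumptions not stated above: that the 16 implicates are organized as 4 completed-data implicates, each with 4 synthetic implicates, and that the variance rule takes the form (1 + 1/m)B - b/r + u. The function name `combine_implicates` and the example numbers are hypothetical; the referenced document remains the authoritative source for the formulae.

```python
from statistics import mean


def combine_implicates(estimates, variances):
    """Combine one estimate across nested implicates (illustrative sketch).

    estimates[i][j] and variances[i][j] hold the point estimate and its
    estimated variance from synthetic implicate j nested within
    completed-data implicate i. The nesting itself is an assumption.
    """
    m = len(estimates)       # number of completed-data implicates
    r = len(estimates[0])    # synthetic implicates per completed implicate

    # Mean estimate within each nest, then the overall point estimate.
    nest_means = [mean(row) for row in estimates]
    q_bar = mean(nest_means)

    # Variance of the nest means around the overall mean (between nests).
    between = sum((qm - q_bar) ** 2 for qm in nest_means) / (m - 1)

    # Average variance of estimates around their own nest mean (within nests).
    within = mean([
        sum((q - qm) ** 2 for q in row) / (r - 1)
        for row, qm in zip(estimates, nest_means)
    ])

    # Average of the variances reported by each individual implicate.
    u_bar = mean([mean(row) for row in variances])

    # Assumed combining rule; check it against the referenced document.
    total_variance = (1 + 1 / m) * between - within / r + u_bar
    return q_bar, total_variance


# Hypothetical regression coefficients from a 4 x 4 set of implicates.
est = [
    [0.52, 0.49, 0.51, 0.50],
    [0.47, 0.48, 0.50, 0.49],
    [0.53, 0.51, 0.52, 0.54],
    [0.50, 0.49, 0.48, 0.51],
]
var = [[0.0004] * 4 for _ in range(4)]
point, total_var = combine_implicates(est, var)
print(point, total_var)
```

In practice the same calculation would be run on the SDS in SAS or Stata; Python is used here only to keep the arithmetic easy to follow.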
Before releasing these data, Census staff performed extensive testing for disclosure risk and determined that the probability of linking a record from the SSB to an actual person was negligible. For details on this risk assessment, see DRB Memo SSB Version 6.0. Based on the conclusions of this testing, the Census Disclosure Review Board and their counterparts at IRS and SSA have approved these data for use by researchers working outside secure Census facilities.
Version 6.0 of the SSB allows for a longer time series of analysis in three ways. First, the SSB 6.0 includes two additional SIPP panels (1984 and 2008); second, the administrative W-2 earnings records now extend through 2011; and third, OASDI and SSI benefit information is available through 2012. Several new SIPP time series variables were added to version 6.0 of the SSB, including monthly indicators of receipt and amounts of AFDC or TANF benefits, workers’ compensation benefits, veterans’ compensation or pension benefits, and food stamp or SNAP benefits. Two additional SIPP point-in-time variables, a sample base weight and a life insurance policy indicator, are now available on the file. Also, the fertility history variables (e.g., first_admin_birthdate and last_admin_birthdate) have been improved by reconciling children’s administrative birth dates with the birth dates reported by the children’s mothers. Finally, we have added many new variables on SSDI and SSI applications, including when applications were submitted, when and whether they were approved or denied, and a diagnosis group classifying the type of disability.
Version 5.1 of the SSB incorporates modeling improvements and new SIPP variables that expand the scope of analyses that can be performed relative to version 5.0 (released in 2010). In particular, we have added the following SIPP monthly time-series variables: weeks with a job, weeks with pay, usual hours worked, survey-reported earnings, total personal income, any health insurance coverage, and employer-provided health insurance coverage. We have also added two often-requested variables: first, a categorical variable for state of residence at the beginning of the SIPP panel; second, an indicator for whether the individual was linked to administrative records via SSN or whether these records were imputed because no SSN was available. Finally, we have edited the administrative earnings variables prior to the data completion and synthesis process in order to modify some values that we determined to be clerical data errors.
Researchers must submit an application to use the Synthetic Data Server. The application requires contact information, a brief description of the project, and a list of variables to be used. File access will be approved or denied based only on the feasibility of the proposal, which is determined by evaluating whether the data necessary to conduct the analysis are included on the file. Census generally expects to be able to approve applications within five business days. To apply, please submit the "Application to use the SIPP Synthetic Beta File" to email@example.com.
The SSB is housed on the Synthetic Data Server (SDS) at the Virtual Research Data Center at Cornell University. A free account on the SDS is created for each approved user. Using this account, researchers may run SAS and Stata programs using SSB data. Census staff will email program and log files to researchers upon request and without disclosure review since these data are public-use. Users will receive instructions about accessing their account for the first time once their application is approved.
The data synthesis process employed by Census to protect the linked data from the risk of disclosing the identity of individuals is relatively new and substantially changes both the survey and administrative data. The intent of the modeling done as part of the synthesis is to preserve relationships among variables that are of interest to researchers while ensuring that personally identifiable information is not revealed to the data user. It is not feasible to ensure accuracy by comparing every relationship among SSB variables with the corresponding relationship in the underlying confidential micro-data. Hence, we strongly urge researchers not to publish results produced from the SSB without first requesting that Census validate these results with confidential data housed in a secure environment at the Census Bureau. Census will perform this validation free of charge to researchers, as resources permit and according to the protocol established by the three agencies involved and outlined below.
Without validation of results, Census, SSA, and IRS make no guarantee of the validity of the SSB for any research purpose.
Census will validate results obtained from the SSB on the internal, confidential version of these data (Completed Gold Standard Files). Users who wish to obtain validated results should follow the protocol outlined here.
We are always interested in hearing from users about which variables they would like to see added to the file. Similarly, unexpected data patterns or variable values, from either SSB or Gold Standard results, should be reported to Census using our SSB email address, firstname.lastname@example.org, in order to help us continually improve the file.
We request that researchers who publish results from analyses done using these data cite the SSB as their data source and acknowledge the use of the SDS server at Cornell and the support of Census staff in running any validation programs. These citations will help ensure continued funding for the SDS server and the creation of the Gold Standard File and the SSB.
"This analysis was first performed using the SIPP Synthetic Beta (SSB) on the Synthetic Data Server housed at Cornell University which is funded by NSF Grant #SES-1042181. These data are public use and may be accessed by researchers outside secure Census facilities. For more information, visit http://www.census.gov/programs-surveys/sipp/methodology/sipp-synthetic-beta-data-product.html. Final results for this paper were obtained from a validation analysis conducted by Census Bureau staff using the SIPP Completed Gold Standard Files and the programs written by this author and originally run on the SSB. The validation analysis does not imply endorsement by the Census Bureau of any methods, results, opinions, or views presented in this paper."
U.S. Census Bureau. SIPP Synthetic Beta: Version 6.0 [Computer file]. Washington DC; Cornell University, Synthetic Data Server [distributor], Ithaca, NY, 2015.