Douglass Huang, Todd Johnsson
ADI LLC
We have devised a method, Record Linkage Production Data Quality (RL-PDQ), to enhance the data quality of any record linkage system without disrupting the core design or operation of that system. As the Census Bureau increases its use of administrative records, one potential scenario involves using record linkage to incorporate administrative records as primary survey data for the 2030 Census, reducing respondent burden and operational costs. In this session we will review the theory and supporting research, which indicate a further 70-80% reduction in false positives (which affect survey data quality) and a further 96-99% reduction in false negatives (which affect traditional enumeration efforts) when RL-PDQ is applied in such a scenario. This outcome would improve production data quality and lead to substantial, measurable enumeration cost savings.
Kevin A. Hunt, Rachael V. Jennings, Denise A. Abreu
National Agricultural Statistics Service
The USDA National Agricultural Statistics Service (NASS) collects data from farm operators using a list frame. A farm is any place with $1,000 or more in sales or potential sales. The list frame is not complete. In 2017, a report by the National Academies of Sciences, Engineering, and Medicine (NASEM) recommended that NASS update the list frame to a georeferenced farm-level database and develop linkages to other administrative sources to improve the coverage of the list frame. The Agency has developed an approach using an alternative geospatial field-level boundaries dataset called Crop Sequence Boundaries (CSBs). The CSBs are developed by stacking 8 years of Cropland Data Layers (CDLs) and applying an algorithm-based approach to identify field boundaries of homogeneously cropped areas over time. The approach overlays the CSBs with georeferenced data called Common Land Units (CLUs) from the USDA Farm Service Agency (FSA) to identify areas without CLU coverage. CSBs without CLU coverage are used to retrieve additional tax assessor landowner parcel information from CoreLogic Inc. The parcel records are linked to the NASS list frame, and any new records are selected and surveyed to determine their status as a farm. This resulted in thousands of new farm operator records being added to the NASS list frame. The results from 12 states are presented here.
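The overlay step described above can be illustrated with a minimal sketch. It is not the Agency's implementation: it assumes, purely for illustration, that each CSB and CLU has been rasterized to a set of grid cells, and the function names, identifiers, and the 0.5 threshold are all hypothetical.

```python
def uncovered_fraction(csb_cells, clu_cells):
    """Share of a CSB's grid cells that no CLU covers."""
    if not csb_cells:
        return 0.0
    return len(csb_cells - clu_cells) / len(csb_cells)


def flag_uncovered(csbs, clus, threshold=0.5):
    """Return IDs of CSBs whose uncovered share meets the (hypothetical) threshold.

    csbs, clus: dicts mapping an ID to a set of (row, col) grid cells.
    """
    # Union of all CLU cells: any cell in this set counts as covered.
    clu_union = set().union(*clus.values())
    return sorted(
        csb_id
        for csb_id, cells in csbs.items()
        if uncovered_fraction(cells, clu_union) >= threshold
    )


# Made-up example: csb-1 is mostly covered by a CLU, csb-2 is not covered at all.
csbs = {
    "csb-1": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "csb-2": {(5, 5), (5, 6)},
}
clus = {"clu-a": {(0, 0), (0, 1), (1, 0)}}

candidates = flag_uncovered(csbs, clus)
print(candidates)
```

In this sketch only the flagged CSBs would go on to the parcel lookup step; a production workflow would operate on polygon geometries rather than grid cells.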
Krista Park1, Mishal Ahmed1, W. Glenn Ambill1, John Cuffe1, Khoa Dong1, Suzanne Dorinski1, Juan Humud1, Shawn Klimek1, Daniel Moshinsky1, Kevin Shaw1, Damon Smith1, Yves Thibaudeau1, Christine Tomaszewski1, Victoria Udalova1, Daniel Weinberg1, Daniel Whitehead1, Casey Blalock1, Steven Nesbit2
1U.S. Census Bureau, 2PORTAL Technologies
Over 17 months, a team of more than 25 Census Bureau record linkage practitioners from across the Decennial, Economic, Demographic, and Research directorates met to define and describe requirements for record linkage systems. These requirements will facilitate moving record linkage activities from specialist production or research into the everyday production environment, as demand increases for new data products that enable evidence-based policymaking and as alternative data sources are increasingly harnessed to develop more of the products previously created through survey data collection. Over the course of this work, the team defined two assessment tools for conducting a landscape analysis of record linkage solutions. This presentation and accompanying paper disseminate the team's two assessment tools, which take as input solution requirements and associated weighting definitions. The first tool, the Technical Solutions Assessment (TSA), is a comprehensive framework designed to analyze robust commercial solutions. It consists of 378 capability requirements across six categories: Technical, Operations, Performance, Cost Factors, User Experience, and Security. After reviewing 23 internal, commercial, open-source, academic, and federal record linkage systems, the team produced a narrower tool, the Condensed TSA, designed to better assess non-commercial solutions, which often do not provide stand-alone, end-to-end data handling throughout the record linkage pipeline. The team then used this Condensed TSA, consisting of 194 requirements, to assess internal Census Bureau and open-source record linkage solutions in preparation for a validating benchmarking test.
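A tool that takes requirements and weighting definitions as input can be pictured as a weighted scoring calculation. The sketch below is only an illustration under stated assumptions: the abstract does not describe the TSA's scoring formula, so the weighted-average rule, the requirement IDs, the weights, and the 0-to-1 rating scale are all hypothetical.

```python
from collections import defaultdict


def weighted_scores(requirements, ratings):
    """Compute per-category and overall weighted scores for one solution.

    requirements: list of (requirement_id, category, weight) tuples.
    ratings: dict mapping requirement_id to a rating in [0, 1];
             unrated requirements count as 0 (an assumption of this sketch).
    """
    earned = defaultdict(float)
    possible = defaultdict(float)
    for req_id, category, weight in requirements:
        possible[category] += weight
        earned[category] += weight * ratings.get(req_id, 0.0)

    by_category = {c: earned[c] / possible[c] for c in possible if possible[c] > 0}
    total_possible = sum(possible.values())
    overall = sum(earned.values()) / total_possible if total_possible else 0.0
    return by_category, overall


# Made-up requirements in two of the six categories named in the abstract.
requirements = [
    ("T-1", "Technical", 3.0),
    ("T-2", "Technical", 1.0),
    ("S-1", "Security", 2.0),
]
ratings = {"T-1": 1.0, "T-2": 0.0, "S-1": 0.5}

by_category, overall = weighted_scores(requirements, ratings)
print(by_category, round(overall, 3))
```

Keeping weights separate from ratings, as here, would let the same set of ratings be re-scored under different weighting definitions.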
Ancillary Data Record Linkage to characterize the completeness of data for the All of Us Research Program
Yuyang Yang1, Kelsey Rodriguez2, Melissa Basford2, Abel Kho1, Lew Berman3
1Northwestern University, 2Vanderbilt University Medical Center, 3National Institutes of Health
The All of Us Research Program (AoURP) is a federal initiative to support biomedical research by creating a consented patient cohort of over 1 million Americans. AoURP health data is collected via survey questions, electronic health records (EHR), and biosamples to create a comprehensive picture of participant health. Despite this goal, investigations within the AoURP have found that participant health information is often incomplete. To improve data quality within the AoURP, we describe work to supplement AoURP patient information using insurance claims data via a privacy-preserving record linkage (PPRL) pipeline. We linked EHR data for 400,000 consented AoURP participants with claims data provided by IPM.AI (Swoop Analytics), an analytics company with estimated private and public claims records for over 90% of the U.S. population. We found that claims data contained 1.5x more service dates and diagnoses and 3.2x more procedures compared to AoURP data for the 95% of matched participants. We found variation between AoURP and claims data for different patient profiles, such as patients with certain chronic conditions. Overall, the union of AoURP and IPM.AI data greatly increases data completeness compared to either source alone. Our study has implications for data collection in the AoURP and shows that supplementary linkages can improve data completeness in national research initiatives.
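The abstract does not describe how the PPRL pipeline is built. As a generic illustration of the idea, the sketch below uses one common approach: each party replaces identifying fields with a keyed hash (HMAC-SHA256) and records are matched on the resulting tokens, so raw identifiers are never exchanged. The field names, record IDs, and shared key are hypothetical, and real pipelines typically add tolerance for typos and variant spellings that this exact-match sketch lacks.

```python
import hashlib
import hmac


def make_token(first, last, dob, secret_key):
    """Build a keyed hash token from normalized identifying fields."""
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(records_a, records_b, secret_key):
    """Return (id_a, id_b) pairs whose tokens match exactly."""
    index = {
        make_token(r["first"], r["last"], r["dob"], secret_key): r["id"]
        for r in records_b
    }
    links = []
    for r in records_a:
        token = make_token(r["first"], r["last"], r["dob"], secret_key)
        if token in index:
            links.append((r["id"], index[token]))
    return links


# Made-up records; the key would be shared only between the linking parties.
key = b"hypothetical-shared-secret"
ehr_records = [
    {"id": "ehr-1", "first": " Ada ", "last": "LOVELACE", "dob": "1815-12-10"},
    {"id": "ehr-2", "first": "Alan", "last": "Turing", "dob": "1912-06-23"},
]
claims_records = [
    {"id": "clm-9", "first": "ada", "last": "Lovelace", "dob": "1815-12-10"},
]

links = link_records(ehr_records, claims_records, key)
print(links)
```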
Yunie Le, Nathan Barrett, Allison Nunez, Ekaterina Levitskaya
Coleridge Initiative
The U.S. government seeks evidence to support a more comprehensive understanding of the availability and demand for global science and engineering training and talent. The objective of this project is to establish the foundations of a national data infrastructure that will augment the federal survey data with administrative data linkages and help address unanswered questions about foreign-born scientists and engineers in the United States, beginning with estimating the economic return on investment on U.S. training of Foreign-Born Scientists and Engineers (FBSEs). The goal of the project is to establish the feasibility of constructing a linked data infrastructure, which can help answer important research questions about this population. As part of the project, the team explores and documents the coverage using federal government surveys, including Census, NCES, and NCSES data, and benchmarks it against administrative data from three U.S. states on educational enrollments and completions linked to individual and firm level employment data.