Record Linkage & Machine Learning

Motivation:

Record linkage continues to grow in importance as a fundamental activity in statistical agencies. The number of available administrative lists and commercial files has grown exponentially and presents statistical agencies with opportunities to accumulate information through record linkage to support the production of official statistics. In addition to cost, new obstacles to traditional data collection have emerged in the form of possibly recurrent pandemics. These circumstances further motivate the accumulation of information by linking public, private, and administrative files. Thibaudeau (2020) describes the strides the Census Bureau, a pioneer in record linkage, has made over the years. While this progress is impressive, more is needed. With its own suite of in-house record-linkage software packages, such as the “SAS (PVS) Matcher,” “BigMatch,” “d-blink” and “MAMBA,” and easy access to open-source packages, such as “fastLink” and “RecordLinkage in R,” the Census Bureau now has access to a wide spectrum of methodologies and the potential to rapidly develop and integrate new ones. The Census Bureau must remain abreast of the ever-improving state of the art in record linkage and be prepared to champion its own methodologies as some of the best in the world. Our goal is to achieve the synergy of methods and software that will most benefit the Census Bureau and its mission. System portability is also an objective: the Census Bureau should have the freedom to upgrade its IT infrastructure knowing that record-linkage applications will remain functional.

 

Research Problems:

One challenge is continuing to research and experiment with new methodologies on multiple software platforms while also moving toward integration. Descriptions of such experiments follow:

  • Markov chain Monte Carlo (MCMC) methods, such as those powering d-blink, give full probabilistic characterizations of the record-linkage process and are becoming indispensable for fully understanding it. At the same time, MCMC samplers can be tuned to deliver fast snapshots of the linked population. Research in that direction is crucial. Older programs like BigMatch have been heavily optimized for fast linking but lack nuance. They need to be augmented with richer comparison schemes, such as dictionary-assisted fuzzy string comparisons.
  • New data structures for record linkage of multiple large lists need to be explored. d-blink is an example of a more efficient data structure: node-connected structures minimize the number of comparisons, as opposed to traditional all-pairs comparison. Other structures are possible, such as cyclical linked lists (Thibaudeau 1992), and should be researched.
  • As new techniques continue to be implemented and tested on various existing software (R, Python, C) and computing (Windows, OSX, IRE, CAES) platforms, dominant paradigms are emerging, and work toward integration and unification, while maintaining versatility, is moving into high gear.
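To illustrate why reducing the number of comparisons matters, the following minimal Python sketch contrasts all-pairs comparison with simple key-based blocking. Blocking is used here only as a familiar stand-in; it is not the node-connected structure of d-blink nor the cyclical linked lists cited above. The records, the field names, and the choice of ZIP code as a blocking key are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of comparisons when every record is compared with every other."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Return index pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        # Only records inside the same block are compared.
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical toy records (illustrative only).
records = [
    {"first": "JOHN", "last": "SMITH", "zip": "20746"},
    {"first": "JON", "last": "SMITH", "zip": "20746"},
    {"first": "MARY", "last": "JONES", "zip": "20233"},
    {"first": "MARIE", "last": "JONES", "zip": "20233"},
    {"first": "ALEX", "last": "BROWN", "zip": "20746"},
]

pairs = blocked_pairs(records, key=lambda r: r["zip"])
print(all_pairs_count(len(records)), len(pairs))
```

On these five records, all-pairs comparison requires 10 comparisons, while blocking on ZIP code leaves 4 candidate pairs. The trade-off is that true links whose blocking keys disagree are never compared, which is one motivation for exploring richer structures.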

 

Current Subprojects:

  • Adjusting the Statistical Analysis on Integrated Data (Ben-David)
  • Entity Resolution and Merging Noisy Databases (Steorts, Brown/CES, Blalock/CODS, Thibaudeau)
  • Record-Linkage Support for the Decennial Census (Ben-David, Weinberg, Brown/CES, Thibaudeau)

 

Potential Applications:

  • Possible massive concurrent record-linkage implementations for Census 2030. The objective is counting all distinguishable persons in linked and unduplicated administrative and commercial person-level lists.
  • Unduplication and record-linkage for frame construction in the demographic and economic areas.
  • Re-identification through record-linking for proofing confidentiality of data lists.
  • Analysis and estimation based on linked lists.
  • Linking probabilistic design-based surveys to large non-probability lists and samples for probabilistic calibration.

   

Accomplishments (October 2018-September 2020):

  • Deployed the FEBRL (Peter Christen) Python file simulator on multiple platforms.
  • Used FEBRL to simulate candidate files (up to 500k records each) for record linkage with known “true links.” Used the simulated files to assess the precision and recall of “BigMatch” and the “SAS (PVS) Matcher.”
  • Presented poster at Data Linkage Day 2019 entitled: "False Duplicates in the Census: A Novel Approach to identifying False Matches from Record Linkage Software."
  • Wrote an EM algorithm for estimating the weights of the Fellegi-Sunter record-linkage model for use with “BigMatch” and the R “RecordLinkage” package.
  • Installed BigMatch on multiple platforms (IRE, Windows, MacOS), using simulated data.
  • Ran experiments to measure and compare the speed and CPU-cycle usage of a multi-core linkage package (R fastLink) and an optimized single-core record-linkage program (BigMatch) in a multi-core environment.
  • Used BigMatch for multiple linkage projects, including the linkage of commercial files, in the construction of a master reference file at the person and housing unit levels for research and experimentation in preparation for Census 2030.
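Two of the accomplishments above, EM estimation of Fellegi-Sunter weights and the assessment of precision and recall against known true links, can be illustrated together. The following is a minimal Python sketch, not the program written at the Census Bureau. It assumes the standard conditional-independence form of the Fellegi-Sunter model with binary agreement patterns, and it simulates agreement patterns directly rather than using FEBRL. All parameter values, function names, and the 0.5 posterior cutoff are assumptions made for illustration.

```python
import math
import random


def fs_em(gammas, n_iter=60, p=0.1, m_init=0.9, u_init=0.1, eps=1e-6):
    """EM for the Fellegi-Sunter model, assuming conditionally independent fields.

    gammas: list of 0/1 agreement tuples, one per candidate record pair.
    Returns (p, m, u, g): match proportion, per-field agreement probabilities
    among matches (m) and non-matches (u), and the posterior match
    probabilities from the final E-step.
    """
    k = len(gammas[0])
    m, u = [m_init] * k, [u_init] * k
    g = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for gam in gammas:
            pm, pu = p, 1.0 - p
            for j in range(k):
                pm *= m[j] if gam[j] else 1.0 - m[j]
                pu *= u[j] if gam[j] else 1.0 - u[j]
            g.append(pm / (pm + pu))
        # M-step: re-estimate p, m, u from the posterior weights.
        sg = sum(g)
        su = len(g) - sg
        p = sg / len(g)
        for j in range(k):
            mj = sum(gi for gi, gam in zip(g, gammas) if gam[j]) / sg
            uj = sum(1.0 - gi for gi, gam in zip(g, gammas) if gam[j]) / su
            # Keep probabilities strictly inside (0, 1) so logs stay finite.
            m[j] = min(max(mj, eps), 1.0 - eps)
            u[j] = min(max(uj, eps), 1.0 - eps)
    return p, m, u, g


def fs_weights(m, u):
    """Log2 agreement and disagreement weights for each field."""
    agree = [math.log2(mj / uj) for mj, uj in zip(m, u)]
    disagree = [math.log2((1 - mj) / (1 - uj)) for mj, uj in zip(m, u)]
    return agree, disagree


# Simulated agreement patterns with known truth (hypothetical parameters).
random.seed(0)
M_TRUE = [0.95, 0.90, 0.85, 0.90]
U_TRUE = [0.05, 0.10, 0.20, 0.10]
truth, gammas = [], []
for _ in range(4000):
    is_match = random.random() < 0.2
    probs = M_TRUE if is_match else U_TRUE
    truth.append(is_match)
    gammas.append(tuple(int(random.random() < q) for q in probs))

p_hat, m_hat, u_hat, post = fs_em(gammas)
agree_w, disagree_w = fs_weights(m_hat, u_hat)

# Declare a link when the posterior match probability exceeds 0.5 (assumed cutoff).
declared = [gi > 0.5 for gi in post]
tp = sum(d and t for d, t in zip(declared, truth))
precision = tp / sum(declared)
recall = tp / sum(truth)
print(round(p_hat, 3), round(precision, 3), round(recall, 3))
```

Because the truth is known for the simulated pairs, precision (the share of declared links that are true) and recall (the share of true links that are declared) can be computed directly. In an unsupervised production setting no such truth is available, which is why error-rate estimation remains a research topic.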

 

Short-Term Activities (FY 2021 - FY 2023):

  • Provide advice to individuals who plan to update and maintain the programs for record linkage and related data preparation.
  • Conduct research on record linkage error-rate estimation, particularly for unsupervised and semi-supervised situations.
  • Evaluate “R” vs “Python” packages for record linkage focusing on fuzzy string comparison.
  • Assess the possibility of using a surname and given-name reference directory for record-linkage in decennial-census production.
  • Continue to research statistical and data-science methods for record linkage. Explore and compare in-house and “off-the-shelf” packages implementing these methods. Ascertain the competency of record-linkage methods at the Census Bureau.
  • Extend record linkage outside the PIK universe.
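As a concrete example of the fuzzy string comparison mentioned above, the following Python sketch implements the classical Jaro similarity from its textbook definition. It is a plain reference implementation for illustration, not the comparator used in BigMatch or in any specific R or Python package, and it omits the Winkler prefix adjustment.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2  # number of transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
print(round(jaro("DWAYNE", "DUANE"), 4))   # 0.8222
```

A comparison of R and Python packages could use a small reference function like this one to check that each package returns the expected values on standard test pairs before timing them on large files.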

 

Longer-Term Activities (beyond FY 2023):

  • Construct census-based equivalence dictionaries of U.S. given names and surnames for cross-referencing and supervised learning in record-linkage.
  • Integrate new methods in our in-house record-linkage engines. Consider the integration of off-the-shelf packages when advantageous.
  • Evaluate and compare in-house and off-the-shelf data-science programs and packages (R and/or Python) to construct engines for massive numbers of record-linkage runs for Census 2030.
  • Further develop Markov chain Monte Carlo applications embedding record-linkage methods in massively parallel processing. Develop methods for extracting record-linkage snapshots from MCMC runs.
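One possible representation of a name equivalence dictionary is a set of equivalence classes maintained with a union-find structure, sketched below in Python. This is an assumption about how such a dictionary might be stored, not a description of a Census Bureau design. The class name, its methods, and the example name pairs are hypothetical.

```python
class NameEquivalence:
    """Equivalence classes of names, stored as a union-find forest."""

    def __init__(self):
        self._parent = {}

    def _find(self, name):
        # Unseen names start as their own class.
        self._parent.setdefault(name, name)
        root = name
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited name directly at the root.
        while self._parent[name] != root:
            nxt = self._parent[name]
            self._parent[name] = root
            name = nxt
        return root

    def add_pair(self, a, b):
        """Declare two names equivalent (case-insensitive)."""
        ra, rb = self._find(a.upper()), self._find(b.upper())
        if ra != rb:
            self._parent[rb] = ra

    def equivalent(self, a, b):
        """True if the two names fall in the same equivalence class."""
        return self._find(a.upper()) == self._find(b.upper())


# Hypothetical entries for illustration only.
names = NameEquivalence()
names.add_pair("William", "Bill")
names.add_pair("Bill", "Billy")
names.add_pair("Robert", "Bob")

print(names.equivalent("WILLIAM", "billy"))  # True
print(names.equivalent("William", "Bob"))    # False
```

Note that equivalence classes are transitive, so a nickname shared by two distinct formal names would merge their classes. A production dictionary might instead store weighted or directed relationships, which this sketch does not attempt.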

 

Selected Publications:

Wang, Z., Ben-David, E., and Slawski, M. (2023). “Regularization for Shuffled Data Problems via Exponential Family Priors on the Permutation Group,” Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, 206, 2939-2959. https://proceedings.mlr.press/v206/wang23a.

Steorts, R. (2023). “A Primer on the Data Cleaning Pipeline,” Journal of Survey Statistics and Methodology, 11, 553-568.

Marchant, N.G., Rubinstein, B.I.P., and Steorts, R. (2023), “Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors,” Journal of Survey Statistics and Methodology, 11, 569-596.

Deo, N., Sanguthevar R., Joyanta B., Soliman, A., Weinberg, D., and Steorts, R. (In Press). “Novel Blocking Techniques and Distance Metrics for Record Linkage,” Proceedings of the 25th International Conference on Information Integration and Web Intelligence (iiWAS), Lecture Notes in Computer Science, Springer.

Basak J., Soliman A., Deo N., Haase, K., Mathur, A., Park, K., Steorts, R., Weinberg, D., Sahni, S., and Sanguthevar R. (2023). “On Computing the Jaro Similarity Between Two Strings,” Proceedings of the 19th International Symposium on Bioinformatics Research and Applications, Springer, 31-44.

Aleshin-Guendel, S. and Steorts, R. (In Press). “Monitoring Convergence Diagnostics for Entity Resolution,” Annual Review of Statistics and Its Application.

Betancourt, B., Zanella, G., and Steorts, R. (In Press). “Random Partition Models for Microclustering Tasks,” Journal of the American Statistical Association, Theory and Methods.

Mosaferi, S., Ghosh, M., and Steorts, R. (In Press). “Measurement Error Models for Small Area Estimation,” Communications in Statistics: Simulation and Computation.

Wang, Z., Ben-David, E., Diao, G., and Slawski, M. (In Press). “Estimation in Exponential Family Regression Based on Linked Data Contaminated by Mismatch Error,” Statistics and Its Interface.

Wang, Z., Ben-David, E., Diao, G., and Slawski, M. (2022). “Regression with Linked Datasets Subject to Linkage Error,” Wiley Interdisciplinary Reviews: Computational Statistics, 14(4).

Marchant, N., Kaplan, A., Rubinstein, B., Elazar, D., and Steorts, R. (2021). “d-blink: Distributed End-to-End Bayesian Entity Resolution,” Journal of Computational and Graphical Statistics, 30(2), 406-421.

Slawski, M., Diao, G., and Ben-David, E. (2021). “A Pseudo-Likelihood Approach to Linear Regression with Partially Shuffled Data,” Journal of Computational and Graphical Statistics, DOI: 10.1080/10618600.2020.1870482

Thibaudeau, Y., Slud, E., and Cheng, Y. (2021). “Small-Area Estimation of Cross-Classified Gross Flows Using Longitudinal Survey Data,” Advances in Longitudinal Survey Methodology, 469-489, Peter Lynn ed., Wiley.

Wang, Z., Ben-David, E., Diao, G., and Slawski, M. (2021). “Regression with Linked Datasets Subject to Linkage Error,” Wiley Interdisciplinary Reviews: Computational Statistics, DOI: 10.1002/wics.1570

Thibaudeau, Y. (In progress). “New Record Linkage Solutions for Demographic Methods at the Census Bureau,” Research Report Series (Statistics #2020-??), Center for Statistical Research & Methodology, U.S. Census Bureau, Washington, D.C.

Slawski, M. and Ben-David, E. (2019). “Linear Regression with Sparsely Permuted Data,” Electronic Journal of Statistics, Vol 13, No. 1, 1-36.

Slud, E. and Thibaudeau, Y. (2019). “Multi-outcome Longitudinal Small Area Estimation – A Case Study,” Statistical Theory and Related Fields, DOI: 10.1080/24754269.2019.1669360.

Steorts, R.J., Tancredi, A., and Liseo, B. (2018). “Generalized Bayesian Record Linkage and Regression with Exact Error Propagation” in Privacy in Statistical Databases (Lecture Notes in Computer Science 11126) (Eds.) Domingo-Ferrer, J. and Montes, F., Springer, 297-313.

Steorts, R.J. and Shrivastava, A. (2018). “Probabilistic Blocking with an Application to the Syrian Conflict,” in Privacy in Statistical Databases (Lecture Notes in Computer Science 11126) (Eds.) Domingo-Ferrer, J. and Montes, F., Springer, 314-327.

Winkler, W. E. (2018). “Cleaning and Using Administrative Lists: Enhanced Practices and Computational Algorithms for Record Linkage and Modeling/Editing/Imputation,” in (A.Y. Chun and M. D. Larsen, eds.) Administrative Records for Survey Methodology, J. Wiley, New York: NY.

Thibaudeau, Y., Slud, E., and Gottshalck, A. (2017). “Log-Linear Conditional Probabilities for Estimation in Surveys,” Annals of Applied Statistics, 11, 680-697.

Czaja, W., Hafftka, A., Manning, B., and Weinberg, D. (2015). “Randomized Approximations of Operators and their Spectral Decomposition for Diffusion Based on Embeddings of Heterogeneous Data,” 3rd International Workshop on Compressed Sensing Theory and Its Applications to Radar, Sonar and Remote Sensing (CoSeRa).

Winkler, W. E. (2015). “Probabilistic Linkage,” in (H. Goldstein, K. Harron, C. Dibben, eds.) Methodological Developments in Data Linkage, J. Wiley: New York.

Weinberg, D. and Levy, D. (2014). “Modeling Selective Local Interactions with Memory: Motion on a 2D Lattice,” Physica D 278-279, 13-30.

Winkler, W. E. (2014a). “Matching and Record Linkage,” Wiley Interdisciplinary Reviews: Computational Statistics, http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS1317.html, DOI:10.1002/wics.1317, available from author by request for academic purposes.

Winkler, W. E. (2014b). “Very Fast Methods of Cleanup and Statistical Analysis of National Files,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM.

Winkler, W. E. (2013). “Record Linkage,” in Encyclopedia of Environmetrics. J. Wiley.

Winkler, W. E. (2013). “Cleanup and Analysis of Sets of National Files,” Federal Committee on Statistical Methodology, Proceedings of the Bi-Annual Research Conference, http://www.copafs.org/UserFiles/file/fcsm/J1_Winkler_2013FCSM.pdf., https://fcsm.sites.usa.gov/files/2014/05/J1_Winkler_2013FCSM.pdf

Winkler, W. E. (2011). “Machine Learning and Record Linkage” in Proceedings of the 2011 International Statistical Institute.

Herzog, T. N., Scheuren, F., and Winkler, W. E. (2010). “Record Linkage,” in (Y. H. Said, D. W. Scott, and E. Wegman, eds.) Wiley Interdisciplinary Reviews: Computational Statistics.

Winkler, W. E. (2010). “General Discrete-data Modeling Methods for Creating Synthetic Data with Reduced Re-identification Risk that Preserve Analytic Properties,” https://www.census.gov/srd/papers/pdf/rrs2010-02.pdf .

Winkler, W. E., Yancey, W. E., and Porter, E. H. (2010). “Fast Record Linkage of Very Large Files in Support of Decennial and Administrative Records Projects,” Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA.

Winkler, W. E. (2009a). “Record Linkage,” in (D. Pfeffermann and C. R. Rao, eds.) Sample Surveys: Theory, Methods and Inference, New York: North-Holland, 351-380.

Winkler, W. E. (2009b). “Should Social Security numbers be replaced by modern, more secure identifiers?”, Proceedings of the National Academy of Sciences.

Alvarez, M., Jonas, J., Winkler, W. E., and Wright, R. “Interstate Voter Registration Database Matching: The Oregon- Washington 2008 Pilot Project,” Electronic Voting Technology.

Winkler, W. E. (2008). “Data Quality in Data Warehouses,” in (J. Wang, Ed.) Encyclopedia of Data Warehousing and Data Mining (2nd Edition).

Herzog, T. N., Scheuren, F., and Winkler, W. E. (2007). Data Quality and Record Linkage Techniques, New York, NY: Springer.

Yancey, W. E. (2007). “BigMatch: A Program for Extracting Probable Matches from a Large File,” Research Report Series (Computing #2007-01), Statistical Research Division, U.S. Census Bureau, Washington, D.C.

Winkler, W. E. (2006a). “Overview of Record Linkage and Current Research Directions,” Research Report Series (Statistics #2006-02), Statistical Research Division, U.S. Census Bureau, Washington, D.C.

Winkler, W. E. (2006b). “Automatically Estimating Record Linkage False-Match Rates without Training Data,” Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA, CD-ROM.

Yancey, W. E. (2005). “Evaluating String Comparator Performance for Record Linkage,” Research Report Series (Statistics #2005-05), Statistical Research Division, U.S. Census Bureau, Washington, D.C.

Thibaudeau, Y. (2002). “Model Explicit Item Imputation for Demographic Categories,” Survey Methodology, 28, 135-143.

Thibaudeau, Y. (1993). “The Discrimination Power in Dependency Structure in Record Linkage,” Survey Methodology, 19, 31-38.

Thibaudeau, Y. (1992). “Identifying Discriminatory Models in Record Linkage,” Proceedings of the Section on Statistical Computing, American Statistical Association, Alexandria, VA.

Winkler, W. and Thibaudeau, Y. (1991). “An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 Decennial Census,” Research Report Series (Statistics) RR91/09, Statistical Research Division, U.S. Census Bureau, Washington, D.C.

 

Contact:

Yves Thibaudeau, Edward H. Porter, Emanuel Ben-David, Rebecca Steorts, Dan Weinberg

 

Funding Sources for FY 2021-2025:

0331 – Working Capital Fund / General Research Project

Various Decennial, Demographic, and Economic Projects


Page Last Revised - January 3, 2024