Missing Data & Observational Data Modeling

Motivation:

Missing data problems are endemic in statistical experiments and data collection operations. Investigators almost never observe all the outcomes they set out to record. In sample surveys and censuses, this means that individuals or entities fail to respond, or provide only part of the information requested. Even when a response is obtained, the information provided may be logically inconsistent, which is tantamount to missing. Agencies need to compensate for these types of missing data to compute official statistics. As data collection becomes more expensive and response rates decrease, observational data sources such as administrative records and commercial data provide a potentially effective way forward. Statistical modeling techniques are useful for identifying observational units and/or planned questions for which quality alternative source data exist. For such units, sample survey or census responses can be supplemented or replaced with information obtained from quality observational data rather than traditional data collection. All these missing data problems and associated techniques involve statistical modeling along with subject matter experience.

 

Research Problems:

  • Correct quantification of the reliability of estimates with imputed values, as variances can be substantially greater than those computed nominally, and methods for adjusting the variance to reflect the additional uncertainty created by the missing data.
  • Simultaneous imputation of multiple survey variables to maintain joint properties, related to methods of evaluation of model-based imputation methods.
  • Integrating editing and imputation of sample survey and census responses via Bayesian multiple imputation and synthetic data methods.
  • Nonresponse adjustment and imputation using administrative records, based on propensity and/or multiple imputation models.
  • Development of joint modeling and imputation of categorical variables using log-linear models for (sometimes sparse) contingency tables.
  • Statistical modeling (e.g., latent class models) for combining sample survey, census and/or alternative source data.
  • Statistical techniques (e.g., classification methods, multiple imputation models) for using alternative data sources to augment or replace actual data collection.
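
As an illustration of the first research problem above, the following minimal sketch pools estimates from several imputed datasets using Rubin's combining rules, a standard textbook approach to reflecting imputation uncertainty in a variance. It is not presented as the method used in this research; the function name `pool_rubin` and its inputs are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool point estimates from m imputed datasets (Rubin's combining rules).

    estimates: one point estimate per imputed dataset (hypothetical input).
    within_variances: the nominal variance of each of those estimates.
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations of matching length")

    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(within_variances)   # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)

    # Total variance: nominal variance plus an inflation term for missingness.
    total = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, total
```

The total variance exceeds the average nominal variance whenever the estimates differ across imputations, which is the effect the first bullet describes.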

 

Current Subprojects:

  • Data Editing and Imputation for Nonresponse (Thibaudeau, Morris, Shao)
  • Imputation and Modeling using Observational/Alternative Data Sources (Morris, Thibaudeau)

 

Potential Applications:

  • Modeling approaches for integrating Economic Census editing and imputation processing, and developing multiple synthetic industry-level Economic Census micro-data.
  • Modeling approaches for using administrative records in lieu of or to supplement Decennial Census field visits due to imminent and future design decisions.
  • Adapt survey questions in the American Community Survey based on modeling of administrative record quality.
  • Produce multiply imputed, synthetic and/or composite estimates of more geographically granular and timely economic activity based on third party data.

 

Accomplishments (October 2018-September 2020):

  • Researched, adapted, and implemented nonparametric Bayesian hierarchical models developed by Kim et al. (2017) for integrating Economic Census editing and imputation processing with developing multiple synthetic industry-level Economic Census micro-data that can be publicly shared in place of suppressed estimates.
  • Collaborated in adapting an R package based on Kim et al. (2017) to be specifically tailored to edit and multiply impute Economic Census data, and documented specifications in a user’s guide.
  • Developed multiple synthetic generators to produce industry-level Economic Census micro-data.
  • Collaborated to develop Bayesian multiple imputation models for using third party data to produce geographically granular and timely retail sales experimental estimates.
  • Applied and completed evaluation of optimization methods for raking balance complexes in the Quarterly Financial Report (QFR) when items can take negative values.
  • Showed how to use log-linear models coupled with complementary logistic regression to improve the efficiency (reducing the sampling error) of estimates of gross flows and month-to-month proportions classified by demographic variables. Illustrated methodology on labor force measurements and gross flows estimated from the Current Population Survey.
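
For readers unfamiliar with raking, the sketch below shows classical iterative proportional fitting on a two-way table, assuming all cells are nonnegative. It is only a baseline illustration: the negative-value case mentioned for the QFR is outside what this simple scaling scheme handles, and the function name `rake` and its parameters are hypothetical.

```python
def rake(table, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Classical raking (iterative proportional fitting) of a two-way table.

    Assumes nonnegative cells and row/column targets with equal grand totals.
    Returns a new table; the input is not modified.
    """
    cells = [list(row) for row in table]
    for _ in range(max_iter):
        # Scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            if total > 0:
                cells[i] = [v * target / total for v in cells[i]]
        # Scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            if total > 0:
                for row in cells:
                    row[j] *= target / total
        # Columns are exact after the step above; stop once rows also agree.
        gap = max(abs(sum(cells[i]) - t) for i, t in enumerate(row_targets))
        if gap < tol:
            break
    return cells
```

Because each step multiplies cells by a scaling factor, cells that start at zero remain zero, and a cell that starts negative can make a row or column total zero or negative, which is one reason this scheme is not suitable when items can take negative values.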

 

Short-Term Activities (FY 2021 - FY 2023):

  • Continue researching modeling approaches for using administrative records in lieu of Decennial Census field visits due to imminent design decisions.
  • Continue to investigate the feasibility of using third party (“big”) data from various available sources to supplement and/or enhance retail sales estimates.
  • Continue research, implementation, and resolution of editing and data issues when applying non-parametric Bayesian editing methods to edit and multiply impute Economic Census data.
  • Continue research on integration of Bayesian editing and multiple imputation processing with disclosure avoidance and data synthesis processing.
  • Extend the analysis and estimation of changes in the labor force status using log-linear models coupled with matching logistic regression methods to the Current Population Survey.
  • Research novel categorical distributions for contingency table modeling and joint imputation of categorical variables particularly for clustered data.
  • Continue research on accounting for observed zero cells in log-linear models for sparse contingency tables.

 

Longer-Term Activities (beyond FY 2023):

  • Joint modeling of response propensity and administrative source accuracy.
  • Research practical ways to apply decision theoretic concepts to the use of administrative records (versus personal contact or proxy response) in the Decennial Census.
  • Further development of joint administrative record and imputation modeling based on latent class models.
  • Research imputation methods for a Decennial Census design that incorporates adaptive design and administrative records to reduce contacts and consequently increases proxy response and nonresponse.
  • Joint models of attrition (or response rate) and clustered categorical outcomes using shared random effects with innovative GLMM computational techniques.
  • Extend small area estimation modeling for longitudinal data (survey and/or third party) in the presence of attrition and/or other types of missing data using log-linear models in tandem with logistic regression.

 

Selected Publications:

Morris, D.S. and Raim, A.M. (2023). “Comparing Trial and Variable Association in Contingency Table Data Using Multinomial Models for Clustered Data.” In Proceedings of the 37th International Workshop on Statistical Modelling. Dortmund, Germany: Statistical Modelling Society, 536-542.

Kang, J., Morris, D.S., Joyce, P., and Dompreh, I. (In Press). “On Calibrated Inverse Probability Weighting and Generalized Boosting Propensity Score Models for Mean Estimation with Incomplete Survey Data,” Wiley Interdisciplinary Reviews (WIREs) Computational Statistics.

Morris, D.S. and Sellers, K.F. (2022). “A Flexible Mixed Model for Clustered Count Data,” Stats: Special Issue on Statistics, Data Analytics, and Inferences for Discrete Data, 5(1): 52–69. https://doi.org/10.3390/stats5010004.

Liu, B., Dompreh, I., and Hartman, A.M. (2021). “Small Area Estimation of Smoke-Free Workplace Policies and Home Rules in U.S. Counties,” Journal of Nicotine and Tobacco Research.

Dumbacher, B., Morris, D.S., and Hogue, C. (2019). “Using Electronic Transaction Data to Add Geographic Granularity to Official Estimates of Retail Sales,” Journal of Big Data, 6(80).

Keller, A., Mule, V.T., Morris, D.S., and Konicki, S. (2018). “A Distance Metric for Modeling the Quality of Administrative Records for Use in the 2020 Census,” Journal of Official Statistics, 34(3): 1-27.

Winkler, W. E. (2018).  “Cleaning and Using Administrative Lists: Enhanced Practices and Computational Algorithms for Record Linkage and Modeling/Edit/Imputation,” Research Report Series (Statistics #2018-05), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.

Morris, D. S. (2017). “A Modeling Approach for Administrative Record Enumeration in the Decennial Census,” Public Opinion Quarterly: Special Issue on Survey Research, Today and Tomorrow, 81(S1): 357-384.

Thibaudeau, Y., Slud, E., and Gottschalck, A. O. (2017). “Modeling Log-Linear Conditional Probabilities for Estimation in Surveys,” Annals of Applied Statistics 11(2), 680-697.

Morris, D.S., Keller, A., and Clark, B. (2016). “An Approach for Using Administrative Records to Reduce Contacts in the 2020 Census,” Statistical Journal of the International Association for Official Statistics, 32(2): 177-188.

Thibaudeau, Y. and Morris, D.S. (2016). “Bayesian Decision Theory to Optimize the Use of Administrative Records in Census NRFU,” Proceedings of the Joint Statistical Meetings.  Alexandria, VA: American Statistical Association.

Bechtel, L., Morris, D.S., and Thompson, K.J. (2015). “Using Classification Trees to Recommend Hot Deck Imputation Methods: A Case Study.” In FCSM Proceedings. Washington, DC: Federal Committee on Statistical Methodology.

Garcia, M., Morris, D.S., and Diamond, L.K. (2015). “Implementation of Ratio Imputation and Sequential Regression Multivariate Imputation on Economic Census Products.” Proceedings of the Joint Statistical Meetings.

Winkler, W. and Garcia, M. (2009). “Determining a Set of Edits,” Research Report Series (Statistics #2009-05), Statistical Research Division, U.S. Census Bureau, Washington, D.C.

Winkler, W. E. (2008). “General Methods and Algorithms for Imputing Discrete Data under a Variety of Constraints,” Research Report Series (Statistics #2008-08), Statistical Research Division, U.S. Census Bureau, Washington, D.C.

Thibaudeau, Y. (2002). “Model Explicit Item Imputation for Demographic Categories,” Survey Methodology, 28(2), 135-143.

 

Contact:

Darcy Morris, Joseph Kang, Shane Lubold, Isaac Dompreh, Yves Thibaudeau, Jun Shao, Eric Slud, Xiaoyun Lu

 

Funding Sources for FY 2021-2025:         

0331 – Working Capital Fund / General Research Project

Various Decennial, Demographic, and Economic Projects

Page Last Revised - January 3, 2024