Simulation, Data Science, & Visualization

Motivation

Simulation studies that are carefully designed under realistic sample survey or census conditions can be used to evaluate the quality of new statistical methodology for Census Bureau data. Furthermore, new computationally intensive statistical methodology is often beneficial because it can require less strict assumptions, offer more flexibility in sampling or modeling, accommodate complex features in the data, and enable valid inference where other methods might fail. Statistical modeling is at the core of both the design of realistic simulation studies and the development of computationally intensive statistical methods. Modeling also enables efficient use of all available information when producing estimates. Such studies can benefit from software for data processing, especially for large data sets from nontraditional sources. Data visualizations can help reveal insights. Statistical disclosure avoidance methods are also developed, and their properties are studied.

Research Problems

Systematically develop an environment for simulating complex sample surveys that can be used as a test-bed for new data analysis methods.
Develop new methods for statistical disclosure control that simultaneously protect confidential data from disclosure while enabling valid inferences to be drawn on relevant population parameters.
Develop models for the analysis of measurement errors in Demographic sample surveys (e.g., Current Population Survey or the Survey of Income and Program Participation).
Investigate noise infusion and synthetic data for statistical disclosure control.
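
As a rough illustration of noise infusion, the sketch below contrasts top coding with noise multiplication on a skewed variable. It is a minimal example under stated assumptions, not the procedure of any particular publication: the log-normal "income" data, the 95th-percentile threshold, the Uniform(0.9, 1.1) noise distribution, and the function names `top_code` and `noise_multiply` are all hypothetical choices made for illustration.

```python
import random
import statistics


def top_code(values, threshold):
    """Replace every value above the threshold with the threshold itself."""
    return [min(v, threshold) for v in values]


def noise_multiply(values, threshold, rng, low=0.9, high=1.1):
    """Multiply each value above the threshold by independent Uniform(low, high) noise.

    The noise has mean 1 when low and high are symmetric about 1 (an assumption
    of this sketch), so the perturbed values are unbiased for the originals.
    """
    return [v * rng.uniform(low, high) if v > threshold else v for v in values]


rng = random.Random(2024)

# Hypothetical skewed "income" data drawn from a log-normal distribution.
incomes = [rng.lognormvariate(10.5, 0.8) for _ in range(20_000)]

# Protect the upper tail: values above the (approximate) 95th percentile.
threshold = sorted(incomes)[len(incomes) * 95 // 100]

top_coded = top_code(incomes, threshold)
noisy = noise_multiply(incomes, threshold, rng)

true_mean = statistics.fmean(incomes)
top_coded_mean = statistics.fmean(top_coded)
noisy_mean = statistics.fmean(noisy)

print(f"true mean:        {true_mean:,.0f}")
print(f"top-coded mean:   {top_coded_mean:,.0f}")
print(f"noise-mult. mean: {noisy_mean:,.0f}")
```

In this setup the top-coded mean is biased downward because the tail is truncated, while the noise-multiplied mean stays close to the true mean because the noise averages to 1. A real application would also need to assess how much disclosure protection a given noise distribution provides.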

Current Subprojects

Visualizing the United States (Yau)
The Ranking Project: Methodology Development and Evaluation (Wright, Yau, Wieczorek/Colby College, Hall)

Potential Applications

Simulating data collection operations using Monte Carlo techniques can help the Census Bureau evaluate and implement operational changes more efficiently.
Use noise multiplication or synthetic data as an alternative to top coding for statistical disclosure control in publicly released data. Both noise multiplication and synthetic data have the potential to preserve more information in the released data than top coding does.
Rigorous statistical disclosure control methods allow for the release of new microdata products.
An environment for simulating complex sample surveys can be used to evaluate the statistical properties of new methods for missing data imputation, model-based estimation, small area estimation, etc.
Model-based estimation procedures enable efficient use of auxiliary information (for example, Economic Census information in business surveys), and can be applied in situations where variables are highly skewed and sample sizes are not sufficiently large to justify normal approximations. These methods may also be applicable to the analysis of data arising from a mechanism other than random sampling.
Variance estimates and confidence intervals in complex sample surveys can be obtained via the bootstrap.
Modeling approaches with administrative records can help enhance the information obtained from various sample surveys.
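
The bootstrap item above can be sketched in a few lines. The example below is a deliberately simplified illustration, assuming a stratified design in which units are resampled with replacement within each stratum; it ignores finite population corrections, clustering, and the rescaling used in survey-specific bootstrap variants. The strata, weights, data, and function names are hypothetical.

```python
import random
import statistics


def weighted_mean(sample):
    """Weighted mean of (value, weight) pairs."""
    total_weight = sum(w for _, w in sample)
    return sum(v * w for v, w in sample) / total_weight


def stratified_bootstrap(strata, n_reps, rng):
    """Resample units with replacement within each stratum.

    Returns one weighted-mean estimate per bootstrap replicate.
    """
    replicates = []
    for _ in range(n_reps):
        resample = []
        for units in strata.values():
            resample.extend(rng.choices(units, k=len(units)))
        replicates.append(weighted_mean(resample))
    return replicates


rng = random.Random(7)

# Hypothetical two-stratum sample: each unit is (value, sampling weight).
strata = {
    "urban": [(rng.gauss(50, 10), 20.0) for _ in range(200)],
    "rural": [(rng.gauss(30, 5), 60.0) for _ in range(100)],
}

full_sample = [unit for units in strata.values() for unit in units]
estimate = weighted_mean(full_sample)

n_reps = 1000
replicates = sorted(stratified_bootstrap(strata, n_reps, rng))

# Bootstrap variance estimate and an approximate 95% percentile interval.
variance_estimate = statistics.variance(replicates)
ci_lower = replicates[int(0.025 * n_reps)]
ci_upper = replicates[int(0.975 * n_reps) - 1]

print(f"estimate: {estimate:.2f}")
print(f"bootstrap variance: {variance_estimate:.4f}")
print(f"approx. 95% interval: ({ci_lower:.2f}, {ci_upper:.2f})")
```

Resampling within strata keeps each replicate's stratum sample sizes equal to those of the original sample, which is the main feature that distinguishes this from a naive bootstrap over the pooled data.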

Accomplishments (October 2020 – September 2024)

Developed, published, and updated visualizations for comparing populations.
Developed and published theory, and updated a visualization, for expressing uncertainty in an overall ranking of populations.
Conducted and published research results regarding noise infusion and synthetic data.

Short-Term Activities (FY 2025 – FY 2027)

Continue development to visualize the United States.
Continue developing new methodology for statistical disclosure control and evaluating the properties of new and existing methods.
Improve visualizations for comparing populations and for overall rankings of populations.

Longer-Term Activities (Beyond FY 2027)

Continue development to visualize the United States.
Study ways of quantifying the privacy protection/data utility tradeoff in statistical disclosure control.
Create an environment for simulating complex aspects of economic/demographic sample surveys.
Develop methodology for quantifying uncertainty in statistical rankings, and refine visualizations.
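
One simple way to express uncertainty in a ranking is to translate confidence intervals for the population parameters into a set of possible ranks for each population. The sketch below is a generic illustration of that idea and is not claimed to reproduce the procedure in any of the publications listed here. It assumes the intervals have joint (simultaneous) coverage, for example through a Bonferroni adjustment; the interval values and the function name `rank_ranges` are hypothetical.

```python
def rank_ranges(intervals):
    """Map {name: (lower, upper)} to {name: (lowest_rank, highest_rank)}.

    Rank 1 denotes the smallest parameter. A population's lowest possible rank
    is 1 plus the number of populations whose intervals lie entirely below its
    own; its highest possible rank is K minus the number of populations whose
    intervals lie entirely above its own.
    """
    k = len(intervals)
    ranges = {}
    for name, (lower, upper) in intervals.items():
        surely_smaller = sum(
            1
            for other, (_, other_upper) in intervals.items()
            if other != name and other_upper < lower
        )
        surely_larger = sum(
            1
            for other, (other_lower, _) in intervals.items()
            if other != name and other_lower > upper
        )
        ranges[name] = (1 + surely_smaller, k - surely_larger)
    return ranges


# Hypothetical joint confidence intervals for four populations.
intervals = {
    "A": (1.0, 2.0),
    "B": (1.5, 3.0),
    "C": (3.5, 4.0),
    "D": (2.5, 3.8),
}

ranges = rank_ranges(intervals)
for name, (lowest, highest) in ranges.items():
    print(f"{name}: rank between {lowest} and {highest}")
```

For these intervals, A can only be ranked 1 or 2, B between 1 and 3, C either 3 or 4, and D between 2 and 4. If the intervals jointly cover the true parameters with probability at least 1 - α, the rank sets jointly contain the true ranks with at least that probability, because whenever every parameter lies in its interval, every true rank lies in its range.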

Selected Publications

Journal Articles, Peer Review

Basak, B. and Sinha, B. (2024). “Analysis of One-way ANOVA Model Using Synthetic Data,” Sankhya (B), Volume 86, 164-190.

Basak, B., Yehenew, G.K., and Sinha, B. (In Press). “Confidence Ellipsoids of a Multivariate Normal Mean Vector Based on Noise Perturbed and Synthetic Data with Applications,” Journal of Society of Statistics, Computer and Applications (SSCA), Special Issue Dedicated to the Fond Memories of Prof C.R. Rao on “Life and Work of C.R. Rao (1920-2023): The Revolutionary of Statistical Sciences,” Vol 22.

Basak, B. and Sinha, B. (In Press). “Comparison of Tests and Confidence Intervals for Univariate Normal Mean Based on Multiply Imputed Synthetic Data Obtained by Posterior Predictive Sampling,” Calcutta Statistical Association Bulletin.

Guin, A., Roy, A., and Sinha, B. (2022). “Bayesian Analysis of Multiply Imputed Synthetic Data under the Multiple Linear Regression Model,” International Journal of Statistical Sciences, Volume 22(2), 25-38.

Guin, A., Roy, A., and Sinha, B. (2023). “Bayesian Analysis of Singly Imputed Synthetic Data under the Multivariate Normal Model,” International Journal of Statistical Sciences, Volume 23(2), November 2023.

Moura, R., Klein, M., Zylstra, J., Coelho, C., and Sinha, B. (In Press). “Inference for Multivariate Regression Model Based on Synthetic Data Generated Under Plug-In Sampling,” Journal of the American Statistical Association (Theory & Methods).

Chai, J. and Nayak, T.K. (2021). “Minimax Randomized Response Methods for Protecting Respondent’s Privacy,” Communications in Statistics - Theory and Methods, https://doi.org/10.1080/03610926.2021.1973503

Klein, M., Wright, T., and Wieczorek, J. (2020). “A Joint Confidence Region for an Overall Ranking of Populations,” Journal of the Royal Statistical Society, Series C, 69, Part 3, 589-606.

Klein, M.D., Zylstra, J., and Sinha, B.K. (2019). “Finite Sample Inference for Multiply Imputed Synthetic Data under a Multiple Linear Regression Model,” Calcutta Statistical Association Bulletin. https://doi.org/10.1177/0008068318803814

Wright, T., Klein, M., and Wieczorek, J. (2019). “A Primer on Visualizations for Comparing Populations Including the Issue of Overlapping Confidence Intervals,” The American Statistician, Vol 73, No. 2, 165-178.

Chai, J. and Nayak, T.K. (2018). “A Criterion for Privacy Protection in Data Collection and Its Attainment via Randomized Response Procedures,” Electronic Journal of Statistics, 12, 4264-4287.

Klein, M. and Datta, G. (2018). “Statistical Disclosure Control via Sufficiency under the Multiple Linear Regression Model,” Journal of Statistical Theory and Practice, 12(1), 100-110.

Nayak, T.K., Zhang, C., and You, J. (2018). “Measuring Identification Risk in Microdata Release and Its Control by Post-randomisation,” International Statistical Review, 86(2), 300-321.

Moura, R., Klein, M., Coelho, C., and Sinha, B. (2017). “Inference for Multivariate Regression Model Based on Synthetic Data Generated under Fixed-Posterior Predictive Sampling: Comparison with Plug-in Sampling,” REVSTAT – Statistical Journal, 15(2), 155-186.

Klein, M. and Sinha, B. (2016). “Likelihood Based Finite Sample Inference for Singly Imputed Synthetic Data under the Multivariate Normal and Multiple Linear Regression Models,” Journal of Privacy and Confidentiality, 7: 43-98.

Klein, M. and Sinha, B. (2015). “Inference for Singly Imputed Synthetic Data Based on Posterior Predictive Sampling under Multivariate Normal and Multiple Linear Regression Models,” Sankhya B: The Indian Journal of Statistics, 77-B, 293-311.

Klein, M. and Sinha, B. (2015). “Likelihood-Based Inference for Singly and Multiply Imputed Synthetic Data under a Normal Model,” Statistics and Probability Letters, 105, 168-175.

Klein, M. and Sinha, B. (2015). “Likelihood-Based Finite Sample Inference for Synthetic Data Based on Exponential Model,” Thailand Statistician: Journal of The Thai Statistical Association, 13, 33-47.

Klein, M., Mathew, T., and Sinha, B. (2014). “Noise Multiplication for Statistical Disclosure Control of Extreme Values in Log-normal Regression Samples,” Journal of Privacy and Confidentiality, 6, 77-125.

Klein, M., Mathew, T., and Sinha, B. (2014). “Likelihood Based Inference under Noise Multiplication,” Thailand Statistician: Journal of The Thai Statistical Association, 12, 1-23.

Klein, M. and Sinha, B. (2013). “Statistical Analysis of Noise Multiplied Data Using Multiple Imputation,” Journal of Official Statistics, 29, 425-465.

Klein, M. and Linton, P. (2013). “On a Comparison of Tests of Homogeneity of Binomial Proportions,” Journal of Statistical Theory and Applications, 12, 208-224.

Shao, J., Klein, M., and Xu, J. (2012). “Imputation for Nonmonotone Nonresponse in the Survey of Industrial Research and Development,” Survey Methodology, 38, 143-155.

Klein, M. and Wright, T. (2011). “Ranking Procedures for Several Normal Populations: An Empirical Investigation,” International Journal of Statistical Sciences, 11, 37-58.

Nayak, T.K., Sinha, B., and Zayatz, L. (2011). “Statistical Properties of Multiplicative Noise Masking for Confidentiality Protection,” Journal of Official Statistics, 27 (3), 527-544.

Sinha, B., Nayak, T.K., and Zayatz, L. (2011). “Privacy Protection and Quantile Estimation from Noise Multiplied Data,” Sankhya, Series B, 73, 297-315.

CSRM Research Reports, CSRM Studies, Proceedings Papers, and Other

Wright, T., Klein, M., and Wieczorek, J. (2014). “Ranking Populations Based on Sample Survey Data,” Center for Statistical Research and Methodology, Research and Methodology Directorate Research Report Series (Statistics #2014-12). U.S. Census Bureau. Available online: https://www.census.gov/srd/papers/pdf/rrs2014-12.pdf.

Klein, M., Lineback, J.F., and Schafer, J. (2014). “Evaluating Imputation Techniques in the Monthly Wholesale Trade Survey,” Proceedings of the Joint Statistical Meetings, Alexandria, VA: American Statistical Association.

Wright, T., Klein, M., and Wieczorek, J. (2013). “An Overview of Some Concepts for Potential Use in Ranking Populations Based on Sample Survey Data,” The 59th International Statistical Institute World Statistics Congress, Hong Kong, China.

Klein, M., Mathew, T., and Sinha, B. (2013). “A Comparison of Statistical Disclosure Control Methods: Multiple Imputation versus Noise Multiplication,” Research Report Series (Statistics #2013-02), Center for Statistical Research & Methodology, U.S. Census Bureau, Washington, D.C. Available online: https://www.census.gov/srd/papers/pdf/rrs2013-02.pdf.

Klein, M. and Creecy, R. (2010). “Steps toward Creating a Fully Synthetic Decennial Census Microdata File,” Proceedings of the Joint Statistical Meetings, Alexandria, VA: American Statistical Association.