As part of its decennial census, the Census Bureau must maintain and update every address within the United States and its territories. These addresses help formulate policies and allocate valuable resources from the federal government. For the 2020 Census, in-office staff manually canvassed address coverage in every block. While this process was effective, it was also costly and time-consuming. To aid the Census Bureau in labeling and classifying blocks, we propose a machine learning approach based on semi-supervised learning. We present a robust machine learning solution that improves both data labeling and classification of parcel data, enabling new data-driven insight while reducing the cost and effort of data assessment. Toward this goal, we employ an active-learning scheme to make accurate and precise classifications using the <1% (~50,000) of blocks that are labeled, out of the 8,000,000+ blocks in the country. We use multiple machine learning models, including Logistic Regression, Random Forest, Gradient Boosting, Extreme Gradient Boosting, Light Gradient Boosting, and Categorical Boosting, each trained on the smaller set of labeled data, to make predictions on unlabeled data. The predictions from all models are then compared to pinpoint the blocks on which the models disagree. These blocks are forwarded to human labelers, who make the final decision. Once a subset of predicted data has been validated by human labelers, it is added to the training data before predictions are made on the next subset. We also discuss the challenges of working with real-world data at this scale, such as class imbalance, data completeness, and data integrity.
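The disagreement-driven loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`split_by_agreement`, `active_learning_round`), the `train_fns` mapping, and the `ask_human` callback are all hypothetical, and the sketch assumes that blocks on which every model agrees keep the shared prediction while only disputed blocks go to a human labeler.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence, Tuple

# Hypothetical types: a trained model maps one block's features to a label,
# and a training function builds such a model from labeled data.
Model = Callable[[Any], Hashable]
TrainFn = Callable[[List[Any], List[Hashable]], Model]


def split_by_agreement(
    predictions: Dict[str, Sequence[Hashable]],
) -> Tuple[List[int], List[int]]:
    """Return (agreed, disputed) indices given per-model prediction lists."""
    columns = list(predictions.values())
    if not columns:
        raise ValueError("need predictions from at least one model")
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all models must predict the same number of blocks")
    agreed: List[int] = []
    disputed: List[int] = []
    for i in range(n):
        # A block is disputed when the models do not all give the same label.
        votes = {col[i] for col in columns}
        (agreed if len(votes) == 1 else disputed).append(i)
    return agreed, disputed


def active_learning_round(
    train_fns: Dict[str, TrainFn],
    labeled_x: List[Any],
    labeled_y: List[Hashable],
    batch_x: Sequence[Any],
    ask_human: Callable[[Any], Hashable],
) -> Tuple[List[Any], List[Hashable], List[int]]:
    """Run one round: train, predict, route disputes, grow the training set."""
    # Train every model on the current labeled set.
    models = {name: fit(labeled_x, labeled_y) for name, fit in train_fns.items()}
    # Predict the next unlabeled subset with each model.
    predictions = {
        name: [model(x) for x in batch_x] for name, model in models.items()
    }
    agreed, disputed = split_by_agreement(predictions)
    # Assumption: unanimous predictions are accepted as labels.
    reference = next(iter(predictions.values()))
    labels = {i: reference[i] for i in agreed}
    # Disputed blocks are sent to the (hypothetical) human labeler.
    for i in disputed:
        labels[i] = ask_human(batch_x[i])
    # The validated subset joins the training data for the next round.
    new_x = list(labeled_x) + list(batch_x)
    new_y = list(labeled_y) + [labels[i] for i in range(len(batch_x))]
    return new_x, new_y, disputed
```

Calling `active_learning_round` repeatedly, feeding each round's returned training set into the next, mirrors the subset-by-subset process in the abstract. In practice the entries of `train_fns` would wrap real classifiers such as those listed above; they are left abstract here so the sketch carries no library dependencies.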