Researching Methods for Scraping Government Tax Revenue From the Web

The Quarterly Summary of State and Local Government Tax Revenue is a sample survey conducted by the U.S. Census Bureau that collects data on tax revenue collections from state and local governments. Much of the data are publicly available on government websites. In fact, instead of responding via questionnaire, some respondents direct survey analysts to their websites to obtain the data. Going directly to websites for those data can reduce respondent burden and aid data review.

It would be useful to have a tool that automatically collects, or scrapes, relevant data from the web. Developing such a tool can be challenging. There are thousands of government websites but very little standardization in how their data are structured and published. A large majority of government publications are in Portable Document Format (PDF), a file type not easily analyzed. Finally, both web and PDF documents have constantly changing formats.

To solve this problem, researchers at the Census Bureau are studying and applying methods for unstructured data, text analytics and machine learning. These methods belong to the realm of “Big Data.” Big Data refers to large and frequently generated datasets representing a variety of structures. As opposed to designed survey data, Big Data are “found” or “organic” data. Typically, these data are created for one purpose, such as a click log, a social media blog or an online PDF report, then innovatively repurposed for something else, such as inferring behavior. Because the data were not designed with that inference in mind, they often present unique challenges.

The goal of this research is to develop a web crawler with machine learning that performs three tasks:

  1. Crawls through a government website and discovers all PDFs.
  2. Classifies each PDF according to whether it contains relevant data on tax revenue collections.
  3. Extracts the relevant data, organizes them and stores them in a database.

For task 1, we used Apache Nutch, an open-source web crawler. In a production environment, the process will scale up by distributing the work over many computers and then combining the results.
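The production crawl uses Apache Nutch, but the core of task 1 can be illustrated with a short, self-contained sketch: starting from a seed page, follow links within the same government domain and collect every URL that ends in .pdf. The seed URL, depth limit and libraries below are illustrative assumptions, not the survey's actual crawl configuration.

```python
"""Minimal sketch of task 1: discover PDF links on a government website.

An illustrative stand-in for the Apache Nutch crawl described above, not
the production setup. The seed URL and depth limit are hypothetical.
"""
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SEED_URL = "https://revenue.example.gov/reports/"  # hypothetical seed page
MAX_DEPTH = 2                                      # small limit for the sketch


def crawl_for_pdfs(seed_url: str, max_depth: int = MAX_DEPTH) -> set[str]:
    """Breadth-first crawl within the seed's domain, collecting PDF links."""
    domain = urlparse(seed_url).netloc
    to_visit = [(seed_url, 0)]
    seen_pages, pdf_links = set(), set()

    while to_visit:
        url, depth = to_visit.pop(0)
        if url in seen_pages or depth > max_depth:
            continue
        seen_pages.add(url)

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail to load

        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc != domain:
                continue  # stay on the government site
            if link.lower().endswith(".pdf"):
                pdf_links.add(link)
            else:
                to_visit.append((link, depth + 1))

    return pdf_links


if __name__ == "__main__":
    for pdf in sorted(crawl_for_pdfs(SEED_URL)):
        print(pdf)
```

In production, Nutch performs this same fetch-parse-follow cycle, with the work distributed over many computers and the results combined afterward.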

For task 2, we developed a technique to convert PDF documents to text and reorganize the output. A classification model applied to the converted text determines whether the document contains relevant data on tax revenue collections. The model uses the occurrence of key sequences of words, such as “statistical report” and “sales tax income,” along with other text analysis techniques.
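As a rough illustration of task 2, the sketch below converts a PDF to text and scores it with a simple word-sequence (n-gram) model trained on labeled examples. The libraries, key phrases, training texts and labels are assumptions chosen for the example; they are not the Census Bureau's actual classifier or training data.

```python
"""Minimal sketch of task 2: classify a converted PDF as relevant or not.

The training texts and labels below are illustrative assumptions; the
actual model and features are still being refined.
"""
from pdfminer.high_level import extract_text        # pip install pdfminer.six
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: text from PDFs already reviewed by analysts.
train_texts = [
    "statistical report of sales tax income by county for the quarter",
    "monthly statistical report on motor fuel tax collections",
    "agenda and minutes of the county board meeting",
    "press release announcing a new public library branch",
]
train_labels = [1, 1, 0, 0]  # 1 = contains relevant tax revenue data

# One- and two-word sequences capture phrases such as "statistical report"
# and "sales tax" rather than isolated words.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)


def classify_pdf(path: str) -> bool:
    """Convert a PDF to text and predict whether it contains relevant data."""
    text = extract_text(path)
    return bool(model.predict([text])[0])


if __name__ == "__main__":
    print(classify_pdf("quarterly_tax_report.pdf"))  # hypothetical file name
```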

For task 3, we are considering various ideas. Relevant data would probably be found in tables and in close proximity to key sequences of words. We will explore table identification methods based on the distribution of terminology in the PDF and additional modeling that maps the nonstandard data in PDFs to standard definitions in Census Bureau publications.
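One of the ideas above, that relevant figures tend to sit in table-like rows near key sequences of words, can be sketched with a simple heuristic: scan the extracted text for lines that are dense with numbers and appear shortly after a key phrase. The phrases and thresholds below are placeholders for illustration, not a selected method.

```python
"""Toy sketch of one idea for task 3: find table-like lines near key phrases.

The key phrases and thresholds are assumptions for illustration only; the
actual table-identification approach is still under study.
"""
import re

KEY_PHRASES = ("sales tax", "statistical report", "tax collections")
WINDOW = 5          # number of lines to scan after a key phrase
MIN_NUMBERS = 3     # a line with several numbers looks like a table row


def candidate_table_rows(text: str) -> list[str]:
    """Return numeric-dense lines that appear shortly after a key phrase."""
    lines = text.splitlines()
    rows = []
    for i, line in enumerate(lines):
        if not any(phrase in line.lower() for phrase in KEY_PHRASES):
            continue
        for candidate in lines[i + 1 : i + 1 + WINDOW]:
            # Match numbers of at least two characters, e.g. "1,204,551".
            numbers = re.findall(r"\d[\d,]+(?:\.\d+)?", candidate)
            if len(numbers) >= MIN_NUMBERS:
                rows.append(candidate.strip())
    return rows


if __name__ == "__main__":
    sample = (
        "Statistical Report of Sales Tax Collections\n"
        "County        Q1         Q2         Q3\n"
        "Adams      1,204,551  1,180,300  1,255,020\n"
    )
    print(candidate_table_rows(sample))
```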

The Census Bureau looks forward to continuing this web scraping research and exploring new machine learning algorithms that reduce respondent burden, speed survey processing and improve data collection.

To learn more about the research methods for scraping government tax revenue from the web, please join us at the Joint Statistical Meetings on August 2, 2016.
