Random Samplings: Unlocking the American Community Survey for Data Scientists

Sat Mar 11 2017
Written by: Jeff Meisel, Census Chief Marketing Officer and Zach Whitman, Census Chief Data Officer

The U.S. Census Bureau’s American Community Survey (ACS) provides a timely snapshot of America’s people and economy. It’s used in a myriad of ways to create value in the public and private sectors. ACS data helps determine how more than $400 billion in federal funds are distributed to state and local areas each year, helps nonprofits better understand the populations they serve, and helps small and big businesses decide where to invest to grow.

This is an important dataset, and it’s no surprise that data scientists are eager to tap into its insights. With a new tool funded by research from the National Science Foundation (NSF), it’s now easier than ever for developers to create apps using the survey’s data.

What Data Scientists Want From the American Community Survey

For some time, ACS data has been available in several different formats. But data scientists who want the most granular, low-level view of the data faced a gap: until now, the only way to get that view was the Public Use Microdata Sample (PUMS).

PUMS is stored in common, tabular data files, like .csv. This format is compact, lightweight, and therefore relatively easy to transfer from person to person, but its structure can make it hard to use. For example, in the Housing file, a value of “4” in the “AGS” column really means that a housing unit has $2,500-$4,999 in yearly sales of agricultural products.
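
To make the coded-value problem concrete, here is a minimal Python sketch of the lookup a user has to perform by hand. Only the meaning of code “4” in the “AGS” column comes from the example above; the dictionary structure and the `decode` helper are hypothetical and are not part of any Census tool.

```python
# Hypothetical excerpt of a data dictionary. Only code "4" for AGS is taken
# from the article; a real dictionary would list every code for every column.
AGS_LABELS = {
    "4": "$2,500-$4,999 in yearly sales of agricultural products",
}

DATA_DICTIONARY = {"AGS": AGS_LABELS}


def decode(column, raw_value, dictionary):
    """Return the human-readable label for a coded value, or a placeholder."""
    labels = dictionary.get(column, {})
    return labels.get(raw_value, f"(no label for {column}={raw_value})")


print(decode("AGS", "4", DATA_DICTIONARY))
```

Without a machine-readable dictionary like this one, every such lookup has to be done by reading the separate documentation.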

As a result, the PUMS has a steep learning curve. Its metadata is stored in a separate, human-readable data dictionary that you have to constantly reference to explore and make sense of the data. This limits what a computer can do with the data, because the concepts in the data dictionary aren’t in a computer-readable format.

Additionally, PUMS files are bulky. Attempting to open the full U.S. person record file in Excel will exceed the memory in most home computers. All of these issues impede users’ ability to share the data and combine it with other datasets.
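
One common workaround for file size, sketched here under the assumption that the data is a plain .csv with a header row, is to stream the file one row at a time instead of loading it whole. The sample rows below are made up for illustration, and `count_values` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def count_values(lines, column):
    """Tally the values of one column while holding only one row in memory."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Tiny made-up stand-in for a PUMS housing file; a real file would be opened
# with open(path, newline="") and passed in the same way.
sample = io.StringIO("SERIALNO,AGS\n1,4\n2,4\n3,1\n")
counts = count_values(sample, "AGS")
print(counts)
```

This keeps memory use flat, but it does nothing about the coded values themselves, which still need the data dictionary to interpret.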

Finally, a big challenge with PUMS is its complexity. Among other hurdles, it contains:

  • Coded properties in which integer values in the raw data actually represent real world concepts or relationships to concepts.
  • Nested properties, like multiple levels of race and ethnicity that depend on each other.
  • Combined properties, including how the employment status of parents and their living arrangement are combined into one property.
  • Mixed properties where the data show coded relationships intermixed with real values, as in the case of income.
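
As an illustration of the last hurdle, the sketch below separates coded entries from real numeric values in a mixed column. The codes in `INCOME_CODES` are hypothetical placeholders chosen for this example; they are not taken from the PUMS data dictionary.

```python
# Hypothetical codes for illustration only; the real PUMS dictionary defines
# its own special values for each column.
INCOME_CODES = {
    "": "not applicable",
    "0": "none",
}


def parse_mixed(raw, coded_labels):
    """Split a mixed column value into ('code', label) or ('amount', number)."""
    if raw in coded_labels:
        return ("code", coded_labels[raw])
    # Anything that is not a known code is treated as a real numeric value.
    return ("amount", int(raw))


print(parse_mixed("", INCOME_CODES))
print(parse_mixed("52000", INCOME_CODES))
```

Every consumer of the raw file has to write this kind of special-case logic, column by column.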

Case Study: Deliverables From the NSF DataStart Program via data.world

To ameliorate these issues, data.world, a start-up based in Austin, Texas, created a method to make ACS data more consumable for data scientists. Through a grant from the NSF’s DataStart program, they brought on graduate research student Jonathan Ortiz to dive into this issue. Their work has converted ACS data into a graph database.

Here are the results:

  • Data that’s linked to the concepts, relationships and semantic meaning contained in the PUMS data dictionary. This linkage spans years and accounts for differences in the PUMS schema over time, so all metadata for all years is located in one place.
  • The ability to transform raw .csv to linked data using a series of SPARQL updates and a Java importer developed by data.world. Real-world concepts contained in the PUMS are now linked to external data sources like DBpedia and Wikidata to make it easier for developers to use the data.
  • Because correlations are explicit in the data, interoperability is higher and discovery is easier. Instead of encumbering users with a complex data dictionary, we built the metadata into the data itself. Instead of unwieldy raw .csv data files, we put everything into a secure, queryable database.
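
The general idea of turning a tabular row into linked data can be sketched in a few lines of Python. This is not data.world’s importer, which the article describes as a Java tool combined with SPARQL updates; the `http://example.org/acs/` namespace, the `row_to_triples` helper and the concept mapping are all hypothetical, and the sketch assumes values contain no quotes or other characters that would need escaping.

```python
BASE = "http://example.org/acs/"  # hypothetical namespace, not a real endpoint


def row_to_triples(row, key_column, concept_uris):
    """Emit one subject-predicate-object statement per non-key column.

    Coded values that have a known concept URI become links; everything
    else is kept as a plain literal.
    """
    subject = f"<{BASE}record/{row[key_column]}>"
    triples = []
    for column, value in row.items():
        if column == key_column:
            continue
        predicate = f"<{BASE}property/{column}>"
        concept = concept_uris.get((column, value))
        obj = f"<{concept}>" if concept else f'"{value}"'
        triples.append(f"{subject} {predicate} {obj} .")
    return triples


# Made-up row and mapping for illustration.
row = {"SERIALNO": "1", "AGS": "4"}
concepts = {("AGS", "4"): BASE + "concept/AGS/4"}
for line in row_to_triples(row, "SERIALNO", concepts):
    print(line)
```

Once a coded value points at a concept rather than a bare integer, the meaning travels with the data and no separate lookup is needed.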

Because the connections are explicitly stated in a standard format, machines can process this linked data better — just as web browsers can show you any web page without needing custom code for each one.

How to Access This Public Good

Now, developers can use the linked ACS to create apps that deliver vital information to U.S. residents. This research created a body of knowledge that helps data scientists get up to speed with ACS much faster. The open data team at Amazon has taken this NSF-funded research from data.world and made it available as a free resource on Amazon Web Services. Other companies are free to do the same.

We’ll feature this research in a session at the ACS User Conference in May 2017. To join the open community and have a dialogue on furthering this data capability, please join the conversation here.
