The U.S. Census Bureau’s American Community Survey (ACS) provides a timely snapshot of America’s people and economy. It’s used in a myriad of ways to create value in the public and private sectors. ACS data helps determine how more than $400 billion in federal funds are distributed to state and local areas each year. It also helps nonprofits better understand the populations they serve, and helps businesses large and small decide where to invest to grow.
This is an important dataset, and it’s no surprise that data scientists are eager to tap into its insights. With a new tool funded by research from the National Science Foundation (NSF), it’s now easier than ever for developers to create apps using the survey’s data.
For some time, ACS data has been available in several different formats. But for data scientists who want the most granular, low-level view of the data, there was a gap. Until now, to get that low-level view, they needed to use the Public Use Microdata Sample (PUMS).
PUMS is stored in common tabular file formats, such as .csv. This format is simple and relatively easy to pass from person to person, but its structure can make it hard to use. For example, in the Housing file, a value of “4” in the “AGS” column really means that a housing unit has $2,500-$4,999 in yearly sales of agricultural products.
As a result, the PUMS has a steep learning curve. Its metadata is stored in a separate, human-readable data dictionary that you have to constantly reference to explore and make sense of the data. This limits what a computer can do with the data, because the concepts in the data dictionary aren’t in a computer-readable format.
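To make the lookup burden concrete, here is a minimal Python sketch of decoding a coded column by hand. It assumes a tiny, hypothetical data dictionary and made-up sample rows. Only the AGS code “4” mapping comes from the example above; the column name RECORD_ID, the sample values, and the function name are illustrative.

```python
import csv
import io

# Hypothetical excerpt of a data dictionary. Only the AGS code "4" mapping
# comes from the article; a real dictionary covers many more variables and codes.
DATA_DICTIONARY = {
    "AGS": {"4": "$2,500-$4,999 in yearly sales of agricultural products"},
}

# Made-up rows standing in for a PUMS Housing file (RECORD_ID is illustrative).
SAMPLE_CSV = "RECORD_ID,AGS\nh1,4\nh2,9\n"


def decode_row(row, dictionary):
    """Return a copy of the row with coded values replaced by labels where known."""
    decoded = {}
    for column, value in row.items():
        labels = dictionary.get(column, {})
        # Codes missing from the dictionary are passed through unchanged.
        decoded[column] = labels.get(value, value)
    return decoded


decoded_rows = [
    decode_row(row, DATA_DICTIONARY)
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV))
]

for row in decoded_rows:
    print(row)
```

Every analysis has to carry a mapping like this alongside the data, because the file itself says nothing about what a code means.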
Additionally, the full PUMS files are large. Attempting to open the full U.S. person record file in Excel will exceed the memory of most home computers. All of these issues impede users’ ability to share the data and combine it with other datasets.
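One common workaround for the memory problem, sketched here in Python as an assumption rather than anything the PUMS documentation prescribes, is to stream the file one row at a time instead of loading it whole. The sample data and the helper name count_values are hypothetical.

```python
import csv
import io
from collections import Counter


def count_values(lines, column):
    """Tally the values in one column, reading a single row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Made-up stand-in for a large file; in practice, pass an open file handle.
sample = io.StringIO("RECORD_ID,AGS\nh1,4\nh2,4\nh3,1\n")
ags_counts = count_values(sample, "AGS")
print(ags_counts)
```

This keeps memory use flat regardless of file size, but the result is still a tally of raw codes that has to be interpreted against the data dictionary.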
Finally, a big challenge with PUMS is its complexity. Among other hurdles, it contains:
To ameliorate these issues, data.world, a start-up based in Austin, Texas, created a method to make ACS data more consumable for data scientists. Through a grant from the NSF’s DataStart program, they brought on graduate research student Jonathan Ortiz to dive into this issue. Their work has converted ACS data into a graph database.
Here are the results:
Because the connections are explicitly stated in a standard format, machines can process this linked data better, just as web browsers can show you any web page without needing custom code for each one.
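As a rough illustration of what explicitly stated connections buy you, the Python sketch below stores facts as subject-predicate-object triples and follows links generically. The triple layout, the identifiers, and the function names are all assumptions made for illustration; the article does not show the vocabulary actually used in the linked ACS data.

```python
# Hypothetical identifiers; the real linked ACS vocabulary is not shown here.
TRIPLES = [
    ("housing/h1", "AGS", "code/AGS/4"),
    ("code/AGS/4", "label", "$2,500-$4,999 in yearly sales of agricultural products"),
]


def objects(triples, subject, predicate):
    """Return every object linked from a subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe(triples, subject, predicate):
    """Follow the link from a record to its coded value, then on to the label."""
    labels = []
    for code in objects(triples, subject, predicate):
        labels.extend(objects(triples, code, "label"))
    return labels


ags_labels = describe(TRIPLES, "housing/h1", "AGS")
print(ags_labels)
```

The same two functions work for any variable, because the meaning of each code travels with the data instead of living in a separate document.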
Now, developers can use the linked ACS to create apps that deliver vital information to U.S. residents. This research created a body of knowledge that helps data scientists get up to speed with ACS much faster. The open data team at Amazon has taken this NSF-funded research from data.world and made it available as a free resource on Amazon Web Services. Other companies are free to do the same.
We’ll feature this research in a session at the ACS User Conference in May 2017. To take part in the open community and the dialogue on furthering this data capability, please join the conversation here.