
How Technology Is Making it Possible to Build the Largest Dataset in History About the United States and the People Who Live Here

Linking data sheds light on where and why people move and on generational changes in social class and family structure. Thanks to new and improved technology, researchers now have access to more data than ever.

A recently published paper in IEEE Annals of the History of Computing, summarized in this blog, highlights the technological advances that have made it possible to create a longitudinal infrastructure of census data from 1850 to the present.

From 1790 to 1880, the U.S. Census Bureau recorded census data using manual counting processes. Only household-level information was captured prior to 1850. Starting in 1850, the names of all household members were recorded. In 1890, the Census Bureau began transferring information from response forms to punch cards, which were processed with new electro-mechanical tabulators. In the late 1930s, it began transferring responses to microfilm to better preserve them and to reduce the space needed to store them.

In 1960, the Census Bureau began using its then-new Film Optical Sensing Device for Input to Computers (FOSDIC) to read the “bubbles” from microfilmed images and create person-level data to be compiled by computers. This made counting and summarization faster. The Census Bureau later improved FOSDIC, increasing its data capture speed from 3,000 items per minute in 1960 to 70,000 per minute in 1990. Names were not digitally captured during these processes because capturing handwritten names was unnecessary for enumeration, expensive, and time-consuming. Without names, records cannot be reliably linked across censuses.


This changed with the 2000 Census, thanks in part to perhaps the greatest advance in census data capture: the development of optical character recognition (OCR) technology and optical mark recognition (OMR), the process of reading information people mark on surveys and other paper documents. The Census Bureau now scans forms into digital images, rather than onto microfilm, and reads the bubbles using OMR. It uses OCR to read written responses such as names and makes corrections via manual keying. The 2020 Census marked the first time most people filled out the census online, significantly reducing the need for paper forms and the creation of digital images. The 2000, 2010 and 2020 Censuses contain full names, so their data are already linkable at the person level. On approved projects, researchers access the restricted data in a secure research environment, where records are linked anonymously at the person level and names are removed.
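To make the idea of anonymous person-level linkage concrete, here is a minimal Python sketch. It is an illustration only and does not describe the Census Bureau's actual linkage method. The records, the field names, the `link_key` column, and the `link_and_anonymize` helper are all hypothetical, and exact matching on a normalized name plus birth year is a deliberate simplification.

```python
import secrets


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def link_and_anonymize(*censuses):
    """Assign one random key per matched person, then drop the names.

    Matching is exact on (normalized name, birth year). This is an assumed,
    simplified rule for illustration, not the real linkage procedure.
    """
    keys = {}  # (normalized name, birth year) -> anonymous linkage key
    released = []
    for records in censuses:
        rows = []
        for rec in records:
            identity = (normalize(rec["name"]), rec["birth_year"])
            if identity not in keys:
                # The key is random, so it reveals nothing about the name.
                keys[identity] = secrets.token_hex(8)
            row = {k: v for k, v in rec.items() if k != "name"}
            row["link_key"] = keys[identity]
            rows.append(row)
        released.append(rows)
    return released


# Made-up example records (not real census data).
census_2000 = [
    {"name": "Ada  Lopez", "birth_year": 1970, "state": "TX"},
    {"name": "Sam Reed", "birth_year": 1985, "state": "OH"},
]
census_2010 = [
    {"name": "ada lopez", "birth_year": 1970, "state": "CA"},
    {"name": "Jo Park", "birth_year": 1992, "state": "WA"},
]

file_2000, file_2010 = link_and_anonymize(census_2000, census_2010)

# A researcher sees only the keys, yet can still follow a person across censuses.
state_2000 = {row["link_key"]: row["state"] for row in file_2000}
moves = [
    (state_2000[row["link_key"]], row["state"])
    for row in file_2010
    if row["link_key"] in state_2000
]
print(moves)  # [('TX', 'CA')]
```

In this sketch the names are needed only while the keys are being assigned. The released files carry the key in place of the name, which is why a census captured without names cannot be linked in the same way.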

Over the past three decades, IPUMS has partnered with genealogical companies to digitally capture the 1790–1940 Censuses, including names starting in 1850, by manually keying entries from digital images of the microfilm reels. IPUMS (Integrated Public Use Microdata Series) is the world’s largest individual-level population database, with microdata samples from U.S. and international census records and surveys. Anyone can access the linked census files through the IPUMS internet archive, and qualified researchers with approved projects can access the 1940 Census (linked anonymously to the 2000–2020 Censuses) in the secure Federal Statistical Research Data Centers (FSRDCs). But there’s a catch: U.S. population data from the 1950–1990 Censuses cannot currently be linked, creating a gap.

The good news is that the Census Bureau is working to solve this problem. The National Archives and Records Administration recently released the 1950 Census following the expiration of the 72-year embargo on public release of decennial census records. IPUMS is currently working to add the 1950 Census data to the existing database. And thanks to funding from several sources and academic partnerships, the Decennial Census Digitization and Linkage Project (DCDL), led by the Census Bureau, is set to digitize and link the remaining censuses, from 1960 to 1990.

When combined with the historical censuses housed at IPUMS, these datasets will form a longitudinal historical U.S. census data infrastructure for research stretching back to the 1850 Census. The resulting restricted data, which omit names but carry unique anonymous linkage keys, will open the door for future generations of researchers to explore a wide range of topics. Witness the journey to preserve history by following the digitization process on the DCDL website.

Check out the paper to learn more about the evolution of data capture. 

Page Last Revised - April 20, 2023