Inside the American Community Survey: 2016 Language Data Overhaul

September 14, 2017

Written by:

Christine Gambino, Social, Economic and Housing Statistics Division

The American Community Survey is the most widely used source of language data for the United States. It is the only survey that provides data on language and English-speaking ability at the local community level. Its language data serve a diverse group of users, including:

Government agencies complying with laws protecting non-English language speakers or planning for future language resource needs.
Businesses that use language data to make marketing plans or hiring decisions.
Linguists, demographers and other researchers.

To better serve these users, the U.S. Census Bureau in 2016 improved how language data are coded and presented. These changes do not create a break in the time series. The presentation of the data has changed in some tables, and comparability guidance is provided for comparing estimates over time.

The goals of this overhaul included:

Producing more precise and granular language data without placing additional burden on respondents or data collection operations.
Providing better data products for a variety of users across different levels of detail.
Conforming language data to the industry standards most widely used by translators.
Maintaining comparability over time with past language data, whenever possible.

More Precise Coding and More Codes

Beginning with American Community Survey 1-year (2016) and 5-year (2012-2016) data, specific languages are coded in accordance with the International Organization for Standardization’s ISO 639-3 standard. This enables the Census Bureau to present languages in terms that linguists and translators understand, and helps ensure that the specificity and precision of the data coding process are the same whether a language has millions of speakers or just a few.

The language questions that now appear on the American Community Survey were originally devised for the 1980 Census and have remained unchanged ever since. Data from the 1980 Census were also used to create a detailed language code list that was later adopted for use in the survey. Some codes have been added over the years, but the structure of the language list has largely reflected what the population looked like in 1980.

To expand the original language code list, Census Bureau researchers analyzed the write-in answers for every language reported on the 1980, 1990 and 2000 Censuses and on the American Community Survey since 2001. This process added more than 900 new language codes to the survey data. Table 1 shows the impact of this update, listing the number of codes in 2015 and in 2016 for each language family or world region.

Table 1.

Number of Language Codes and Speakers by Language Category in the 2015 and 2016 American Community Survey Data

Language Category		Number of Language Codes in 2015 ACS	Number of Language Codes in 2016 ACS	Number of Speakers in 2016	Margin of Error (+/-)
Spanish		3	4	40,489,813	106,752
Other Indo-European languages		79	173	11,090,060	86,378
	Primarily spoken in Europe	62	88
	Primarily spoken in Southern and Western Asia	17	85
Asian and Pacific Island languages		91	280	10,604,324	63,476
	Asian continent	44	168
	Pacific Islands (including Philippine languages)	47	112
Other languages		211	876	3,334,741	49,897
	Afro-Asiatic (including Semitic) languages	7	67
	Languages of Western Africa	6	189
	Languages of Central, Eastern and Southern Africa	10	204
	Native North American languages	161	217
	Central and South American languages	12	87
	Creole languages	6	40
	Other (including Uralic and Caucasian languages and language isolates)	9	72
Total		384	1,333	65,518,938	154,156
Source: U.S. Census Bureau, 2016 American Community Survey 1-year estimates.
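
The code counts in Table 1 can be cross-checked with simple arithmetic. The Python sketch below is illustrative only and is not a Census Bureau tool. It copies the four top-level rows of Table 1 into a dictionary (the variable and function names are arbitrary) and computes how many codes each category gained between 2015 and 2016.

```python
# Code counts copied from the four top-level rows of Table 1:
# category -> (codes in 2015 ACS, codes in 2016 ACS)
code_counts = {
    "Spanish": (3, 4),
    "Other Indo-European languages": (79, 173),
    "Asian and Pacific Island languages": (91, 280),
    "Other languages": (211, 876),
}


def codes_added(counts):
    """Return {category: number of codes added between 2015 and 2016}."""
    return {name: after - before for name, (before, after) in counts.items()}


added = codes_added(code_counts)
total_2015 = sum(before for before, _ in code_counts.values())
total_2016 = sum(after for _, after in code_counts.values())

for name, gain in added.items():
    print(f"{name}: +{gain}")
print(f"Total: {total_2015} -> {total_2016} (+{total_2016 - total_2015})")
```

The top-level rows sum to the totals in the table (384 and 1,333), a net gain of 949 codes, which is consistent with the addition of more than 900 codes described earlier. The "Other languages" category accounts for 665 of them.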

Many languages already on the code list needed updated definitions or labels that were more transparent to users. For example, in 2015, all speakers of French-based Creole languages (863,449 people) were coded as “French Creole.” This label generated considerable interest, and some confusion, about which languages made up the group. The majority of French Creole speakers in the United States speak Haitian, and the 2016-based Census Bureau products now provide an estimate of the number of Haitian speakers (856,009).

Even though they had small numbers of speakers, most Native North American languages were already on the code list, as were many creole languages native to the United States (e.g., Gullah, Cajun and Hawaiian Pidgin). Increased immigration from Asia, the Pacific Islands and Africa has led to growth in languages from those areas. Most of the languages added to the list were from those world regions, and Africa received the most new codes.

Prior to 2016, most languages of Africa were aggregated into codes like “Kru, Ibo, Yoruba,” which reflected a geographical area stretching from Liberia to Nigeria, or codes such as “Cushite” which described a language family rather than an individual language such as Somali. This lack of specificity in the previous list is attributable to the fact that in 1980, very few people in the United States spoke a language from Africa other than Arabic — only about 50,000 foreign-born speakers and an unknown number of native-born speakers. In 2000, when the Census Bureau first published data on African languages for the total population, there were 418,505 speakers of a language of Africa (excluding Arabic). By 2010, this estimate had approximately doubled to 862,441, and in 2016 there were 1,183,453 speakers of a language of Africa (excluding Arabic).
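
The growth described above can be made concrete by computing ratios between the three estimates. The short Python sketch below uses only the figures cited in the preceding paragraph. The names are illustrative, and the ratios compare point estimates only, ignoring sampling error.

```python
# Speakers of a language of Africa (excluding Arabic), as cited in the text.
speakers = {2000: 418_505, 2010: 862_441, 2016: 1_183_453}


def growth_ratio(earlier, later):
    """Ratio of the later estimate to the earlier one."""
    return later / earlier


ratio_2000_2010 = growth_ratio(speakers[2000], speakers[2010])
ratio_2010_2016 = growth_ratio(speakers[2010], speakers[2016])

print(f"2000 to 2010: x{ratio_2000_2010:.2f}")  # x2.06
print(f"2010 to 2016: x{ratio_2010_2016:.2f}")  # x1.37
```

A ratio of about 2.06 matches the "approximately doubled" description for 2000 to 2010, and the 2016 estimate is about 37 percent higher than the 2010 estimate.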

Updated Data Products

The data tables provided by the Census Bureau will not give separate estimates for all 1,333 languages. Instead, languages will be grouped into recognizable language groups large enough for us to provide reliable estimates. With the release of 2016 American Community Survey statistics, revised versions of the detailed language tables, which list 42 languages and language groupings, are available on American FactFinder. In addition, we have added a new and expanded household language table. The tables now contain estimates for Haitian, Punjabi, Bengali, Telugu and Tamil. Haitian previously fell under “French Creole,” Punjabi and Bengali under “Other Indic languages,” and Telugu and Tamil under “Other Asian languages.”

In October 2017, the Public Use Microdata Sample (PUMS) dataset will contain new codes for 36 languages that were not previously available in Census Bureau data. For the first time, American Community Survey data will have codes for Akan (including Twi), Igbo, Oromo, Somali, Tigrinya, Wolof, Yoruba and other African languages that were previously grouped together. Those and other languages from around the world, from Dari to Marshallese, have been added to the 2016 PUMS datasets to enable custom tabulations for languages spoken by individuals and in households.
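
A custom tabulation from microdata is, at its core, a weighted count: each person record carries a weight, and the estimate for a language is the sum of the weights of the records reporting it. The Python sketch below illustrates the idea under stated assumptions. It assumes the records have already been loaded as dictionaries and that LANP (language spoken at home) and PWGTP (person weight) are the relevant PUMS variable names; check the data dictionary for the release you use. The records, code values and weights shown are invented placeholders, not real PUMS data or real code-list values.

```python
from collections import defaultdict

# Hypothetical person records. LANP and PWGTP are assumed to be the PUMS
# variable names for language spoken at home and person weight; the code
# values ("A01", "B02") and the weights are invented for illustration.
records = [
    {"LANP": "A01", "PWGTP": 120},
    {"LANP": "B02", "PWGTP": 85},
    {"LANP": "A01", "PWGTP": 40},
    {"LANP": "B02", "PWGTP": 15},
    {"LANP": "A01", "PWGTP": 60},
]


def weighted_speakers(rows, code_field="LANP", weight_field="PWGTP"):
    """Sum person weights by language code to estimate speakers."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[code_field]] += row[weight_field]
    return dict(totals)


estimates = weighted_speakers(records)
for code, total in sorted(estimates.items()):
    print(code, total)
```

A real tabulation would also read the PUMS file, look up labels in the published code list, and use the replicate weights to compute margins of error. None of that is shown here.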

American Community Survey estimates are based on sample data and are subject to survey error. For more information on confidentiality protection, sampling error and nonsampling error, and definitions, visit <www.census.gov/programs-surveys/acs/technical-documentation/code-lists.html>.
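
As one illustration of what sampling error means in practice, the margins of error in Table 1 can be turned into a range around each estimate. The Python sketch below assumes that the published margins of error are at the 90 percent confidence level, so the standard error is approximated as the margin of error divided by 1.645. The function names are illustrative.

```python
# Estimate and margin of error for Spanish speakers, copied from Table 1.
ESTIMATE = 40_489_813
MOE = 106_752

# Assumption: published ACS margins of error use a 90% confidence level.
Z_90 = 1.645


def confidence_interval(estimate, moe):
    """Return the (lower, upper) bounds implied by the margin of error."""
    return estimate - moe, estimate + moe


def coefficient_of_variation(estimate, moe, z=Z_90):
    """Standard error (MOE / z) as a percentage of the estimate."""
    return (moe / z) / estimate * 100


low, high = confidence_interval(ESTIMATE, MOE)
cv = coefficient_of_variation(ESTIMATE, MOE)

print(f"Interval: {low:,} to {high:,}")
print(f"CV: {cv:.2f}%")
```

For Spanish, the interval runs from 40,383,061 to 40,596,565, and the coefficient of variation is about 0.16 percent, which indicates a very precise estimate. Smaller language groups will generally have much larger relative error.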

###

Page Last Revised - December 16, 2021
