Whoa, we somehow hit 50,000 subscribers! Since we probably have a bunch of relatively new people helping out, I thought I'd write up a quick overview of the international language and country codes people might see around our community, both as an introduction and as a refresher.
Language Codes
Every post on r/translator is assigned a language code from the ISO 639 standard. For those unfamiliar with it, ISO 639 is an international standard that assigns each language a consistent short code, and it is broadly used by international organizations, businesses, and websites. For example, go to any Wikipedia page for a language and you'll see its ISO 639 code(s) in the infobox at the top of the page.
We use two parts of the ISO 639 standard on r/translator: ISO 639-1 and ISO 639-3.
Note: The ISO 639 standard is not perfect, but it's the best system that exists to unambiguously classify languages and is widely used. Some broader issues with the system include:
- Language codes cannot be changed after their creation and may either:
  - Reflect a designation that is incorrect (ase "American Signed English" is American Sign Language, but ASL is not English)
  - Encode a pejorative for the people in a linguistic group (Morey et al. cite clk "Idu" < "Chulikata" (pejorative))
- Codes may reflect political divisions rather than purely linguistic ones. For example, Bosnian, Croatian, and Serbian are all mutually intelligible yet are defined separately in ISO 639-1, whereas all the Chinese and Arabic varieties (most of which are not mutually intelligible) are lumped under zh and ar respectively.
ISO 639-1 (two-letter language codes)
ISO 639-1 codes consist of two letters and are intended to cover the world's most-spoken languages. The vast majority of the requests on this subreddit are for the 184 languages that have an ISO 639-1 code, for example:
- ja - Japanese
- de - German
- ru - Russian
Please note that despite some similarities, these language codes are not necessarily identical to their country code counterparts, which were defined separately several decades ago. Examples below:
| ISO 639-1 (Language) Codes | ISO 3166-1 (Country) Codes |
|---|---|
| sq (Albanian) | AL (Albania) |
| bn (Bengali) | BD (Bangladesh) |
| ka (Georgian) | GE (Georgia) |
| ko (Korean) | KR (South Korea) |
| ja (Japanese) | JP (Japan) |
| vi (Vietnamese) | VN (Vietnam) |
| zh (Chinese) | CN (People's Republic of China) |
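To make the point concrete, here is a minimal Python sketch using only the pairs from the table above. The function names are made up for illustration, and the pairing itself is just the table's (a language is of course not tied to a single country). It checks the tempting shortcut of uppercasing a language code to get a country code:

```python
# Pairs copied from the table above:
# ISO 639-1 language code -> ISO 3166-1 country code.
LANGUAGE_TO_COUNTRY = {
    "sq": "AL",  # Albanian -> Albania
    "bn": "BD",  # Bengali -> Bangladesh
    "ka": "GE",  # Georgian -> Georgia
    "ko": "KR",  # Korean -> South Korea
    "ja": "JP",  # Japanese -> Japan
    "vi": "VN",  # Vietnamese -> Vietnam
    "zh": "CN",  # Chinese -> People's Republic of China
}


def naive_country_guess(language_code):
    """Hypothetical shortcut: assume the country code is the language code in caps."""
    return language_code.upper()


def mismatches(table):
    """Return the language codes whose uppercased form is NOT the listed country code."""
    return [
        lang for lang, country in table.items()
        if naive_country_guess(lang) != country
    ]


if __name__ == "__main__":
    # Every pair in the table fails the shortcut.
    print(mismatches(LANGUAGE_TO_COUNTRY))
```

For these seven pairs the shortcut fails every time, which is why it is safer to treat the two code sets as unrelated lookups.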
ISO 639-3 (three-letter language codes)
ISO 639-3 codes consist of three letters and are intended to cover all human languages, modern and ancient. There are almost 8,000 languages covered by this standard, so obviously most languages in this standard do not have an equivalent two-letter ISO 639-1 code. We tend to get fewer requests for languages that only have an ISO 639-3 code, but here are some of the more frequent examples seen here:
- yue - Cantonese
- grc - Ancient Greek
- lzh - Classical Chinese
- gsw - Swiss German
Please note that the three-letter codes also include some utility codes which don't reference a specific language, including art for artificial (constructed) languages (strictly speaking a collective code from ISO 639-2 rather than ISO 639-3) and zxx for something that isn't a language at all.
Script Codes
ISO 15924 (four-letter script codes)
ISO 15924 codes consist of four letters and are intended to cover writing systems, as distinct from the languages they represent. We don't use them much on r/translator; they mainly serve to classify "Unknown" posts where we don't know the language yet but do know the writing system.
To give an example, the Sanskrit (sa) mantra auṃ maṇi padme hūṃ may be represented with the following scripts:
| Script code | Mantra written in that script |
|---|---|
| Deva | ॐ मणि पद्मे हूँ |
| Latn | Oṃ Maṇi Padme Hūṃ |
| Tibt | ཨོཾ་མ་ཎི་པ་དྨེ་ཧཱུྃ |
| Hang | 옴 마니 반메 훔 |
| Hani | 唵嘛呢叭咪吽 |
The underlying language is still Sanskrit; it's just written with different writing systems.
Script codes, as mentioned above, are largely redundant for non-"Unknown" posts on r/translator. For example, there isn't any benefit to defining a Finnish post as fi-Latn (Finnish in the Latin script) since Finnish has always been written in the Latin script.
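For anyone who wants to tell these codes apart programmatically, here is a rough Python sketch. It only looks at the shape of a code (two lowercase letters, three lowercase letters, or four letters with a leading capital, as in the examples in this post) and does not check whether the code is actually assigned. The function names are invented for illustration and are not how the subreddit's tooling necessarily works.

```python
import re

# Shape checks only: these say what a code LOOKS like,
# not whether it is a real, assigned code.
_PATTERNS = {
    "ISO 639-1 language": re.compile(r"[a-z]{2}"),
    "ISO 639-3 language": re.compile(r"[a-z]{3}"),
    "ISO 15924 script": re.compile(r"[A-Z][a-z]{3}"),
}


def classify_code(code):
    """Return which standard a code's shape matches, or 'unrecognized'."""
    for label, pattern in _PATTERNS.items():
        if pattern.fullmatch(code):
            return label
    return "unrecognized"


def split_tag(tag):
    """Split 'fi-Latn' into ('fi', 'Latn'); a bare 'fi' gives ('fi', None)."""
    language, separator, script = tag.partition("-")
    if not classify_code(language).endswith("language"):
        raise ValueError(f"not a language code shape: {language!r}")
    # A hyphen must be followed by something shaped like a script code.
    if separator and classify_code(script) != "ISO 15924 script":
        raise ValueError(f"not a script code shape: {script!r}")
    return language, (script if separator else None)


if __name__ == "__main__":
    for example in ("ja", "yue", "Hani", "fi-Latn"):
        if "-" in example:
            print(example, "->", split_tag(example))
        else:
            print(example, "->", classify_code(example))
```

A real implementation would look codes up in the official code tables instead of trusting their shape, since plenty of two- and three-letter combinations are unassigned.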
ISO 639-3 is derived from the Ethnologue. The authors of this work have a certain point of view on the definition of "language" (vs. "dialect"), with a strong tendency to split languages into smaller units. Note that the authors of the Ethnologue have an open agenda that is different from doing science. Divide and conquer.
Are the languages spoken in the various Arab countries actually mutually intelligible? If not, then it makes more sense to regard them as separate languages.
In China the government likes to officially categorize the various Chinese languages as "dialects", but in reality the difference between some of them is huge, e.g. comparing Mandarin and Cantonese is more like comparing Italian with Spanish than American English with British English. Most Mandarin speakers neither understand nor speak Cantonese at all. That is why there are separate language codes for the various Chinese languages, and that makes sense to me.
I wonder whether the situation is similar in Arabic: if you can't even understand some of the North African varieties of Arabic, then how can you claim they're the "same language"? Some people might try to do so politically, but linguistically it would be far-fetched. The situation is just fundamentally different from American vs. British English, I suppose.