A Quick Tutorial on Various State (Country) Classification Systems
My graduate studies program director asked me to teach an independent study for a graduate student this semester. The goal is to better train the student for their research agenda beyond what I could plausibly teach them in a given semester.[1] Toward that end, I’m going to offer most (if not all) of the independent study sessions as posts on my blog. This should help the student and possibly help others who stumble onto my website. Going forward, I’m probably just going to copy-paste this introduction for future posts for this independent study.
The particular student is pursuing a research program in international political economy. Substantively, much of what they want to do is outside my wheelhouse. However, I can offer some things to help the student with their research. The first lesson will be about various state (country) classification systems.
Here’s a table of contents for what follows.
- The Issue: There Are So Many Different Classification Systems!
- Identify a Temporal Domain for a Cross-National Analysis (Because State Codes Change Over Time)
- Make One Classification System a “Master”, and Don’t Use the Country Name
- Use R to Create a Panel of States (and States over Time)
The Issue: There Are So Many Different Classification Systems!
It should not shock a graduate student in political science/policy analysis to learn that there is no universal standard for state classification. Indeed, various data sources and agencies will have varying definitions of what territorial unit counts as a state for classification purposes. Each data source/agency will have a different coding scheme as well.
Take, for example, the following classification systems. The first, Correlates of War (CoW), leans on integers that range from 2 (the United States) to 990 (Samoa) to code states from 1816 to 2016. The second, the Gleditsch-Ward system, is a slight variation on the CoW system. The overlap is substantial and the numerical range is effectively the same, but important distinctions emerge as Gleditsch-Ward interpret independent states differently. The third is the set of two-character and three-character codes provided by the International Organization for Standardization (ISO) 3166 Maintenance Agency, one that Americans will at least recognize as having tight integration with the American National Standards Institute as well as broad use elsewhere. The fourth is the United Nations’ M49 classification system. The fifth is the Geopolitical Entities, Names, and Codes (GENC) Standard (in both two-character and three-character form), which provides names and codes for U.S. recognized entities and subdivisions. GENC supplanted the Federal Information Processing Standard (FIPS) about 10 years ago for this purpose. To round things out, we’ll include the Eurostat classification system (which greatly resembles ISO’s two-character code), the FIPS codes (which also look a lot like ISO’s two-character code), and the World Bank code (which is very similar to but slightly incompatible with ISO’s three-character code).
Here is how a few territorial units are coded, selected on whether their English country name starts with “T” and as these codes appear in the {countrycode} package.
Country Name | CoW Code | Gleditsch-Ward Code | ISO (2) | ISO (3) | UN M49 | GENC (2) | GENC (3) | Eurostat | FIPS | World Bank |
---|---|---|---|---|---|---|---|---|---|---|
Taiwan | 713 | 713 | TW | TWN | | TW | TWN | TW | TW | TWN |
Tajikistan | 702 | 702 | TJ | TJK | 762 | TJ | TJK | TJ | TI | TJK |
Tanzania | 510 | 510 | TZ | TZA | 834 | TZ | TZA | TZ | TZ | TZA |
Thailand | 800 | 800 | TH | THA | 764 | TH | THA | TH | TH | THA |
Timor-Leste | 860 | 860 | TL | TLS | 626 | TL | TLS | TL | TT | TLS |
Togo | 461 | 461 | TG | TGO | 768 | TG | TGO | TG | TO | TGO |
Tokelau | | | TK | TKL | 772 | TK | TKL | TK | TL | |
Tonga | 955 | | TO | TON | 776 | TO | TON | TO | TN | TON |
Trinidad & Tobago | 52 | 52 | TT | TTO | 780 | TT | TTO | TT | TD | TTO |
Tunisia | 616 | 616 | TN | TUN | 788 | TN | TUN | TN | TS | TUN |
Turkey | 640 | 640 | TR | TUR | 792 | TR | TUR | TR | TU | TUR |
Turkmenistan | 701 | 701 | TM | TKM | 795 | TM | TKM | TM | TX | TKM |
Turks & Caicos Islands | | | TC | TCA | 796 | TC | TCA | TC | TK | TCA |
Tuscany | 337 | | | | | | | | | |
Tuvalu | 947 | | TV | TUV | 798 | TV | TUV | TV | TV | TUV |
Two Sicilies | 329 | | | | | | | | | |
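A table like the one above can be approximated from the codelist data frame that ships with {countrycode}. This is only a sketch: it assumes the column names below (cown, gwn, un, genc2c, and so on) match the installed version of the package, and the exact rows returned may differ across versions.

```r
library(countrycode)

# Columns assumed to hold each classification system in `codelist`.
systems <- c("country.name.en", "cown", "gwn", "iso2c", "iso3c",
             "un", "genc2c", "genc3c", "eurostat", "fips", "wb")

# Keep only units whose English name starts with "T".
t_units <- codelist[grepl("^T", codelist$country.name.en), systems]

t_units
```

Missing values (NA) in a column mean the system in question has no code for that unit.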
It can be a bit daunting to see so many differences among these classification systems. With that in mind, I recommend that a student (in particular, my student this semester) do the following.
Identify a Temporal Domain for a Cross-National Analysis (Because State Codes Change Over Time)
My student is interested in a cross-national analysis of a group of states—regionally or globally, I can’t yet tell—with respect to a host of financial indicators. The extent to which the analysis involves financial indicators means the temporal domain of the analysis is not going to be that long, all things considered. However, my student is going to want to make explicit the temporal domain first because that will have some implications for state classification.
Namely, a state may undergo a massive transformation at some point in the data. Consider an analysis that leans on the full domain of data made available by the World Bank. World Bank data (e.g. GDP) are generally available as early as 1960 and may, in some cases, go to a very recently concluded calendar year (e.g. 2019, since 2020 just ended). If that’s the full domain, the student will want to be mindful of some major events that have important implications for state classification.
Consider the most obvious case here: the disintegration of the Soviet Union. Different classification systems code the disintegration of the Soviet Union differently.
- CoW, Gleditsch-Ward: CoW and Gleditsch-Ward code the creation of new states that followed in effectively the same way. Both understand the Soviet Union as effectively dominated by Russia, which precedes and succeeds the Soviet Union with the same code the Soviet Union had (365). Moldova (359), Estonia (366), Latvia (367), Lithuania (368), Ukraine (369), Belarus (370), Armenia (371), Georgia (372), Azerbaijan (373), Turkmenistan (701), Tajikistan (702), Kyrgyzstan (703), Uzbekistan (704), and Kazakhstan (705) emerge as independent states in 1991.
- UN M49: Per Wikipedia, the Soviet Union had a UN M49 code of 810. The disintegration of the Soviet Union creates new codes starting in 1991 for Armenia (051), Azerbaijan (031), Georgia (268), Kazakhstan (398), Kyrgyzstan (417), Tajikistan (762), Turkmenistan (795), Uzbekistan (860), Estonia (233), Latvia (428), Lithuania (440), Belarus (112), Moldova (498), Ukraine (804), and Russia (643). Importantly, the UN classification system sees the Russian Federation as a new entity entirely, and not just the dominant component of the Soviet Union.
- ISO: ISO does not readily advertise a temporal consideration to its classification scheme. Some digging identifies an “exceptional reservation” for the Soviet Union as SU for the two-character code and SUN for the three-character code. The Russian Federation is RU and RUS, respectively. Whereas CoW, Gleditsch-Ward, and the UN M49 classifications end the Soviet Union in 1991, ISO appears to only note this code emerges in 2008 and is “transitionally reserved from September 1992.”
For these four systems, CoW and Gleditsch-Ward are in effective agreement. There might be a slight difference among the days, but not the years nor the codes. UN M49 treats Russia as separate from the Soviet Union, in contrast with CoW and Gleditsch-Ward, but is in agreement about the year of the change. ISO treats Russia as separate from the Soviet Union, in agreement with UN M49, but the year of the change is different. Different systems, different coding procedures, different results.
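One way to check the CoW side of this comparison is to look at the state tenures recorded in the cow_states data that come with {peacesciencer}. The sketch below assumes that data frame has the columns ccode, statenme, styear, and endyear; the codes are the ones listed above.

```r
library(dplyr)
library(peacesciencer)

# CoW codes for Russia/USSR (365) and the other post-Soviet states.
post_soviet <- c(359, 365:373, 701:705)

soviet_tenures <- cow_states %>%
  filter(ccode %in% post_soviet) %>%
  select(ccode, statenme, styear, endyear) %>%
  arrange(styear, ccode)

soviet_tenures
```

Expect more rows than codes. A state can have more than one tenure in the CoW system, as the Baltic states do, which is one more reason to inspect the years and not just the codes.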
This is just the biggest case. However, there are other major events that lead to divergences in classification systems. Among them: the unification of Vietnam, the unification of Yemen, the Ethiopian Civil War (and creation of Eritrea), the unification of Germany, and—another biggie—the disintegration of Yugoslavia.
I mention this only to note that if the temporal domain is something like 2000 to 2019, there won’t be too many issues (other than some slight differences in how systems interpret the split between Serbia and Montenegro around 2006). If you want the full enchilada of a temporal domain, meaning the Correlates of War domain from 1816 to the present, there will be plenty of peculiarities/oddities in the classification system you choose that are worth knowing, especially to the extent that you’re going to be merging in data from multiple sources.
No matter, take inventory of the temporal domain you want first. State codes change over time. You’ll want to take stock of what headaches you can expect in your travels.
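As a minimal sketch of what fixing the temporal domain first looks like in practice, the following restricts a CoW state-year panel to 2000 through 2019. The create_stateyears() function is from {peacesciencer} and is discussed further below; the 2000 to 2019 window is just the example domain mentioned above.

```r
library(dplyr)
library(peacesciencer)

# Build the full CoW state-year panel, then keep only the chosen domain.
cow_panel <- create_stateyears() %>%
  filter(between(year, 2000, 2019))

range(cow_panel$year)
```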
Make One Classification System a “Master”, and Don’t Use the Country Name
Vincent Arel-Bundock’s {countrycode} package, which I’ll discuss later, is going to be useful for getting different classification systems to integrate with each other. However, my student (and the reader) should be reluctant to treat {countrycode} as magic or to use it uncritically. Namely, my student and the reader should treat one classification system as a “master” system for the particular project.
The system that the student/reader makes the “master” system is at their discretion. However, the master system should probably be the system that emerges as a center of gravity for the particular project. For example, I do a lot of research on inter-state conflict across time and space. The bulk of the data I use is in the CoW ecosystem. Naturally, CoW’s state system membership is ultimately my “master” system. It integrates perfectly with other components of the CoW data ecosystem (e.g. trade, material capabilities). One data source I integrate into these projects, the Polity regime type data, has a different classification system. When that arises, I standardize the Polity system codes to the CoW codes as well as I can and integrate into my data based on the matching CoW codes. Again, {countrycode} is wonderful for this purpose (more on that later), but it is not magic and there are always going to be some cleanup issues to address in the process. Still, it’s incumbent on me, in my case, to treat the CoW system as a master system because it’s the center of gravity for what I’m doing. It ultimately makes my job easier.
A student doing a lot of cross-national financial analyses will probably lean on the ISO system as the master system. Namely, ISO classification is everywhere and prominently used in International Monetary Fund and World Bank data. I believe the Penn World Table also uses the ISO system for its data.
One caution, though. The student/reader should not treat the English country name as the master system. A person who does this will be flagging discrepancies between a lot of countries/states, like “Bahamas, The”/“Bahamas”, “Brunei”/“Brunei Darussalam”, “Burma”/“Myanmar”, “Congo (Brazzaville)”/“Congo”/“Republic of Congo” and many, many more. To be fair, retaining country names in the data frame is going to be useful for diagnostic purposes, but it should not ever be the master system for classification.
Use a code, not a proper noun.
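Here is a toy illustration of why. The two data frames below are hypothetical sources that describe the same four countries with different English names; the x and y columns are placeholders, not real data.

```r
library(dplyr)

# Hypothetical source A: one naming convention.
source_a <- tibble(
  iso3c = c("BHS", "BRN", "MMR", "COG"),
  name  = c("Bahamas, The", "Brunei", "Burma", "Congo (Brazzaville)"),
  x     = 1:4  # placeholder values
)

# Hypothetical source B: same countries, different names.
source_b <- tibble(
  iso3c = c("BHS", "BRN", "MMR", "COG"),
  name  = c("Bahamas", "Brunei Darussalam", "Myanmar", "Republic of Congo"),
  y     = 5:8  # placeholder values
)

# Joining on the name finds no matches at all.
by_name <- inner_join(source_a, source_b, by = "name")

# Joining on the code matches every country.
by_code <- inner_join(source_a, source_b, by = "iso3c", suffix = c("_a", "_b"))

nrow(by_name)  # 0
nrow(by_code)  # 4
```

Keeping both name columns after the code-based join (name_a, name_b) is handy for eyeballing whether the match makes sense.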
Use R to Create a Panel of States (and States over Time)
The remainder of this post will advise the student on how to use a few lines in R and some R packages to generate a panel of states (and states over time). First, here are the R packages we’ll be using.
library(tidyverse) # for all things workflow
library(countrycode) # for integration among different classification systems
library(peacesciencer) # my R package for peace science stuff
library(ISOcodes) # for ISO and UN M49 codes
I do want the student/reader to notice one thing I’m doing here. Namely, I have an underlying code and a country name alongside it as well. Don’t use the country name for classification purposes, but do use it for debugging purposes. A reader may get fluent in CoW codes or ISO codes, but, in the event of a matching issue, sometimes it’s good to see the full country name.
Create a State-Year Panel of CoW States
This comes pre-processed in my {peacesciencer} package. create_stateyears() defaults to returning CoW state system members for all available years from 1816 to the most recently concluded calendar year.
create_stateyears()
#> # A tibble: 16,731 x 3
#> ccode statenme year
#> <dbl> <chr> <int>
#> 1 2 United States of America 1816
#> 2 2 United States of America 1817
#> 3 2 United States of America 1818
#> 4 2 United States of America 1819
#> 5 2 United States of America 1820
#> 6 2 United States of America 1821
#> 7 2 United States of America 1822
#> 8 2 United States of America 1823
#> 9 2 United States of America 1824
#> 10 2 United States of America 1825
#> # … with 16,721 more rows
Create a State-Year Panel of Gleditsch-Ward states
create_stateyears() can do the same for Gleditsch-Ward states, but requires the user to specify they want states from the Gleditsch-Ward system.
create_stateyears(system="gw")
#> # A tibble: 18,289 x 3
#> gwcode statename year
#> <dbl> <chr> <int>
#> 1 2 United States of America 1816
#> 2 2 United States of America 1817
#> 3 2 United States of America 1818
#> 4 2 United States of America 1819
#> 5 2 United States of America 1820
#> 6 2 United States of America 1821
#> 7 2 United States of America 1822
#> 8 2 United States of America 1823
#> 9 2 United States of America 1824
#> 10 2 United States of America 1825
#> # … with 18,279 more rows
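To get a rough feel for where the two systems diverge, one could compare the two panels directly. The sketch below is a crude diagnostic, not a proper crosswalk: it assumes that a shared numeric code in the same year refers to the same state, which is mostly but not always true across the two systems.

```r
library(dplyr)
library(peacesciencer)

cow <- create_stateyears(system = "cow")
gw  <- create_stateyears(system = "gw")

# Gleditsch-Ward state-years with no CoW state-year carrying the same number.
gw_only <- anti_join(gw, cow, by = c("gwcode" = "ccode", "year" = "year"))

# Summarize by state to see which cases drive the difference.
gw_only %>% count(gwcode, statename, sort = TRUE)
```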
Create a Panel of ISO Codes
ISO codes are ubiquitous in economic data. I do have some misgivings about using {countrycode} to create a panel of countries, even for the ISO codes. Recall my concern that ISO codes are not very transparent about when (or even if) a code changes at a particular point in time. No matter, the {ISOcodes} package has this information.
Recall my earlier plea, though: pick one system as a “master” system, even among ISO codes. I’m partial to the three-character ISO codes so I’ll use that here.
ISO_3166_1 %>% as_tibble() %>%
# Alpha_2 = iso2c, if you wanted it.
# I want the three-character one.
select(Alpha_3, Name)
#> # A tibble: 249 x 2
#> Alpha_3 Name
#> <chr> <chr>
#> 1 ABW Aruba
#> 2 AFG Afghanistan
#> 3 AGO Angola
#> 4 AIA Anguilla
#> 5 ALA Åland Islands
#> 6 ALB Albania
#> 7 AND Andorra
#> 8 ARE United Arab Emirates
#> 9 ARG Argentina
#> 10 ARM Armenia
#> # … with 239 more rows
{ISOcodes} does have another data frame for “retired” codes. This is ISO_3166_3 in the {ISOcodes} package. I encourage my student to take stock of how applicable some of these observations are for their particular analysis. My previous point about ISO codes (they don’t neatly communicate a temporal dimension) still holds.
ISO_3166_3 %>% as_tibble() %>%
# Get rid of codes we don't want because we're focusing on three-character
select(-Alpha_4, -Numeric)
ISO (3) | Name | Date Withdrawn | Comment |
---|---|---|---|
AFI | French Afars and Issas | 1977 | |
ANT | Netherlands Antilles | 1993-07-12 | |
ATB | British Antarctic Territory | 1979 | |
BUR | Burma, Socialist Republic of the Union of | 1989-12-05 | |
BYS | Byelorussian SSR Soviet Socialist Republic | 1992-06-15 | |
CSK | Czechoslovakia, Czechoslovak Socialist Republic | 1993-06-15 | |
SCG | Serbia and Montenegro | 2006-06-05 | |
CTE | Canton and Enderbury Islands | 1984 | |
DDR | German Democratic Republic | 1990-10-30 | |
DHY | Dahomey | 1977 | |
ATF | French Southern and Antarctic Territories | 1979 | now split between AQ and TF |
FXX | France, Metropolitan | 1997-07-14 | |
GEL | Gilbert and Ellice Islands | 1979 | now split into Kiribati and Tuvalu |
HVO | Upper Volta, Republic of | 1984 | |
JTN | Johnston Island | 1986 | |
MID | Midway Islands | 1986 | |
NHB | New Hebrides | 1980 | |
ATN | Dronning Maud Land | 1983 | |
NTZ | Neutral Zone | 1993-07-12 | formerly between Saudi Arabia and Iraq |
PCI | Pacific Islands (trust territory) | 1986 | divided into FM, MH, MP, and PW |
PUS | US Miscellaneous Pacific Islands | 1986 | |
PCZ | Panama Canal Zone | 1980 | |
RHO | Southern Rhodesia | 1980 | |
SKM | Sikkim | 1975 | |
SUN | USSR, Union of Soviet Socialist Republics | 1992-08-30 | |
TMP | East Timor | 2002-05-20 | was Portuguese Timor |
VDR | Viet-Nam, Democratic Republic of | 1977 | |
WAK | Wake Island | 1986 | |
YMD | Yemen, Democratic, People's Democratic Republic of | 1990-08-14 | |
YUG | Yugoslavia, Socialist Federal Republic of | 1993-07-28 | |
ZAR | Zaire, Republic of | 1997-07-14 |
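Notice that the withdrawal dates mix bare years with full dates. If the student wants a usable withdrawal year for a state-year panel, one simple approach is to keep the first four characters. The sketch below uses a few values copied from the table above and does not touch the {ISOcodes} data frame itself.

```r
# A handful of withdrawal dates from the table, keyed by three-character code.
withdrawn <- c(AFI = "1977", ANT = "1993-07-12",
               SUN = "1992-08-30", SCG = "2006-06-05")

# The year is always the first four characters, whatever the format.
withdrawn_year <- as.integer(substr(withdrawn, 1, 4))
names(withdrawn_year) <- names(withdrawn)

withdrawn_year
```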
Create a State-Year Panel of ISO Codes
If I understand these data correctly, the last change to ISO classification (that could pose a problem for merging from a CoW perspective) concerns the separation between Serbia and Montenegro in 2006. Taking this information to heart, let’s assume we wanted a state-year panel based off ISO codes for all ISO observations from 2010 to 2020. Toward that end, we’d do something like this.
ISO_3166_1 %>% as_tibble() %>%
# Alpha_2 = iso2c, if you wanted it.
# I want the three-character one.
select(Alpha_3, Name) %>%
mutate(styear = 2010,
endyear = 2020) %>%
rowwise() %>%
mutate(year = list(seq(styear, endyear))) %>%
unnest(year) %>%
select(-styear, -endyear)
#> # A tibble: 2,739 x 3
#> Alpha_3 Name year
#> <chr> <chr> <int>
#> 1 ABW Aruba 2010
#> 2 ABW Aruba 2011
#> 3 ABW Aruba 2012
#> 4 ABW Aruba 2013
#> 5 ABW Aruba 2014
#> 6 ABW Aruba 2015
#> 7 ABW Aruba 2016
#> 8 ABW Aruba 2017
#> 9 ABW Aruba 2018
#> 10 ABW Aruba 2019
#> # … with 2,729 more rows
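An alternative idiom for the same kind of expansion is expand_grid() from {tidyr}, which pairs every code with every year without the rowwise() step. This sketch uses three codes from the output above as a stand-in for the full ISO list.

```r
library(tidyr)

# Stand-in for the full set of three-character ISO codes.
codes <- c("ABW", "AFG", "AGO")

# Every combination of code and year, 2010 through 2020.
iso_panel <- expand_grid(Alpha_3 = codes, year = 2010:2020)

nrow(iso_panel)  # 3 codes x 11 years = 33
```

The trade-off is that expand_grid() assumes every unit has the same start and end year, while the rowwise() approach generalizes to unit-specific start and end years.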
Create a Panel of UN M49 Codes
{ISOcodes} also has UN M49 codes (UN_M.49_Countries), though this requires some light cleaning.
UN_M.49_Countries %>% as_tibble() %>%
select(-ISO_Alpha_3) %>%
mutate(Name = str_trim(Name, side="left"))
#> # A tibble: 249 x 2
#> Code Name
#> <chr> <chr>
#> 1 004 Afghanistan
#> 2 248 Åland Islands
#> 3 008 Albania
#> 4 012 Algeria
#> 5 016 American Samoa
#> 6 020 Andorra
#> 7 024 Angola
#> 8 660 Anguilla
#> 9 010 Antarctica
#> 10 028 Antigua and Barbuda
#> # … with 239 more rows
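Notice that the M49 codes are stored as character strings with leading zeros. Other sources may store them as plain numbers, so it is worth standardizing before merging. A minimal base R sketch, using three codes from the output above:

```r
# M49 codes as they appear in the data: character, zero-padded.
m49_chr <- c("004", "008", "012")

# Converting to integer drops the leading zeros.
m49_int <- as.integer(m49_chr)

# Padding back to three digits restores the original form.
m49_padded <- sprintf("%03d", m49_int)

m49_int     # 4 8 12
m49_padded  # "004" "008" "012"
```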
Use {countrycode} for Matching/Merging Across Classification Systems
While I encourage the student/reader to treat one classification system as a “master”, it’s highly unlikely the classification system that is the “master” will be the only one encountered in a particular project. For example, let’s assume our master system is the three-character ISO code. However, we’re going to merge in data (say: CoW’s trade data) that uses the CoW state system classification. {countrycode} will be very useful in matching one classification to another.
countrycode() is the primary function in Arel-Bundock’s package for that purpose. The user should create a column using the countrycode() function that identifies the source column (here: Alpha_3), identifies what type of classification that is (here: "iso3c"), and returns the equivalent code we want ("cown", for Correlates of War numeric code).
ISO_3166_1 %>% as_tibble() %>%
# Alpha_2 = iso2c, if you wanted it.
# I want the three-character one.
select(Alpha_3, Name) %>%
mutate(ccode = countrycode(Alpha_3, "iso3c", "cown"))
#> Warning in countrycode(Alpha_3, "iso3c", "cown"): Some values were not matched unambiguously: ABW, AIA, ALA, ASM, ATA, ATF, BES, BLM, BMU, BVT, CCK, COK, CUW, CXR, CYM, ESH, FLK, FRO, GGY, GIB, GLP, GRL, GUF, GUM, HKG, HMD, IMN, IOT, JEY, MAC, MAF, MNP, MSR, MTQ, MYT, NCL, NFK, NIU, PCN, PRI, PSE, PYF, REU, SGS, SHN, SJM, SPM, SRB, SXM, TCA, TKL, UMI, VGB, VIR, WLF
#> # A tibble: 249 x 3
#> Alpha_3 Name ccode
#> <chr> <chr> <dbl>
#> 1 ABW Aruba NA
#> 2 AFG Afghanistan 700
#> 3 AGO Angola 540
#> 4 AIA Anguilla NA
#> 5 ALA Åland Islands NA
#> 6 ALB Albania 339
#> 7 AND Andorra 232
#> 8 ARE United Arab Emirates 696
#> 9 ARG Argentina 160
#> 10 ARM Armenia 371
#> # … with 239 more rows
I do want the reader to observe something. countrycode() cannot perfectly match observations. To the extent that there are important differences among classification systems, perfect one-to-one matching is impossible (and it’s why I recommend treating one classification as a master system). When countrycode() cannot find a one-to-one match, it returns an NA and will tell you which inputs were not matched for your own diagnostic purposes. In our case, these are the NAs.
ISO (3) | Name | CoW Code |
---|---|---|
ABW | Aruba | |
AIA | Anguilla | |
ALA | Åland Islands | |
ASM | American Samoa | |
ATA | Antarctica | |
ATF | French Southern Territories | |
BES | Bonaire, Sint Eustatius and Saba | |
BLM | Saint Barthélemy | |
BMU | Bermuda | |
BVT | Bouvet Island | |
CCK | Cocos (Keeling) Islands | |
COK | Cook Islands | |
CUW | Curaçao | |
CXR | Christmas Island | |
CYM | Cayman Islands | |
ESH | Western Sahara | |
FLK | Falkland Islands (Malvinas) | |
FRO | Faroe Islands | |
GGY | Guernsey | |
GIB | Gibraltar | |
GLP | Guadeloupe | |
GRL | Greenland | |
GUF | French Guiana | |
GUM | Guam | |
HKG | Hong Kong | |
HMD | Heard Island and McDonald Islands | |
IMN | Isle of Man | |
IOT | British Indian Ocean Territory | |
JEY | Jersey | |
MAC | Macao | |
MAF | Saint Martin (French part) | |
MNP | Northern Mariana Islands | |
MSR | Montserrat | |
MTQ | Martinique | |
MYT | Mayotte | |
NCL | New Caledonia | |
NFK | Norfolk Island | |
NIU | Niue | |
PCN | Pitcairn | |
PRI | Puerto Rico | |
PSE | Palestine, State of | |
PYF | French Polynesia | |
REU | Réunion | |
SGS | South Georgia and the South Sandwich Islands | |
SHN | Saint Helena, Ascension and Tristan da Cunha | |
SJM | Svalbard and Jan Mayen | |
SPM | Saint Pierre and Miquelon | |
SRB | Serbia | |
SXM | Sint Maarten (Dutch part) | |
TCA | Turks and Caicos Islands | |
TKL | Tokelau | |
UMI | United States Minor Outlying Islands | |
VGB | Virgin Islands, British | |
VIR | Virgin Islands, U.S. | |
WLF | Wallis and Futuna |
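A listing like this can be pulled out by filtering on the missing codes. The sketch below uses a four-code sample in place of the full ISO list and sets warn = FALSE only to keep the example quiet; in real work the warning is worth reading.

```r
library(dplyr)
library(countrycode)

# Small stand-in for the full ISO data frame.
iso_sample <- tibble(Alpha_3 = c("ABW", "AFG", "TCA", "USA"))

matched <- iso_sample %>%
  mutate(ccode = countrycode(Alpha_3, "iso3c", "cown", warn = FALSE))

# Rows where no CoW code could be found.
unmatched <- matched %>% filter(is.na(ccode))

unmatched
```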
Some of this is by design. For example, there’s no CoW code for Aruba (ABW) because Aruba does not exist in the CoW system. That’ll be the bulk of the warnings returned by countrycode() for a case like this and you can safely ignore those. Some of this is, well, a headache you’ll need to fix yourself. For example, Serbia (SRB) always throws countrycode() for a loop, but Serbia has always been 345 in the CoW system. You can fix that yourself with an addendum to the mutate() wrapper. Something like ccode = ifelse(Alpha_3 == "SRB", 345, ccode) will work.
ISO_3166_1 %>% as_tibble() %>%
# Alpha_2 = iso2c, if you wanted it.
# I want the three-character one.
select(Alpha_3, Name) %>%
mutate(ccode = countrycode(Alpha_3, "iso3c", "cown"),
ccode = ifelse(Alpha_3 == "SRB", 345, ccode))
#> # A tibble: 249 x 3
#> Alpha_3 Name ccode
#> <chr> <chr> <dbl>
#> 1 ABW Aruba NA
#> 2 AFG Afghanistan 700
#> 3 AGO Angola 540
#> 4 AIA Anguilla NA
#> 5 ALA Åland Islands NA
#> 6 ALB Albania 339
#> 7 AND Andorra 232
#> 8 ARE United Arab Emirates 696
#> 9 ARG Argentina 160
#> 10 ARM Armenia 371
#> # … with 239 more rows
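countrycode() also has a custom_match argument that takes a named vector of overrides, which keeps fixes like this one in a single place when there are several of them. A minimal sketch on a two-code sample, assuming the Serbia fix described above is the only override needed:

```r
library(countrycode)

# Overrides: name = origin code, value = destination code.
overrides <- c("SRB" = 345)

fixed <- countrycode(c("AFG", "SRB"), "iso3c", "cown",
                     custom_match = overrides)

fixed
```

Whether to use custom_match or a follow-up ifelse() is a matter of taste; the point is that the fix is explicit and visible in the script.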
I use this to underscore that {countrycode} is one of the most useful R packages for merging and matching across different state/country classification systems. However, it is not magic and should not be used uncritically. Always inspect the output.
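What inspecting the output might look like is sketched below with made-up data. The master data frame is keyed on three-character ISO codes, cow_data is a hypothetical CoW-coded source with a placeholder variable z, and the two checks look for rows that failed to merge and for CoW codes that ended up attached to more than one ISO unit.

```r
library(dplyr)
library(countrycode)

# Master system: three-character ISO codes, with CoW codes added for merging.
master <- tibble(iso3c = c("AFG", "ARG", "ABW")) %>%
  mutate(ccode = countrycode(iso3c, "iso3c", "cown", warn = FALSE))

# Hypothetical CoW-coded data; z holds placeholder values.
cow_data <- tibble(ccode = c(700, 160), z = c(10, 20))

merged <- left_join(master, cow_data, by = "ccode")

# Check 1: which master rows picked up nothing from the merge?
merged %>% filter(is.na(z))

# Check 2: is any CoW code matched to more than one ISO unit?
master %>%
  filter(!is.na(ccode)) %>%
  count(ccode) %>%
  filter(n > 1)
```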
[1] I’ll be using they/them pronouns here mostly for maximum anonymity.