A Quick Tutorial on Various State (Country) Classification Systems

My graduate studies program director asked me to teach an independent study for a graduate student this semester. The goal is to better train the student for their research agenda beyond what I could plausibly teach them in a given semester.¹ Toward that end, I’m going to offer most (if not all) of the independent study sessions as posts on my blog. This should help the student and possibly help others who stumble onto my website. Going forward, I’m probably just going to copy-paste this introduction for future posts for this independent study.

The particular student is pursuing a research program in international political economy. Substantively, much of what they want to do is outside my wheelhouse. However, I can offer some things to help the student with their research. The first lesson will be about various state (country) classification systems.

Here’s a table of contents for what follows.

The Issue: There Are So Many Different Classification Systems!
Identify a Temporal Domain for a Cross-National Analysis (Because State Codes Change Over Time)
Make One Classification System a “Master”, and Don’t Use the Country Name
Use R to Create a Panel of States (and States over Time)

The Issue: There Are So Many Different Classification Systems!

It should not shock a graduate student in political science/policy analysis to learn that there is no universal standard for state classification. Indeed, various data sources and agencies will have varying definitions of what territorial unit counts as a state for classification purposes. Each data source/agency will also have a different coding scheme.

Take, for example, the following classification systems. The first, Correlates of War (CoW), leans on integers that range from 2 (the United States) to 990 (Samoa) to code states from 1816 to 2016. The second, the Gleditsch-Ward system, is a slight variation on the CoW system. The overlap is substantial and the numerical range is effectively the same, but important distinctions emerge as Gleditsch-Ward interpret independent states differently. The third consists of the two-character and three-character codes provided by the Organisation Internationale de Normalisation (ISO) 3166 Maintenance Agency, one that Americans will at least recognize as having tight integration with the American National Standards Institute as well as broad use elsewhere. The fourth is the United Nations’ M49 classification system. The fifth is the Geopolitical Entities, Names, and Codes (GENC) Standard (in both two-character and three-character form), which provides names and codes for U.S.-recognized entities and subdivisions. GENC supplanted the Federal Information Processing Standard (FIPS) about 10 years ago for this purpose. To round things out, we’ll include the Eurostat classification system (which greatly resembles ISO’s two-character code), the FIPS codes (which also look a lot like ISO’s two-character code), and the World Bank code (which is very similar to but slightly incompatible with ISO’s three-character code).

Here is how a few territorial units are coded, selected on whether their English country name starts with “T” and as these codes appear in the {countrycode} package.

Select Territorial Units and Their Various Codes
Country Name	CoW Code	Gleditsch-Ward Code	ISO (2)	ISO (3)	UN M49	GENC (2)	GENC (3)	Eurostat	FIPS	World Bank
Taiwan	713	713	TW	TWN		TW	TWN	TW	TW	TWN
Tajikistan	702	702	TJ	TJK	762	TJ	TJK	TJ	TI	TJK
Tanzania	510	510	TZ	TZA	834	TZ	TZA	TZ	TZ	TZA
Thailand	800	800	TH	THA	764	TH	THA	TH	TH	THA
Timor-Leste	860	860	TL	TLS	626	TL	TLS	TL	TT	TLS
Togo	461	461	TG	TGO	768	TG	TGO	TG	TO	TGO
Tokelau			TK	TKL	772	TK	TKL	TK	TL
Tonga	955		TO	TON	776	TO	TON	TO	TN	TON
Trinidad & Tobago	52	52	TT	TTO	780	TT	TTO	TT	TD	TTO
Tunisia	616	616	TN	TUN	788	TN	TUN	TN	TS	TUN
Turkey	640	640	TR	TUR	792	TR	TUR	TR	TU	TUR
Turkmenistan	701	701	TM	TKM	795	TM	TKM	TM	TX	TKM
Turks & Caicos Islands			TC	TCA	796	TC	TCA	TC	TK	TCA
Tuscany	337
Tuvalu	947		TV	TUV	798	TV	TUV	TV	TV	TUV
Two Sicilies	329

It can seem a bit daunting to see so many differences among these classification systems. With that in mind, I recommend that a student (in particular, my student this semester) do the following.

Identify a Temporal Domain for a Cross-National Analysis (Because State Codes Change Over Time)

My student is interested in a cross-national analysis of a group of states—regionally or globally, I can’t yet tell—with respect to a host of financial indicators. The extent to which the analysis involves financial indicators means the temporal domain of the analysis is not going to be that long, all things considered. However, my student is going to want to make explicit the temporal domain first because that will have some implications for state classification.

Namely, a state may undergo a massive transformation at some point in the data. Consider an analysis that leans on the full domain of data made available by the World Bank. World Bank data (e.g. GDP) are generally available as early as 1960 and may, in some cases, go to a very recently concluded calendar year (e.g. 2019, since 2020 just ended). If that’s the full domain, the student will want to be mindful of some major events that have important implications for state classification.

Consider the most obvious case here: the disintegration of the Soviet Union. Different classification systems code the disintegration of the Soviet Union differently.

CoW, Gleditsch-Ward: CoW and Gleditsch-Ward code the creation of new states that followed in effectively the same way. Both understand the Soviet Union as effectively dominated by Russia, which precedes and succeeds the Soviet Union with the same code the Soviet Union had (365). Moldova (359), Estonia (366), Latvia (367), Lithuania (368), Ukraine (369), Belarus (370), Armenia (371), Georgia (372), Azerbaijan (373), Turkmenistan (701), Tajikistan (702), Kyrgyzstan (703), Uzbekistan (704), and Kazakhstan (705) emerge as independent states in 1991.
UN M49: Per Wikipedia, the Soviet Union had a UN M49 code of 810. The disintegration of the Soviet Union creates new codes starting in 1991 for Armenia (051), Azerbaijan (031), Georgia (268), Kazakhstan (398), Kyrgyzstan (417), Tajikistan (762), Turkmenistan (795), Uzbekistan (860), Estonia (233), Latvia (428), Lithuania (440), Belarus (112), Moldova (498), Ukraine (804), and Russia (643). Importantly, the UN classification system sees the Russian Federation as a new entity entirely, and not just the dominant component of the Soviet Union.
ISO: ISO does not readily advertise a temporal consideration to its classification scheme. Some digging identifies an “exceptional reservation” for the Soviet Union as SU for the two-character code and SUN for the three-character code. The Russian Federation is RU and RUS, respectively. Whereas CoW, Gleditsch-Ward, and the UN M49 classifications end the Soviet Union in 1991, ISO appears to only note this code emerges in 2008 and is “transitionally reserved from September 1992.”

For these four systems, CoW and Gleditsch-Ward are in effective agreement. There might be a slight difference among the days, but not the years nor the codes. UN M49 treats Russia as separate from the Soviet Union, in contrast with CoW and Gleditsch-Ward, but is in agreement about the year of the change. ISO treats Russia as separate from the Soviet Union, in agreement with UN M49, but the year of the change is different. Different systems, different coding procedures, different results.

This is just the biggest case. However, there are other major events that lead to divergences in classification systems. Among them: the unification of Vietnam, the unification of Yemen, the Ethiopian Civil War (and creation of Eritrea), the unification of Germany, and—another biggie—the disintegration of Yugoslavia.

I mention this only to note that if the temporal domain is something like 2000 to 2019, there won’t be too many issues (other than some slightly different interpretations of the split between Serbia and Montenegro around 2006). If you want the full enchilada of a temporal domain (the Correlates of War domain from 1816 to the present), there will be plenty of peculiarities/oddities in the classification system you choose that are worth knowing, especially to the extent that you’re going to be merging in data from multiple sources.

Either way, take inventory of the temporal domain you want first. State codes change over time. You’ll want to take stock of what headaches you can expect in your travels.

Make One Classification System a “Master”, and Don’t Use the Country Name

Vincent Arel-Bundock’s {countrycode} package, which I’ll discuss later, is going to be useful for getting different classification systems to integrate with each other. However, my student (and the reader) should be reluctant to treat {countrycode} as magic or to use it uncritically. Namely, my student and the reader should treat one classification system as a “master” system for the particular project.

The system that the student/reader makes the “master” system is at their discretion. However, the master system should probably be the system that emerges as a center of gravity for the particular project. For example, I do a lot of research on inter-state conflict across time and space. The bulk of the data I use is in the CoW ecosystem. Naturally, CoW’s state system membership is ultimately my “master” system. It integrates perfectly with other components of the CoW data ecosystem (e.g. trade, material capabilities). One data source I integrate into these projects, the Polity regime type data, has a different classification system. When that arises, I standardize the Polity system codes to the CoW codes as well as I can and integrate into my data based on the matching CoW codes. Again, {countrycode} is wonderful for this purpose (more on that later), but it is not magic and there are always going to be some cleanup issues to address in the process. Still, it’s incumbent on me, in my case, to treat the CoW system as a master system because it’s the center of gravity for what I’m doing. It makes my job ultimately easier.

A student doing a lot of cross-national financial analyses will probably lean on the ISO system as the master system. Namely, ISO classification is everywhere and prominently used in International Monetary Fund and World Bank data. I believe the Penn World Table also uses the ISO system for its data.

One caution, though. The student/reader should not treat the English country name as the master system. A person who does this will be flagging discrepancies between a lot of countries/states, like “Bahamas, The”/“Bahamas”, “Brunei”/“Brunei Darussalam”, “Burma”/“Myanmar”, “Congo (Brazzaville)”/“Congo”/“Republic of Congo” and many, many more. To be fair, retaining country names in the data frame is going to be useful for diagnostic purposes, but it should not ever be the master system for classification.

Use a code, not a proper noun.
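To make the point concrete, here is a minimal sketch in base R. The two data frames are hypothetical toy data (the gdp and pop values are made up); only the name spellings and the three-character ISO codes reflect real usage. It is meant as an illustration of the failure mode, not a recipe for any particular data source.

```r
# Hypothetical toy data: two sources that spell country names differently
# but agree on the three-character ISO code.
source_a <- data.frame(
  iso3c = c("BHS", "BRN", "MMR"),
  name  = c("Bahamas, The", "Brunei", "Burma"),
  gdp   = c(1, 2, 3),        # made-up values, for illustration only
  stringsAsFactors = FALSE
)
source_b <- data.frame(
  iso3c = c("BHS", "BRN", "MMR"),
  name  = c("Bahamas", "Brunei Darussalam", "Myanmar"),
  pop   = c(10, 20, 30),     # made-up values, for illustration only
  stringsAsFactors = FALSE
)

# Merging on the name keeps only rows whose spellings match exactly.
by_name <- merge(source_a, source_b, by = "name")
nrow(by_name)
#> [1] 0

# Merging on the code keeps every country; both names ride along for debugging.
by_code <- merge(source_a, source_b, by = "iso3c", suffixes = c("_a", "_b"))
nrow(by_code)
#> [1] 3
```

The merge on the code still carries both spellings of the name (name_a and name_b), which is exactly the diagnostic role the country name should play.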

Use R to Create a Panel of States (and States over Time)

The remainder of this post will advise the student on how to use a few lines in R and some R packages to generate a panel of states (and states over time). First, here are the R packages we’ll be using.

library(tidyverse) # for all things workflow
library(countrycode) # for integration among different classification systems
library(peacesciencer) # my R package for peace science stuff
library(ISOcodes) # for ISO and UN M 49 codes

I do want the student/reader to notice one thing in the panels that follow. Namely, each has an underlying code and a country name alongside it. Don’t use the country name for classification purposes, but do use it for debugging purposes. A reader may get fluent in CoW codes or ISO codes, but, in the event of a matching issue, sometimes it’s good to see the full country name.

Create a State-Year Panel of CoW States

This comes pre-processed in my {peacesciencer} package. create_stateyears() defaults to returning CoW state system members for all available years from 1816 to the most recently concluded calendar year.

create_stateyears()
#> # A tibble: 16,731 x 3
#>    ccode statenme                  year
#>    <dbl> <chr>                    <int>
#>  1     2 United States of America  1816
#>  2     2 United States of America  1817
#>  3     2 United States of America  1818
#>  4     2 United States of America  1819
#>  5     2 United States of America  1820
#>  6     2 United States of America  1821
#>  7     2 United States of America  1822
#>  8     2 United States of America  1823
#>  9     2 United States of America  1824
#> 10     2 United States of America  1825
#> # … with 16,721 more rows

Create a State-Year Panel of Gleditsch-Ward states

create_stateyears() can do the same for Gleditsch-Ward states, but requires the user to specify they want states from the Gleditsch-Ward system.

create_stateyears(system="gw")
#> # A tibble: 18,289 x 3
#>    gwcode statename                 year
#>     <dbl> <chr>                    <int>
#>  1      2 United States of America  1816
#>  2      2 United States of America  1817
#>  3      2 United States of America  1818
#>  4      2 United States of America  1819
#>  5      2 United States of America  1820
#>  6      2 United States of America  1821
#>  7      2 United States of America  1822
#>  8      2 United States of America  1823
#>  9      2 United States of America  1824
#> 10      2 United States of America  1825
#> # … with 18,279 more rows

Create a Panel of ISO Codes

ISO codes are ubiquitous in economic data. I do have some misgivings about using {countrycode} to create a panel of countries, even for the ISO codes. Recall my concern that ISO codes are not very transparent about when (or even if) a code changes at a particular point in time. No matter, the {ISOcodes} package has this information.

Recall my earlier plea, though: pick one system as a “master” system, even among ISO codes. I’m partial to the three-character ISO codes so I’ll use that here.

ISO_3166_1 %>% as_tibble() %>%
  # Alpha_2 = iso2c, if you wanted it.
  # I want the three-character one.
  select(Alpha_3, Name)
#> # A tibble: 249 x 2
#>    Alpha_3 Name                
#>    <chr>   <chr>               
#>  1 ABW     Aruba               
#>  2 AFG     Afghanistan         
#>  3 AGO     Angola              
#>  4 AIA     Anguilla            
#>  5 ALA     Åland Islands       
#>  6 ALB     Albania             
#>  7 AND     Andorra             
#>  8 ARE     United Arab Emirates
#>  9 ARG     Argentina           
#> 10 ARM     Armenia             
#> # … with 239 more rows

{ISOcodes} does have another data frame for “retired” codes. This is ISO_3166_3 in the {ISOcodes} package. I encourage my student to take stock of how applicable some of these observations are for their particular analysis. My previous point about ISO codes—they don’t neatly communicate a temporal dimension—still holds.

ISO_3166_3 %>% as_tibble() %>%
  # Get rid of codes we don't want because we're focusing on three-character
  select(-Alpha_4, -Numeric)

A Table of Retired ISO Countries/Observations
ISO (3)	Name	Date Withdrawn	Comment
AFI	French Afars and Issas	1977
ANT	Netherlands Antilles	1993-07-12
ATB	British Antarctic Territory	1979
BUR	Burma, Socialist Republic of the Union of	1989-12-05
BYS	Byelorussian SSR Soviet Socialist Republic	1992-06-15
CSK	Czechoslovakia, Czechoslovak Socialist Republic	1993-06-15
SCG	Serbia and Montenegro	2006-06-05
CTE	Canton and Enderbury Islands	1984
DDR	German Democratic Republic	1990-10-30
DHY	Dahomey	1977
ATF	French Southern and Antarctic Territories	1979	now split between AQ and TF
FXX	France, Metropolitan	1997-07-14
GEL	Gilbert and Ellice Islands	1979	now split into Kiribati and Tuvalu
HVO	Upper Volta, Republic of	1984
JTN	Johnston Island	1986
MID	Midway Islands	1986
NHB	New Hebrides	1980
ATN	Dronning Maud Land	1983
NTZ	Neutral Zone	1993-07-12	formerly between Saudi Arabia and Iraq
PCI	Pacific Islands (trust territory)	1986	divided into FM, MH, MP, and PW
PUS	US Miscellaneous Pacific Islands	1986
PCZ	Panama Canal Zone	1980
RHO	Southern Rhodesia	1980
SKM	Sikkim	1975
SUN	USSR, Union of Soviet Socialist Republics	1992-08-30
TMP	East Timor	2002-05-20	was Portuguese Timor
VDR	Viet-Nam, Democratic Republic of	1977
WAK	Wake Island	1986
YMD	Yemen, Democratic, People's Democratic Republic of	1990-08-14
YUG	Yugoslavia, Socialist Federal Republic of	1993-07-28
ZAR	Zaire, Republic of	1997-07-14
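One way to take stock of the retired codes is to check which withdrawals fall inside the temporal domain of the project. The sketch below uses base R and a handful of rows copied by hand from the table above rather than the full ISO_3166_3 data frame. The column name withdrawn and the helper retired_in_domain() are my own labels for illustration, not names from {ISOcodes}.

```r
# A handful of rows copied from the table above (not the full data frame).
# `withdrawn` is a hypothetical column name chosen for this sketch.
retired <- data.frame(
  Alpha_3   = c("DDR", "YMD", "SUN", "YUG", "SCG"),
  withdrawn = c("1990-10-30", "1990-08-14", "1992-08-30",
                "1993-07-28", "2006-06-05"),
  stringsAsFactors = FALSE
)

# Withdrawal dates mix bare years ("1977") and full dates ("1993-07-12"),
# so keep only the leading four-digit year.
retired$withdrawn_year <- as.integer(substr(retired$withdrawn, 1, 4))

# Hypothetical helper: which retired codes were withdrawn inside the domain?
retired_in_domain <- function(data, start, end) {
  data[data$withdrawn_year >= start & data$withdrawn_year <= end, ]
}

retired_in_domain(retired, 2000, 2019)$Alpha_3
#> [1] "SCG"
```

Widening the domain to 1960 through 2019 returns all five rows in this toy subset, which is a quick signal that a longer panel needs more attention to code changes.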

Create a State-Year Panel of ISO Codes

If I understand these data correctly, the last change to ISO classification (that could pose a problem for merging from a CoW perspective) concerns the separation between Serbia and Montenegro in 2006. Taking this information to heart, let’s assume we wanted a state-year panel based off ISO codes for all ISO observations from 2010 to 2020. Toward that end, we’d do something like this.

ISO_3166_1 %>% as_tibble() %>%
  # Alpha_2 = iso2c, if you wanted it.
  # I want the three-character one.
  select(Alpha_3, Name) %>%
  mutate(styear = 2010,
         endyear = 2020) %>%
  rowwise() %>%
  mutate(year = list(seq(styear, endyear))) %>%
  unnest(year) %>%
  select(-styear, -endyear)
#> # A tibble: 2,739 x 3
#>    Alpha_3 Name   year
#>    <chr>   <chr> <int>
#>  1 ABW     Aruba  2010
#>  2 ABW     Aruba  2011
#>  3 ABW     Aruba  2012
#>  4 ABW     Aruba  2013
#>  5 ABW     Aruba  2014
#>  6 ABW     Aruba  2015
#>  7 ABW     Aruba  2016
#>  8 ABW     Aruba  2017
#>  9 ABW     Aruba  2018
#> 10 ABW     Aruba  2019
#> # … with 2,729 more rows
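The same code-by-year expansion can be sketched as a Cartesian product of codes and years. This is a base R alternative under the assumption that every code exists in every year of the domain, which is the same assumption the pipeline above makes. The three-row codes data frame is a toy subset for illustration.

```r
# Toy subset of ISO codes, copied from the output above.
codes <- data.frame(
  Alpha_3 = c("ABW", "AFG", "AGO"),
  Name    = c("Aruba", "Afghanistan", "Angola"),
  stringsAsFactors = FALSE
)
years <- data.frame(year = 2010:2020)

# by = NULL asks merge() for the Cartesian product:
# every code is paired with every year.
panel <- merge(codes, years, by = NULL)
panel <- panel[order(panel$Alpha_3, panel$year), ]

nrow(panel)   # 3 codes x 11 years
#> [1] 33
```

A quick sanity check on any panel built this way is that the row count equals the number of codes times the number of years, and that no code-year pair appears twice.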

Create a Panel of UN M49 Codes

{ISOcodes} also has UN M49 codes (UN_M.49_Countries), though this requires some light cleaning.

UN_M.49_Countries %>% as_tibble() %>% 
  select(-ISO_Alpha_3) %>%
  mutate(Name = str_trim(Name, side="left"))
#> # A tibble: 249 x 2
#>    Code  Name               
#>    <chr> <chr>              
#>  1 004   Afghanistan        
#>  2 248   Åland Islands      
#>  3 008   Albania            
#>  4 012   Algeria            
#>  5 016   American Samoa     
#>  6 020   Andorra            
#>  7 024   Angola             
#>  8 660   Anguilla           
#>  9 010   Antarctica         
#> 10 028   Antigua and Barbuda
#> # … with 239 more rows

Use {countrycode} for Matching/Merging Across Classification Systems

While I encourage the student/reader to treat one classification system as a “master”, it’s highly unlikely the classification system that is the “master” will be the only one encountered in a particular project. For example, let’s assume our master system is the three-character ISO code. However, we’re going to merge in data (say: CoW’s trade data) that uses the CoW state system classification. {countrycode} will be very useful in matching one classification to another.

countrycode() is the primary function in Arel-Bundock’s package for that purpose. The user should create a column using the countrycode() function that identifies the source column (here: Alpha_3), identifies what type of classification that is (here: "iso3c"), and returns the equivalent code we want ("cown", for Correlates of War numeric code).

ISO_3166_1 %>% as_tibble() %>%
  # Alpha_2 = iso2c, if you wanted it.
  # I want the three-character one.
  select(Alpha_3, Name) %>%
  mutate(ccode = countrycode(Alpha_3, "iso3c", "cown"))
#> Warning in countrycode(Alpha_3, "iso3c", "cown"): Some values were not matched unambiguously: ABW, AIA, ALA, ASM, ATA, ATF, BES, BLM, BMU, BVT, CCK, COK, CUW, CXR, CYM, ESH, FLK, FRO, GGY, GIB, GLP, GRL, GUF, GUM, HKG, HMD, IMN, IOT, JEY, MAC, MAF, MNP, MSR, MTQ, MYT, NCL, NFK, NIU, PCN, PRI, PSE, PYF, REU, SGS, SHN, SJM, SPM, SRB, SXM, TCA, TKL, UMI, VGB, VIR, WLF
#> # A tibble: 249 x 3
#>    Alpha_3 Name                 ccode
#>    <chr>   <chr>                <dbl>
#>  1 ABW     Aruba                   NA
#>  2 AFG     Afghanistan            700
#>  3 AGO     Angola                 540
#>  4 AIA     Anguilla                NA
#>  5 ALA     Åland Islands           NA
#>  6 ALB     Albania                339
#>  7 AND     Andorra                232
#>  8 ARE     United Arab Emirates   696
#>  9 ARG     Argentina              160
#> 10 ARM     Armenia                371
#> # … with 239 more rows

I do want the reader to observe something. countrycode() cannot perfectly match observations. Because there are important differences among classification systems, perfect one-to-one matching is impossible (and it’s why I recommend treating one classification as a master system). When countrycode() cannot find a one-to-one match, it returns an NA and will tell you which inputs were not matched for your own diagnostic purposes. In our case, these are the NAs.

ISO Codes Without CoW Codes
ISO (3)	Name	CoW Code
ABW	Aruba
AIA	Anguilla
ALA	Åland Islands
ASM	American Samoa
ATA	Antarctica
ATF	French Southern Territories
BES	Bonaire, Sint Eustatius and Saba
BLM	Saint Barthélemy
BMU	Bermuda
BVT	Bouvet Island
CCK	Cocos (Keeling) Islands
COK	Cook Islands
CUW	Curaçao
CXR	Christmas Island
CYM	Cayman Islands
ESH	Western Sahara
FLK	Falkland Islands (Malvinas)
FRO	Faroe Islands
GGY	Guernsey
GIB	Gibraltar
GLP	Guadeloupe
GRL	Greenland
GUF	French Guiana
GUM	Guam
HKG	Hong Kong
HMD	Heard Island and McDonald Islands
IMN	Isle of Man
IOT	British Indian Ocean Territory
JEY	Jersey
MAC	Macao
MAF	Saint Martin (French part)
MNP	Northern Mariana Islands
MSR	Montserrat
MTQ	Martinique
MYT	Mayotte
NCL	New Caledonia
NFK	Norfolk Island
NIU	Niue
PCN	Pitcairn
PRI	Puerto Rico
PSE	Palestine, State of
PYF	French Polynesia
REU	Réunion
SGS	South Georgia and the South Sandwich Islands
SHN	Saint Helena, Ascension and Tristan da Cunha
SJM	Svalbard and Jan Mayen
SPM	Saint Pierre and Miquelon
SRB	Serbia
SXM	Sint Maarten (Dutch part)
TCA	Turks and Caicos Islands
TKL	Tokelau
UMI	United States Minor Outlying Islands
VGB	Virgin Islands, British
VIR	Virgin Islands, U.S.
WLF	Wallis and Futuna

Some of this is by design. For example, there’s no CoW code for Aruba (ABW) because Aruba does not exist in the CoW system. That’ll be the bulk of the warnings returned by countrycode() for a case like this and you can safely ignore those. Some of this is, well, a headache you’ll need to fix yourself. For example, Serbia (SRB) always throws countrycode() for a loop, but Serbia has always been 345 in the CoW system. You can fix that yourself with an addendum to the mutate() wrapper. Something like ccode = ifelse(Alpha_3 == "SRB", 345, ccode) will work.

ISO_3166_1 %>% as_tibble() %>%
  # Alpha_2 = iso2c, if you wanted it.
  # I want the three-character one.
  select(Alpha_3, Name) %>%
  mutate(ccode = countrycode(Alpha_3, "iso3c", "cown"),
         ccode = ifelse(Alpha_3 == "SRB", 345, ccode)) 
#> # A tibble: 249 x 3
#>    Alpha_3 Name                 ccode
#>    <chr>   <chr>                <dbl>
#>  1 ABW     Aruba                   NA
#>  2 AFG     Afghanistan            700
#>  3 AGO     Angola                 540
#>  4 AIA     Anguilla                NA
#>  5 ALA     Åland Islands           NA
#>  6 ALB     Albania                339
#>  7 AND     Andorra                232
#>  8 ARE     United Arab Emirates   696
#>  9 ARG     Argentina              160
#> 10 ARM     Armenia                371
#> # … with 239 more rows
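The ifelse() fix above can also be expressed through the custom_match argument of countrycode(), which takes a named vector of overrides. This is a minimal sketch on a three-element vector rather than the full ISO data frame. I have not checked whether the unmatched-value warning for SRB still appears when an override is supplied, so treat that part as something to verify on your own version of the package.

```r
library(countrycode)

iso <- c("AFG", "ALB", "SRB")

# custom_match is a named vector: the names are origin codes and the values
# are the destination codes that should override the default result.
ccode <- countrycode(iso, "iso3c", "cown",
                     custom_match = c(SRB = 345))

data.frame(Alpha_3 = iso, ccode = ccode)
```

Keeping all manual overrides in one named vector also documents them in one place, which helps when several cases need fixing.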

I use this to underscore that {countrycode} is one of the most useful R packages for merging and matching across different state/country classification systems. However, it is not magic and should not be used uncritically. Always inspect the output.
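As one hedged illustration of what inspecting the output might look like, the sketch below lists origin codes that came back as NA and destination codes that appear more than once. The helper inspect_match() is hypothetical and written in base R. The crosswalk is toy data, and the codes XXA, XXB, and 999 are made-up placeholders used only to trigger the duplicate check.

```r
# Hypothetical helper that summarizes a crosswalk after matching.
inspect_match <- function(data, origin, destination) {
  is_missing <- is.na(data[[destination]])
  matched    <- data[[destination]][!is_missing]
  list(
    # Origin codes with no destination code (some are expected, e.g. Aruba).
    unmatched = data[[origin]][is_missing],
    # Destination codes assigned to more than one origin code.
    duplicated_targets = unique(matched[duplicated(matched)])
  )
}

# Toy crosswalk: XXA, XXB, and 999 are made-up placeholders.
crosswalk <- data.frame(
  Alpha_3 = c("ABW", "AFG", "AGO", "XXA", "XXB"),
  ccode   = c(NA, 700, 540, 999, 999),
  stringsAsFactors = FALSE
)

result <- inspect_match(crosswalk, "Alpha_3", "ccode")
result$unmatched
#> [1] "ABW"
result$duplicated_targets
#> [1] 999
```

Duplicated destination codes matter because they can silently multiply rows in a later merge.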

¹ I’ll be using they/them pronouns here mostly for maximum anonymity.