At 3 a.m. on Sunday, 1 August 1982, a group of Kenya Air Force soldiers led by Senior Private Hezekiah Ochuka attempted to overthrow the government of President Daniel arap Moi. The Kenyan Army foiled the attempt and Moi remained in power, although about 100 soldiers and more than 200 civilians died that day. This event would significantly influence the future of Kenya.

Instability in a country involves political, social, or economic upheaval. It can manifest as a coup or other illegal regime change, a breakdown of institutions, widespread systemic corruption and state capture, organised crime, loss of territorial control, economic crisis, large-scale public unrest, or involuntary mass population displacement driven by war, civil war, or other forms of conflict.

I am fascinated by how society behaves, evolves, and reacts to external and internal events, and I am equally fascinated by machine learning. So I decided to develop a machine learning model that can predict coups based on historical and socio-economic data.

Factors that lead to instability

Datasets

Importing the Libraries and Loading the Dataset


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score

import seaborn as sns
sns.set_style("whitegrid", {'axes.grid':False})

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# load the Google Drive connector
from google.colab import drive
drive.mount('/content/drive')

# read the data
df = pd.read_csv(r'/content/drive/MyDrive/Coup_Data_v2.0.0.csv')
df.sample(10)

Coup data

Exploring the Data (EDA)

When the ‘realized’ column equals 1, the event was an actual coup; ‘realized’ and ‘coup’ mean the same thing here.

df.columns

Index(['coup_id', 'cowcode', 'country', 'year', 'month', 'day', 'event_type', 'unrealized', 'realized', 'conspiracy', 'attempt', 'military', 'dissident', 'rebel', 'palace', 'foreign', 'auto', 'resign', 'popular', 'counter', 'other', 'noharm', 'injured', 'killed', 'harrest', 'jailed', 'tried', 'fled', 'exile'], dtype='object')

df.shape

(943, 29)

There are 943 events recorded in the dataset.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 coup_id 943 non-null object
1 cowcode 943 non-null int64
2 country 943 non-null object
3 year 943 non-null int64
4 month 943 non-null int64
5 day 943 non-null int64
6 event_type 943 non-null object
7 unrealized 943 non-null int64
8 realized 943 non-null int64
9 conspiracy 943 non-null int64
10 attempt 943 non-null int64
11 military 943 non-null int64
12 dissident 943 non-null int64
13 rebel 943 non-null int64
14 palace 943 non-null int64
15 foreign 943 non-null int64
16 auto 943 non-null int64
17 resign 943 non-null int64
18 popular 943 non-null int64
19 counter 943 non-null int64
20 other 943 non-null int64
21 noharm 943 non-null int64
22 injured 943 non-null int64
23 killed 943 non-null int64
24 harrest 943 non-null int64
25 jailed 943 non-null int64
26 tried 943 non-null int64
27 fled 943 non-null int64
28 exile 943 non-null int64
dtypes: int64(26), object(3)
memory usage: 213.8+ KB

This dataset has data on several kinds of instability events, including coups, conspiracies, and attempts. We shall focus only on realized coups, as they are discrete, measurable events and the most drastic form of regime change.

pd.pivot_table(df.sample(10), values = 'realized', index=['country'])
Pivot table

We shall now slice the dataframe to include only realized coup events, then count the coups per country.
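A minimal sketch of the slicing step, assuming a value of 1 in ‘realized’ marks a completed coup, so the next snippet has a coups dataframe to work with:

# keep only the events that were realized coups
coups = df[df['realized'] == 1]
coups.shape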

# keep the country and date columns
final_coups = coups.iloc[:, 2:6]
final_coups.head(10)
Coup data

We don’t need the dates.

final_coups.drop(['year','month','day'], axis='columns', inplace=True)
# then we can group by 'country' and count the coups
country_coups = final_coups.groupby(['country'])['country'].count()
country_coups
country
Afghanistan 8
Algeria 7
Angola 1
Argentina 9
Azerbaijan 2
...
Venezuela 4
Yemen 1
Yemen Arab Republic 3
Yemen PDR 4
Zimbabwe 1
Name: country, dtype: int64

To clean this up, we remove the index name, reset the index, and assign proper column names.

# remove index name
country_coups.index.name = None 
# reset the index
country_coups = country_coups.reset_index() 
# insert the column names
country_coups.columns = ["country", "Number_of_coups"] 
index country Number_of_coups
69 Paraguay 6
44 Honduras 7
8 Brazil 5
90 Thailand 11
73 Portugal 3
31 Ethiopia 7
51 Laos 7
4 Azerbaijan 2
3 Argentina 9
87 Swaziland 1
92 Tunisia 3
88 Syria 13
15 Chile 1
20 Costa Rica 2
80 Sao Tome and Principe 2
21 Cote d’Ivoire 2
64 Niger 4
102 Zimbabwe 1
94 USSR 2
63 Nicaragua 2

This looks much better. Let’s make sure we don’t have any missing values.

country_coups.isna()
index country Number_of_coups
0 False False
1 False False
2 False False
3 False False
4 False False
5 False False
... ... ...
100 False False
101 False False
102 False False

No missing values.

#Checking for duplicate values
country_coups.duplicated().sum()

np.int64(0)

Good, no missing values or duplicates.

Next, we get a list of oil-producing countries. I have an R script that is short and sweet for scraping Wikipedia data.

# load the rvest library
library(rvest)
oilurl = read_html('https://en.wikipedia.org/wiki/List_of_countries_by_proven_oil_reserves')

# extract a list of tables in this page
oil_tables = html_table(oilurl)
oil_tables

# select the right table
oil_countries = data.frame(oil_tables[2])
oil_countries
getwd()

# save the table locally as a csv file
setwd('C:/Users/me/coup')
write.csv(oil_countries, 'oil_countries.csv', row.names = TRUE)
dim(oil_countries)

This yields a table that looks like this:

Oil production data

We do not need the production capacities, just a list of oil-producing nations. We will drop all the extra columns and create a column ‘oil_producing’ which we will fill with 1 for an oil-producing nation and 0 for the others.
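First, the CSV written by the R script needs to be read back into pandas; a sketch, assuming the file was uploaded to the same Drive path used below:

# read the scraped table back into pandas
oil_countries = pd.read_csv(r'/content/drive/MyDrive/oil_countries.csv')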

oil_countries.drop(['Oil_production', 
'Years_of_production_in_reserve'], axis = 1, inplace=True)
oil_countries['oil_producing'] = 1
oil_countries.to_csv('/content/drive/MyDrive/oil_countries.csv')
oil_countries.sample(5)
index country oil_producing
34 Australia 1
20 Azerbaijan 1
90 Belize 1
24 India 1
75 Japan 1

On to the Gini index. To get the data, I scraped it the same way as the oil-producing countries’ data and loaded it into my notebook. This data also came with population figures, which I will leave in.

# Data on inequality as measured by the gini index 
gini_index = pd.read_csv(r'/content/drive/MyDrive/gini.csv')
gini_index.head()
index population country gini_index
11 52085168 Colombia 51.3
71 1425671352 China 38.5
75 334506 Vanuatu 37.6
94 654768 Luxembourg 35.4
117 3210847 Bosnia and Herzegovina 33.0

Next was the democracy index data, which I got the same way as the others. Here is a sample.

index country democracy_index
0 Afghanistan 0.32
1 Albania 6.41
2 Algeria 3.66
3 Angola 3.96
4 Argentina 6.85

Finally, I got a list of all the countries in the world to compare against.
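Before merging, each scraped table needs to be in a dataframe. A minimal sketch of loading the two that have not appeared in a snippet yet (the file names are assumptions):

# democracy index data scraped earlier
demo_index = pd.read_csv(r'/content/drive/MyDrive/democracy_index.csv')

# reference list of all countries in the world
all_countries = pd.read_csv(r'/content/drive/MyDrive/all_countries.csv')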

I combined these disparate datasets using pandas merge with an outer join on the ‘country’ column, as exemplified below:

combo = pd.merge(demo_index, all_countries, 
on='country', how='outer')
combo.sample(10)
index country democracy_index
13 Barbados NaN
75 Guinea 2.32
156 San Marino NaN
40 Costa Rica 8.29
158 Saudi Arabia 2.08
52 Dominican Republic 6.39
146 Portugal 7.95
62 Fiji 5.55
141 Papua New Guinea 5.97
154 Saint Lucia NaN

After combining all of these datasets, this is what we have:

Final dataset

The NaN values in the ‘oil_producing’ and ‘Number_of_coups’ columns were filled with 0 to represent no oil or no coup. All other rows with NaN values were dropped.
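A minimal sketch of this cleanup, assuming the merged dataframe is named final as in the modeling code below; the construction of the ‘coup’ target is not shown, so the last line is an assumption about how it was derived:

# fill missing values: no oil production or no recorded coups become 0
final['oil_producing'] = final['oil_producing'].fillna(0)
final['Number_of_coups'] = final['Number_of_coups'].fillna(0)
# drop countries with gaps in any remaining indicator
final = final.dropna()
# assumed target definition: 1 if the country has ever had a coup
final['coup'] = (final['Number_of_coups'] > 0).astype(int)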

Modeling

Random Forests

# select features and target from the dataset
# features
X = final.loc[:,['democracy_index', 'oil_producing',
'population','gini_index','Number_of_coups', 'gdp_per_capita']]
# target
y = final['coup']

# create the train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
 test_size=0.3)

#import random forest model
from sklearn.ensemble import RandomForestClassifier

# create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)

# Train the model using the training data
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)

# model accuracy, how often is the classifier correct?
print('Accuracy: ', metrics.accuracy_score(y_test, 
y_predicted))

Accuracy: 1.0

An accuracy of 1.0 looks perfect, but it is suspicious: the model could be overfitting, and the ‘Number_of_coups’ feature is derived from the same coup records as the target, so it may simply leak the answer. Let’s try cross-validation.

# Using StratifiedKFold
from sklearn.model_selection import StratifiedKFold, cross_val_score

# create train/test split - stratified
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# set up StratifiedKFold CV for training set
skf = StratifiedKFold(n_splits=5, shuffle=True, 
random_state=42)

# create model and run data
rf_model = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(rf_model, X_train, 
y_train, cv=skf, scoring='accuracy')

print('CV accuracy per fold: ', cv_scores)
print('Mean CV accuracy: ', cv_scores.mean())

# fit model on full training set
rf_model.fit(X_train, y_train)

# Evaluate on holdout test set
test_accuracy = rf_model.score(X_test, y_test)
print('Holdout test accuracy:', test_accuracy)

CV accuracy per fold: [0.96 1. 1. 1. 1. ]
Mean CV accuracy: 0.992
Holdout test accuracy: 1.0

We get the same high accuracy even with cross-validation. Since the number of countries in the world is limited, we can increase the complexity of the data by adding columns that capture other dimensions of a country, such as the unemployment rate, the youth population, average income, economic complexity, the happiness index, the UN Human Development Index, trade data, geographical data, poverty data, etc.
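Any such indicator could be folded in with the same merge pattern used earlier. A sketch with a hypothetical unemployment table (the file and column names are assumptions):

# 'unemployment.csv' is hypothetical, with 'country' and 'unemployment_rate' columns
unemployment = pd.read_csv(r'/content/drive/MyDrive/unemployment.csv')
# a left join keeps the existing set of countries
final = pd.merge(final, unemployment, on='country', how='left')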

Feature Importance

# feature importance on the basic random forest model
feature_imp = pd.Series(clf.feature_importances_, 
index=X.columns).sort_values(ascending=False)
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.legend()
plt.show()
Feature importance for the basic random forest model

Let us see what the cross-validated model has to say about feature importance.
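This just points the earlier feature-importance snippet at rf_model instead of clf:

# feature importance on the cross-validated random forest model
feature_imp_cv = pd.Series(rf_model.feature_importances_,
index=X.columns).sort_values(ascending=False)
sns.barplot(x=feature_imp_cv, y=feature_imp_cv.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title('Visualizing Important Features (CV Model)')
plt.show()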

Feature importance for the cross-validated random forest model

To predict whether a country may experience a coup, we pass its democracy index, whether it is oil-producing (1) or not (0), its population, Gini index, number of past coups, and GDP per capita, in that order, to the model’s predict function.

# example prediction for a fictitious country
clf.predict(pd.DataFrame([[0.1, 1, 4569875, 64.5, 1, 956.32]], columns=X.columns))

array([1])

This predicts a coup in the fictitious country, which is not surprising considering its abysmal democracy index, history of a previous coup, and low GDP per capita.

I will try this data on a logistic regression model and a neural net to see how they play out.