Comparison of methods used to predict mass killings
1. Comparison of methods used to predict mass killings
Surui Sun
Chufeng Hu
Yidan Sun
Department of Statistics
UCLA
March 8, 2016
2. Outline
Mass killing has long been a focus of worldwide concern
Source of data and reference:
http://www.earlywarningproject.com/data
Purpose: to predict the probability of occurrence
Approaches to modeling:
Logistic, Lasso, Ridge
Missing data, Random Forest, Random Uniform Forest
ROC curve, Heatmap
4. Data Dictionary
Introduction:
to generate statistical assessments of the risk of onset of state-led mass-killing episodes in countries worldwide
Output variable:
mkl.start.1: yi ∈ {−1, 1}
Input variables:
reg.afr, reg.amr: indicators for different regions
mkl.ongoing: any ongoing episode of state-led mass killing
mkl.ever: any state-led mass killing since WWII
countryage.ln: log of country age
wdi.popsize.ln: log of population size
. . .
Data: 32 variables, 9163 observations
5. Lasso vs. Ridge
Lambda chosen by cv.glmnet in the glmnet package (K-fold CV)
Lasso regularization does not work very well here: the coefficients are very small, so under the lasso penalty all of them shrink to 0
For N >> P with highly correlated variables, ridge does better than lasso
Logistic: fit logistic regression models and predict logistic probabilities
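The lambda selection and lasso/ridge comparison above can be sketched with cv.glmnet. The data below are simulated stand-ins, not the project's variables:

```r
library(glmnet)

# Simulated stand-in data (illustrative only): X is the predictor
# matrix, y the binary outcome.
set.seed(1)
X <- matrix(rnorm(200 * 10), nrow = 200)
y <- rbinom(200, 1, plogis(X[, 1] - X[, 2]))

# alpha = 1 gives the lasso penalty, alpha = 0 gives ridge;
# cv.glmnet picks lambda by K-fold cross-validation (default K = 10).
cv_lasso <- cv.glmnet(X, y, family = "binomial", alpha = 1)
cv_ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0)

# lambda.min is the lambda minimizing the CV error; under lasso,
# small coefficients tend to be shrunk all the way to 0.
coef(cv_lasso, s = "lambda.min")

# Predicted probabilities at the selected lambda
p_ridge <- predict(cv_ridge, newx = X, s = "lambda.min", type = "response")
```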
6. Missing Data
The imputation approach:
impute each missing value with the mean or median of the nonmissing values for that feature
Use the R package imputeMissings
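The median-imputation idea can be sketched in base R (the slides use the imputeMissings package; this stand-alone version only covers numeric columns):

```r
# Median imputation for numeric columns, written out in base R.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      med <- median(df[[col]], na.rm = TRUE)
      df[[col]][is.na(df[[col]])] <- med
    }
  }
  df
}

# Toy example: NA in column a becomes median(1, 3) = 2,
# NA in column b becomes median(2, 4) = 3.
d <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 4))
impute_median(d)
```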
7. Random Forest and Random Uniform Forest
Random Forest:
R package randomForest
mtry = 3, ntree = 1000
Random Uniform Forest:
R package randomUniformForest
mtry = 3, ntree = 1000
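The random forest fit with the parameters above can be sketched as follows; X and y are simulated stand-ins for the real data (the randomUniformForest fit is analogous):

```r
library(randomForest)

# Simulated stand-in for the mass-killing dataset: rare positive class.
set.seed(1)
X <- data.frame(matrix(rnorm(500 * 5), nrow = 500))
y <- factor(rbinom(500, 1, 0.1))

# Parameters from the slide: mtry = 3 candidate variables per split,
# ntree = 1000 trees.
rf <- randomForest(x = X, y = y, mtry = 3, ntree = 1000)

# Class probabilities rather than hard labels, for the ROC analysis later
p_rf <- predict(rf, X, type = "prob")[, "1"]
```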
8. AdaBoost
Objective solved at each boosting step m:
(βm, Gm) = argmin over β, G of Σ_{i=1}^{N} exp[−yi (f_{m−1}(xi) + β G(xi))]
The individual classifiers Gm(x) ∈ {−1, 1}
R package: adabag
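A minimal adabag sketch of the boosting fit, on simulated stand-in data (mfinal, the number of boosting iterations, is chosen here for illustration only):

```r
library(adabag)

# Simulated stand-in data; boosting() requires a factor response.
set.seed(1)
d <- data.frame(matrix(rnorm(300 * 4), nrow = 300))
d$y <- factor(rbinom(300, 1, plogis(d$X1)))

# mfinal controls the number of boosting iterations M
fit <- boosting(y ~ ., data = d, mfinal = 50)

# predict.boosting returns both hard classes and class probabilities
pred <- predict(fit, newdata = d)
head(pred$prob)
```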
9. ROC
Why ROC?
The event is rare, so the data are mostly 0s and a random forest will predict 0 almost everywhere, which makes raw accuracy uninformative. We therefore use the ROC curve to see the effect of different classification thresholds.
11. ROC curve
Use the ROCR package
We find that RUF (Random Uniform Forest) is the best method
[Figure: ROC curves (true positive rate vs. false positive rate); AUC values:]
RF = 0.813
RUF = 0.821
Ridge = 0.813
Logistic = 0.790
AdaBoost = 0.738
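The ROC/AUC computation with ROCR can be sketched like this; the scores and labels are simulated stand-ins rather than the project's model outputs:

```r
library(ROCR)

# Simulated stand-ins: labels are the true 0/1 outcomes,
# scores the predicted probabilities from some model.
set.seed(1)
labels <- rbinom(200, 1, 0.1)
scores <- plogis(rnorm(200) + 2 * labels)

pred <- prediction(scores, labels)

# ROC curve: true positive rate against false positive rate
# as the threshold varies.
perf <- performance(pred, "tpr", "fpr")
plot(perf)

# Area under the curve (the quantity reported in the figure)
auc <- performance(pred, "auc")@y.values[[1]]
```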
12. Heatmap
Why a heatmap?
It is difficult to predict the rare class 1 directly, so we show predicted probabilities rather than hard classifications to compare the relative risk across countries, and plot the top 30 countries with the largest probabilities.
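The final step can be sketched in base R as a one-column heat image of the top 30 risks; the country names and probabilities here are made up for illustration:

```r
# Simulated stand-in: one predicted probability per country.
set.seed(1)
probs <- setNames(runif(100), paste0("country_", 1:100))

# Rank countries by predicted probability and keep the top 30
top30 <- sort(probs, decreasing = TRUE)[1:30]

# One-column heatmap: each row is a country, colour encodes relative risk
image(x = 1, y = 1:30, z = t(as.matrix(rev(top30))),
      col = heat.colors(30), axes = FALSE, xlab = "", ylab = "")
axis(2, at = 1:30, labels = names(rev(top30)), las = 2, cex.axis = 0.5)
```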