Comparison of methods used to predict mass killing
Surui Sun
Chufeng Hu
Yidan Sun
Department of Statistics
UCLA
March 8, 2016
Outline
Mass killing has long been a focus for worldwide concern
Source of data and reference:
http://www.earlywarningproject.com/data
Purpose: to predict the probability of occurrence
Approaches to modeling:
Logistic, Lasso, Ridge
Missing data, Random Forest, Random Uniform Forest
ROC curve, heatmap
Data Dictionary
Introduction:
to generate statistical assessments of the risk of onsets of
state-led mass killing episodes in countries worldwide
output variable:
mkl.start.1: yi ∈ {−1, 1}
input variable:
reg.afr, reg.amr: indicators for different regions
mkl.ongoing: any ongoing episodes of state-led mass killing
mkl.ever: any state-led mass killing since WWII
countryage.ln: log of country age
wdi.popsize.ln: log of population size
. . .
Data size: 32 variables, 9163 observations
Lasso vs. Ridge
Lambda is chosen by K-fold cross-validation, using cv.glmnet in
the glmnet package
Lasso regularization doesn't work very well:
the coefficients are very small. Once penalized, all of them
shrink to 0
for N >> P with highly correlated variables, ridge does
better than lasso
Logistic: fitting logistic regression models and predicting the
logistic probability
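The contrast between the two penalties can be sketched in a few lines. This is a minimal illustration in Python, not the glmnet fit used in the slides. It assumes an orthonormal design, where both estimators have closed forms: lasso soft-thresholds each least-squares coefficient, while ridge only rescales it. The function names, coefficient values and `lam` are hypothetical.

```python
def lasso_shrink(beta_ols, lam):
    """Soft-thresholding: the lasso solution under an orthonormal design."""
    shrunk = []
    for b in beta_ols:
        if abs(b) <= lam:
            shrunk.append(0.0)  # small coefficients are set exactly to zero
        else:
            shrunk.append(b - lam if b > 0 else b + lam)
    return shrunk


def ridge_shrink(beta_ols, lam):
    """Proportional shrinkage: the ridge solution under an orthonormal design."""
    return [b / (1.0 + lam) for b in beta_ols]


beta_ols = [0.04, -0.02, 0.01]  # hypothetical small least-squares coefficients
lam = 0.05                      # hypothetical penalty strength

lasso_coefs = lasso_shrink(beta_ols, lam)
ridge_coefs = ridge_shrink(beta_ols, lam)
print(lasso_coefs)
print(ridge_coefs)
```

When every coefficient is smaller in magnitude than the penalty, lasso removes all of them, whereas ridge keeps each one at a reduced size.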
Missing Data
The imputation approach:
impute each missing value with the mean or median of the
non-missing values for that feature.
Use the R package imputeMissings
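The imputation rule itself is simple enough to sketch directly. The snippet below is a Python illustration of mean/median imputation for one numeric feature. It is not the imputeMissings implementation. The helper `impute`, the use of `None` for missing values and the sample column are assumptions made for the example.

```python
from statistics import mean, median


def impute(column, method="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed) if method == "median" else mean(observed)
    return [fill if v is None else v for v in column]


# Hypothetical values for a feature such as wdi.popsize.ln, with two gaps.
popsize_ln = [9.1, None, 10.4, 8.7, None]

imputed_median = impute(popsize_ln, method="median")
imputed_mean = impute(popsize_ln, method="mean")
print(imputed_median)
print(imputed_mean)
```

The median is less sensitive to extreme values than the mean, which may matter for skewed features.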
Random Forest and Random Uniform Forest
Random Forest:
R package randomForest
mtry=3 ntree=1000
Random Uniform Forest:
R package randomUniformForest
mtry=3, ntree=1000
AdaBoost
function:
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\left[-y_i \left(f_{m-1}(x_i) + \beta\, G(x_i)\right)\right]$$
where the individual classifiers $G_m(x) \in \{-1, 1\}$
R package: adabag
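For a fixed weak classifier G, the exponential-loss criterion above has a closed-form minimizer in β: half the log-odds of the weighted accuracy, with weights w_i = exp(−y_i f_{m−1}(x_i)). The Python sketch below checks this numerically. It is an illustration of one boosting step under toy data, not the adabag implementation, and all names and values in it are hypothetical.

```python
import math


def exp_loss(y, f_prev, g, beta):
    """Exponential loss of the updated model f_prev + beta * g."""
    return sum(math.exp(-yi * (fi + beta * gi))
               for yi, fi, gi in zip(y, f_prev, g))


def best_beta(y, f_prev, g):
    """Closed-form beta minimizing the exponential loss for a fixed classifier g."""
    weights = [math.exp(-yi * fi) for yi, fi in zip(y, f_prev)]
    err = sum(w for w, yi, gi in zip(weights, y, g) if yi != gi) / sum(weights)
    return 0.5 * math.log((1.0 - err) / err)


y = [1, -1, -1, 1, -1]   # toy labels in {-1, 1}
f_prev = [0.0] * 5       # first step: no model fitted yet
g = [1, -1, 1, 1, -1]    # toy weak classifier with one mistake (third case)

beta = best_beta(y, f_prev, g)
print(beta)
print(exp_loss(y, f_prev, g, beta))
```

The formula requires a weighted error strictly between 0 and 1; a weak classifier with no mistakes would need separate handling.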
ROC
Why ROC?
Onsets are rare events, so the outcome is mostly 0. A random
forest then predicts almost every case as 0, and the prediction
error for class 1 approaches 1. Thus, we use the ROC curve to
show how different thresholds affect classification.
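The threshold sweep behind a ROC curve can be sketched as follows. This is a plain-Python illustration of the idea, not the ROCR code used for the slides. The labels are coded 1 for an onset and 0 otherwise, which is an assumption for the example, and the scores are made up.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) at every distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical rare-event data: 2 onsets among 10 country-years.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.20, 0.30, 0.04]

points = roc_points(labels, scores)
area = auc(points)
print(points)
print(area)
```

With a fixed cutoff of 0.5, every case here would be classified as 0, yet the ranking of the scores still separates most onsets from non-onsets, which is what the curve captures.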
ROC curve
use ROCR package
We find that RUF (Random Uniform Forest) is the best method
[Figure: ROC curves for each method; vertical axis: true positive rate. Legend:]
RF = 0.813
RUF = 0.821
Ridge = 0.813
Logistic = 0.790
AdaBoost = 0.738
Heatmap
Why Heatmap?
It is difficult to predict the positive class (1), so we use the
predicted probability instead of the classification to show the
relative risk in different countries. We can also plot the top 30
countries with the largest predicted probabilities.
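Ranking countries by predicted probability is a one-step sort. The Python sketch below illustrates it; the country names, probabilities and the helper `top_k` are hypothetical and do not come from the slides' data.

```python
def top_k(probabilities, k=30):
    """Return the k (country, probability) pairs with the largest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical predicted probabilities of an onset, one per country.
predicted = {
    "Country A": 0.012,
    "Country B": 0.087,
    "Country C": 0.045,
    "Country D": 0.003,
}

top2 = top_k(predicted, k=2)
for name, prob in top2:
    print(name, prob)
```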
Heatmap
use rworldmap package
Dot plot