UN Millennium
Development Goals
FORECASTING MULTIPLE TIME SERIES USING
AUTOREGRESSIVE MODELS
HEATHER E. ADAMS
DS-DC-14
3 OCTOBER 2016
The Question
Forecast changes in indicators of the U.N.’s eight goals,
e.g. ensuring environmental sustainability and reducing
child mortality
Hypotheses:
◦ Worldwide macroeconomic indicators can be predicted
from other collected data
◦ Time series can be forecast from past trends
https://www.drivendata.org/competitions/1/
The Data
DrivenData competition using World Bank
Open Data
◦ Zipped CSV file
◦ Training data are World Bank macroeconomic indicators
(1972-2007) for 214 countries
◦ Missing data
◦ Predict 2008 and 2012 for 737 parameter series
◦ 1 to 6 parameters per country
◦ Out of 195,402 parameters
◦ E.g. “Generosity of All Social Protection,” “Net official flows from
UN agencies,” and “Presence of peace keepers.”
http://data.worldbank.org/
Exploratory Analysis
All measures contain at least some missing values
◦ Calling .dropna() therefore removes the entire dataset
Sierra Leone
◦ Environmental sustainability
◦ Lagged correlation matrix of data from single country
◦ Most correlated with a change in “Gross National
Expenditure” from the previous year.
However, the goal is to predict 737 parameters
◦ Exploratory graphs indicate most may be simple trends
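
The two exploratory observations above can be reproduced on toy data. The sketch below is not the author's code: the frame `wide`, its series names, and its values are hypothetical, laid out with one row per series and one column per year. It shows why `.dropna()` leaves nothing when every series has a gap, and one way to compute a lagged correlation between a target series and the previous year's change in the other series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per series, one column per year.
# Every row has at least one gap, as on the slide.
years = list(range(2000, 2008))
wide = pd.DataFrame(
    [
        [np.nan, 1.2, 1.5, 1.4, 1.8, 2.1, 2.0, 2.4],
        [10.0, 11.0, 13.0, 12.0, 15.0, 17.0, 16.0, np.nan],
        [5.0, np.nan, 4.0, 6.0, 5.0, 7.0, np.nan, 6.0],
    ],
    index=["target", "expenditure", "other"],
    columns=years,
)

# dropna() discards any row containing a NaN, so no rows survive.
rows_left = len(wide.dropna())

# Put years on the index so that diff() and shift() operate along time.
by_year = wide.T

# Year-over-year change in each series, moved forward one year so that
# the row for year t holds the change observed in year t-1.
lagged_change = by_year.diff().shift(1)

# Correlate the target's level with each other series' lagged change.
# Missing pairs are skipped by pandas when computing each correlation.
corrs = lagged_change.drop(columns="target").corrwith(by_year["target"])
strongest = corrs.abs().idxmax()
```

With real data, `strongest` would name the series whose prior-year change tracks the target most closely, which is how a result like the “Gross National Expenditure” finding could be read off.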
The Model
Time series AR Model
For a single series
◦ Select years with data
◦ AR model
◦ Test/train split
◦ Forecast 2008, 2012
For series ‘753’, predicting the 2000-2007 test split
yielded a mean absolute error of 0.015 when trained on
1972-1999.
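
The single-series workflow above can be sketched as follows. The slides do not state the library, the lag order, or the fitting method (the convergence warning shown later suggests a maximum-likelihood fit), so this example makes its own assumptions: it fits an AR model by ordinary least squares with NumPy, uses one lag, and runs on a hypothetical noise-free series rather than series ‘753’. The helper names `fit_ar` and `forecast_ar` are illustrative.

```python
import numpy as np

def fit_ar(series, lags=1):
    """Fit y[t] = c + sum_k phi[k] * y[t-k] by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    # Column k holds the series lagged by k+1 steps, aligned with y[lags:].
    lagged = [y[lags - k - 1 : n - k - 1] for k in range(lags)]
    X = np.column_stack([np.ones(n - lags)] + lagged)
    coef, *_ = np.linalg.lstsq(X, y[lags:], rcond=None)
    return coef  # [c, phi_1, ..., phi_lags]

def forecast_ar(history, coef, steps):
    """Iterate the fitted recursion forward, feeding forecasts back in."""
    lags = len(coef) - 1
    values = list(np.asarray(history, dtype=float))
    for _ in range(steps):
        recent = values[-1 : -lags - 1 : -1]  # most recent value first
        values.append(coef[0] + float(np.dot(coef[1:], recent)))
    return np.array(values[-steps:])

# Hypothetical noise-free AR(1) series for 1972-2007, not competition data.
years = np.arange(1972, 2008)
y = np.empty(len(years))
y[0] = 0.2
for t in range(1, len(y)):
    y[t] = 0.1 + 0.8 * y[t - 1]

# Train on 1972-1999, test on 2000-2007, score with mean absolute error.
train, test = y[years <= 1999], y[years >= 2000]
coef = fit_ar(train, lags=1)
pred = forecast_ar(train, coef, steps=len(test))
mae = float(np.mean(np.abs(pred - test)))

# Refit on everything through 2007, then step forward five years.
future = forecast_ar(y, fit_ar(y, lags=1), steps=5)  # 2008 .. 2012
pred_2008, pred_2012 = future[0], future[-1]
```

Because the toy series has no noise, the error here is essentially zero. On real indicators the same split gives an honest estimate of multi-year forecast error, since each test-year prediction is built only from earlier predictions.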
Scaling Up
Applying to entire dataframe
◦ Prediction series only (737)
AR model
◦ On data through 2007 to forecast 2008 and 2012
Discrete periods of data collection
◦ 1972-1989 (506 to 565 missing values)
◦ 1990-1995 (77 to 233 missing values)
◦ 1996-2007 (up to 80 missing values)
Replace missing data with series mean
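
A minimal sketch of the preparation step described above, assuming the data sit in a pandas frame with one row per series and integer year columns. The frame `wide`, the helper `prepare`, and the values are hypothetical. It counts missing values per year, restricts the frame to a chosen year range, and fills each series' gaps with that series' own mean over the range.

```python
import numpy as np
import pandas as pd

def prepare(wide, start_year, end_year=2007):
    """Keep columns start_year..end_year and mean-fill each row's gaps."""
    window = wide.loc[:, start_year:end_year]  # label slice, end inclusive
    # Row means skip NaN; a row with no data in the window stays all-NaN.
    return window.apply(lambda row: row.fillna(row.mean()), axis=1)

# Hypothetical toy frame: two series over four years.
wide = pd.DataFrame(
    {
        2004: [1.0, np.nan],
        2005: [np.nan, 4.0],
        2006: [3.0, np.nan],
        2007: [np.nan, 8.0],
    },
    index=["a", "b"],
)

# Missing values per year, the quantity used to pick collection periods.
missing_per_year = wide.isna().sum()

# Restrict to 2005-2007 and fill gaps with each series' mean.
filled = prepare(wide, 2005)
```

Filling with the mean of the restricted window rather than the full history is a design choice of this sketch. The slides say only that missing data were replaced with the series mean.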
Model Evaluation
Prediction of 2008 and 2012 for 737 series
Submissions to DrivenData are scored by root mean
squared error. All missing data were replaced with the
series mean.
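
For reference, root mean squared error can be computed as in the sketch below. This is the standard definition. How DrivenData pools the 2008 and 2012 predictions across the 737 series is not stated in the slides, so the function simply takes two equal-length arrays.

```python
import numpy as np

def rmse(actual, predicted):
    """Square root of the mean squared difference between two arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))
```

Squaring penalizes large misses more heavily than mean absolute error does, so a few badly forecast series can dominate the score.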
ConvergenceWarning: Maximum Likelihood optimization failed to converge.
Training date range   Most missing data per year   RMSE     Convergence warnings
1972-2007             567                          0.0930   33
1990-2007             233                          0.0888   4
1996-2007             80                           0.0806   14
Strengths and Weaknesses
Best root mean squared error: 0.0806
◦ Rank 115 out of 909
Benchmarks
◦ for status quo: 0.0734
◦ for simple linear regression (uses data from 2006 and
2007 only): 0.0678
◦ Best submission so far: 0.0457
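
The two benchmark ideas can be sketched as below. The exact definitions used by DrivenData are not given in the slides, so this rests on assumptions: "status quo" is read as carrying the 2007 value forward, and the linear benchmark as a straight line through the 2006 and 2007 values. The function names are illustrative.

```python
def status_quo(y_2007):
    """Assume nothing changes: reuse the 2007 value for both target years."""
    return y_2007, y_2007

def linear_from_last_two(y_2006, y_2007):
    """Extend the straight line through the 2006 and 2007 values."""
    slope = y_2007 - y_2006           # change per year
    pred_2008 = y_2007 + 1 * slope    # one year past 2007
    pred_2012 = y_2007 + 5 * slope    # five years past 2007
    return pred_2008, pred_2012

# Hypothetical values for one series.
flat_2008, flat_2012 = status_quo(0.42)
line_2008, line_2012 = linear_from_last_two(0.40, 0.42)
```

That both benchmarks scored better than the AR submissions is consistent with the earlier observation that most series follow simple trends.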
Problems:
◦ Overlooks interactions with other parameters
◦ Convergence issue
Conclusions
Developed model that can be applied across the
entire dataset
Largest impact
◦ How missing data are handled
◦ Confirms that the majority of the data can be subset to
recent data
◦ Simple models applicable
Interaction with other parameters to be
determined
Next Steps
4 months left in the competition, for glory
Moving forward
◦ Optimize year range of training dataset
◦ Missing value replacement method
◦ Address convergence issue
Alternate model
◦ Random forest with interactions from the larger data set
Acknowledgements
Alex Egorenkov
John Tate
DrivenData
General Assembly
Stack Overflow
