Susanne: World Recession Susceptibility Analysis
This presentation covers my journey in making Susanne, available at http://susanne.bitballoon.com/.
This is a statistical computation and semantic web project, and I sort of went the extra mile. We find that export and banking economic variables best predict recession susceptibility, which can be gauged with up to 92% accuracy using SVM classifiers.
The analysis uses World Bank data and R tooling. The website is built with D3.js and AngularJS, and does not use a backend.
2. Overview
Objective and Motive
Process
Data
Data Collection
Data Cleanup
Dependent Variable
Analysis
Regression, Classification
Clustering and Results
Further Work
Presentation
Drawbacks
3. Objective
Key Questions
To study what economic composition makes economies more susceptible to Global Recession
By how much? How significantly?
Can we predict recession impact?
Studying similarity of economies
How to quantify
Recession impact
Susceptibility
Key Goals
Empirically ascertain significance and impact of certain economic traits vis-à-vis expert opinions
Develop a powerful model to predict recession susceptibility
Present a global and intuitive view across parameters and their significance, i.e., Susanne.
6. Data Collection
60+ economic variables linked to Recession
Allegedly, as per sources like The Economist, Forbes, the World Bank, and the web
Or as per our suspicion
Preferably country-specific ratios (normalized and structural information)
13 years, from 2000 to 2013
210 countries
2500 rows of 60+ columns, each row identified by a country-year pair
Source:
World Bank (OECD National Account File): http://data.worldbank.org/
United Nations Comtrade Database, International Monetary Fund, Direction of Trade Database, Balance of Payments Database, and more.
7. Simple Enough?
No, it's blistering gunk.
Only 86 out of ~2500 rows have complete data (no NAs)
Non-normalized values
What now?
8. Data Cleanup
Manually add values for nearly complete columns
Compress and remove years 2008-2010
Observed Class Variables – not Causal
Remove countries with almost no data (Afghanistan and 30 others)
Down to 2100 rows
Drop columns if:
Significant, but with very little data available
Yearly values for super-specific variables like
“Merchandise Exports to Scandinavian Countries as % of exports”
We surely don’t have this for most countries, especially those like Albania
Insignificant
Determined from regression analysis (MLR) p-values
9. Still too many missing values
What to do?
Drop row (done)
Weighted expansion of row (did not consider)
Infer a value
Average value of the parameter for the country over 13 years
Result: Fixed 5000 out of 70K cells
Still only 86 full rows
Why?
No value for the property exists for a country at all…
Solution: Global Average – the sacrilege!
Awful. It pulls values towards the mean and can cause misclassification.
But no directional bias.
Trade-off: Unlocks a world of data
For a specific country…
Year   V1   Pred   V2   Pred
2001   15   15     15   15
2002   17   17     17   17
2003   15   15     19   19
2004   16   16     21   21
2005   ?    16     ?    18
Avg    16          18
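The two-step fill described on this slide (country average first, global average as a last resort) can be sketched as follows. This is a minimal Python stand-in, not the R code used for Susanne; the function name `impute` and the country keys "A", "B", "C" are made up for illustration, and the numbers for "A" and "B" are the V1 and V2 columns from the table above.

```python
from statistics import mean

def impute(table):
    """Fill gaps for ONE variable.

    table maps country -> list of yearly values, with None for a missing cell.
    Step 1: use the country's own average over its observed years.
    Step 2: if the country has no observations at all, use the global average.
    """
    observed = [v for series in table.values() for v in series if v is not None]
    global_mean = mean(observed)
    filled = {}
    for country, series in table.items():
        own = [v for v in series if v is not None]
        fallback = mean(own) if own else global_mean
        filled[country] = [fallback if v is None else v for v in series]
    return filled

# Hypothetical input: "A" and "B" reuse the V1/V2 columns, "C" has no data at all.
example = {
    "A": [15, 17, 15, 16, None],
    "B": [15, 17, 19, 21, None],
    "C": [None, None, None, None, None],
}
filled = impute(example)
print(filled["A"][-1])  # 15.75 (the slide shows it rounded to 16)
print(filled["B"][-1])  # 18
print(filled["C"][-1])  # 16.875, the global average of all observed cells
```

Country "C" shows the trade-off named on the slide: every one of its cells becomes the global average, which has no directional bias but pulls the row towards the mean.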
10. Dependent Variable
Goal: Capture Recession Impact between 2008-10
Technical Definition: Absolute Growth Rate
Problem: 15% to 1%, still not recession
Average Growth Rate drop between 2008 – 2010?
Problem: 1% drop for UK (at 0.5%) vs. AFG (at 15%)
Percentage Drop in Growth Rate
Problem: 0.1% to 3%, 3000% change!
Solution?
Drop.SD: Drop in Growth Rate over 2008 – 2010, measured in number of Standard Deviations.
Standard deviation taken from the variance in Growth Rate over the last 20 years.
Variance: needed a lot of manual data collection
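A minimal sketch of the Drop.SD idea, under assumptions: the slide does not say which growth rates are compared, so `before` and `during` are hypothetical parameters (growth rate before and during the recession), `history` stands for the 20 years of growth rates, and the sample standard deviation is assumed. The growth histories below are invented to mimic a stable and a volatile economy.

```python
from statistics import stdev

def drop_sd(history, before, during):
    """Drop in growth rate, in units of the country's own historical SD.

    history: past yearly growth rates (percent), e.g. the last 20 years
    before:  growth rate before the recession (percent)
    during:  growth rate during the recession (percent)
    """
    return (before - during) / stdev(history)

# Invented histories: a stable economy vs. a volatile one.
stable = [1.0, 3.0] * 10      # growth swings by about 1 point around 2%
volatile = [5.0, 25.0] * 10   # growth swings by about 10 points around 15%

# The same 1-point drop counts for much more in the stable economy.
print(drop_sd(stable, before=2.0, during=1.0))
print(drop_sd(volatile, before=15.0, during=14.0))
```

Scaling by each country's own volatility is what lets a 1-point drop in a low-growth, stable economy be compared with the same drop in a high-growth, volatile one.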
11. Distribution and Discretization of Drop.SD
Corresponded well with web information about countries that “avoided recession” and those “hit worst”
21% of the countries labelled as unsusceptible
Less than -0.25 SD drop in GDP during 2007-2010
Middle 36% labelled as relatively unaffected
Between -0.25 and 0.75 SD drop in GDP during the recession
Highest 43% adversely affected
>0.75 SD drop in GDP
How good was this division?
lm R-squared rose from 41.58% to 44.14% (no loss in predictive power, i.e. reasonable classification)
Good split. Most countries were affected horribly.
Other Options
• Equal Density Split
• Maximise Classification Accuracy
• Purely Contextual
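The three-way split used on this slide can be written as a small function. The cut-offs -0.25 and 0.75 and the class labels come from the slide; the function name and the handling of values exactly on a cut-off (assigned to the middle class) are assumptions.

```python
def susceptibility_class(drop_sd, low=-0.25, high=0.75):
    """Map a Drop.SD value to one of the three classes from the slide."""
    if drop_sd < low:
        return "unsusceptible"
    if drop_sd <= high:  # assumption: boundary values fall in the middle class
        return "relatively unaffected"
    return "adversely affected"

for value in (-0.5, 0.2, 1.3):
    print(value, susceptibility_class(value))
```

Keeping the cut-offs as parameters makes it easy to try the other options listed here, such as an equal density split.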
13. Classification: Can we predict Recession Susceptibility?
Assumptions (non-trivial)
Drop.SD correctly represents Recession Impact in 08-10
Recession Impact in 08-10 correctly represents Recession Susceptibility
We can’t do better than guess at these.
SVM has 92% accuracy, so it seems we can.
Caveat:
Bootstrap Analysis: Training Data = Test Data
Workaround (can’t generate new countries or years):
5-Fold Cross Validation
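A sketch of why 5-fold cross validation addresses the caveat: every row is scored by a model that never saw it during training. To stay self-contained, the classifier below is a nearest-centroid stand-in, NOT the SVM from the slides, and the toy data is invented; all function names are hypothetical.

```python
import random
from statistics import mean

def k_fold_indices(n, k=5, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Stand-in classifier (not an SVM): one mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def cross_val_accuracy(X, y, k=5, seed=0):
    """Average accuracy over k folds; each fold is held out exactly once."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return mean(scores)

# Invented toy data: two well-separated groups of "economies".
X = [[i * 0.1, 0.0] for i in range(10)] + [[10 + i * 0.1, 0.0] for i in range(10)]
y = ["unaffected"] * 10 + ["affected"] * 10
print(cross_val_accuracy(X, y))
```

On real data the cross-validated score is usually lower than the score obtained when training data equals test data, which is the point of the caveat.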
14. Important Variables and their Impact
Using Multiple Linear Regression (MLR), for each Economic Variable we get the degree and direction of its impact on Drop.SD: the size of its coefficient gives the degree, the sign gives the direction.
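How degree and direction fall out of a regression can be sketched with a tiny least-squares fit. This is a plain-Python stand-in that solves the normal equations, not the R fit used for Susanne; the function name, the two variables, and the data are invented, and p-values are not computed here.

```python
def ols_coefficients(X, y):
    """Least-squares coefficients [intercept, b1, ..., bp] via the normal equations.

    Assumes the columns of X are not perfectly collinear (otherwise a pivot is zero).
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Invented data generated from Drop.SD = 1 + 2*v1 - 3*v2 (no noise).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1 + 2 * v1 - 3 * v2 for v1, v2 in X]
intercept, b1, b2 = ols_coefficients(X, y)
for name, b in (("v1", b1), ("v2", b2)):
    direction = "raises" if b > 0 else "lowers"
    print(f"{name}: {direction} Drop.SD, degree {abs(b):.2f}")
```

Comparing degrees across variables only makes sense if the variables are on comparable scales, which is one reason normalized ratios are preferable as inputs.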
15. Clustering
Motivation:
“Are we brute-engineering a predictor, or is there an actual underlying economic structural pattern of recession-susceptibility?”
70% accuracy (consistent clusters) with k-means, using multiple k values.
You decide.
17. Allowing similarity checks between Economies
Overall
Over Economic Categories
Using Semantic Web compliant Cardinality Checks and Ontology
Classify Economic Variables into one or more of:
Central Government
Economic Structure
Net Exports
Banking
Manufacturing
GDP
Discretize them into , and over:
Value
And Impact
Using middle 80 percentile cut-offs
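One reading of "middle 80 percentile cut-offs" is that the 10th and 90th percentiles split each variable into three bins. The sketch below assumes that reading; the bin labels "low", "mid" and "high" are hypothetical, since the labels on the slide did not survive, and the input values are invented.

```python
from statistics import quantiles

def discretize(values):
    """Label each value relative to the 10th and 90th percentile of the sample."""
    deciles = quantiles(values, n=10)  # 9 cut points: 10th, 20th, ..., 90th percentile
    lo_cut, hi_cut = deciles[0], deciles[-1]
    labels = []
    for v in values:
        if v < lo_cut:
            labels.append("low")
        elif v > hi_cut:
            labels.append("high")
        else:
            labels.append("mid")  # the middle 80 percent
    return labels

# Invented sample: 100 evenly spread values.
labels = discretize(list(range(1, 101)))
print(labels.count("low"), labels.count("mid"), labels.count("high"))  # 10 80 10
```

The same function could be applied to both the value of a variable and its impact, the two dimensions named on the slide.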
Further Work - Accessibility