Global Indicators of High-Growth Economies
Predicting high GDP growth factors for national economies
MIE 465 — Analytics in Action
April 13, 2018
Oghosa Igbinakenzua
Kamil Yilanci
Minja Zhu
Chris Zhu
Department of Mechanical and Industrial Engineering
University of Toronto
Abstract
This project aims to understand how a nation can effectively allocate resources to drive economic growth.
First, we identify the key features that predict growth; then we identify the countries on the verge of high
growth and what they can invest in to further drive their GDP growth. We used the full World Bank
World Development Indicators database, which covers 217 countries and territories from 1960 to 2016 across
1574 indicators. The binary target variable “high-growth” marks countries that sustained annual
GDP growth above the world average for 9 out of 10 consecutive years, starting from a GDP above US$10
billion. Several iterations of logistic regression were used to identify the significant features and their weights
in predicting the “high-growth” variable. Several CART models were also created to cross-validate the key
features and produce a more interpretable storyline. Using logistic regression, we identified total fisheries
production, urban population growth, and life expectancy as key features. Fisheries was a surprising
finding at first, but it reflects a country’s industrialization and utilization of natural resources, and is a proxy
for access to seagoing trade. Applying the key features back to our country and indicator data, we predicted
that the countries that will experience high growth between 2018 and 2024 include Brazil, Ukraine, Turkey,
Panama, and Cuba. A full map below highlights developed “high-income” countries, the 2006-16 high-growth
countries, and our predicted 2018-24 high-growth countries.
1 Introduction
Our goal is to determine what makes high-growth developing countries unique and how other countries could
leverage similar characteristics. The BRICS countries (Brazil, Russia, India, China and South Africa) have
been recognized for their economic size and growth over the past 40 years, which has led to their new-found
economic and political influence. We first identify a subset of countries that experienced “high growth”
and use them as our target variable in identifying the most relevant features that drive “high growth”. Then
we identify the next economies on the verge of experiencing this prosperity. The results are cross-referenced
against existing economic growth groupings and can be useful for governments in validating resource
allocation decisions for driving growth.
2 Data
The target variable in our regression is a binary indicator of “high-growth”, defined in Section
2.2. The World Development Indicators provide 1574 diverse indicators covering economic,
demographic, health, and infrastructure information for every country. This gives us a wealth of
information to work with and a challenge to organize.
Table 1: Data summary from World Development Indicators (World Bank)
Years       # Countries   “High-Growth”   Total Features   % Complete   # Rows
1960-2016   196           24              1574             60%          11484
2.1 Data Cleaning
With more than five decades of data, over 200 countries, and over 1500 indicators, organizing the data was
difficult, and a significant amount of time was dedicated to inverting and organizing it.
The biggest challenge in this project, however, was incomplete data: only 60% of the values in our dataset
were present. We attempted to impute the missing data in two ways: by computing the mean for
certain features, and by computing a similarity matrix between countries. However, both methods are
better suited to filling gaps in relatively complete data. In the end, we selected 18 relevant features from those
that are over 80% complete and then handpicked 10 more that were relevant according to the World Bank
Featured Indicators list, for a total of 28 features [1]. The distribution of data completeness is shown in
Appendix A.
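The completeness filter and the mean imputation described above can be sketched with pandas. This is a minimal illustration rather than the team's actual code: the toy frame, its column names, and the helper `select_complete_features` are hypothetical, and only the 80% cut-off comes from the text.

```python
import numpy as np
import pandas as pd


def select_complete_features(df: pd.DataFrame, min_complete: float = 0.8) -> list:
    """Return the columns whose share of non-missing values exceeds min_complete."""
    completeness = df.notna().mean()  # fraction of non-null rows per column
    return completeness[completeness > min_complete].index.tolist()


# Hypothetical toy data: rows are (country, year) observations, columns are indicators.
toy = pd.DataFrame(
    {
        "life_expectancy": [70.1, 65.3, 58.0, 72.4, 61.9, 68.8],
        "fisheries_tons": [1.2e6, np.nan, 3.4e5, 8.0e4, 2.1e6, 5.5e5],
        "school_enrollment": [np.nan, np.nan, 88.0, np.nan, 91.5, np.nan],
    }
)

kept = select_complete_features(toy)

# Mean imputation (one of the two approaches mentioned in the text),
# applied only to the retained, mostly complete columns.
filled = toy[kept].fillna(toy[kept].mean())
```

Here `school_enrollment` is dropped because only two of its six values are present, while the single gap in `fisheries_tons` is filled with that column's mean.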
2.2 Labelling the Data (Definition of High Growth)
Since we are considering growth, we look at the data in 10-year increments (e.g., the 2005 literacy rate
predicts high growth over 2005-2015, and the 2006 literacy rate predicts high growth over 2006-2016).
We cycled through several definitions of our target variable “High Growth”. The BRICS
countries are a somewhat ambiguous grouping of large economies, while ranking by annual GDP growth
captures tiny island nations like Nauru that do not represent economic influence. In the end, we settled on
10-year conditional growth. This takes only countries with annual GDP growth greater than the world average
for 9 out of 10 consecutive years and an annual GDP above US$10 billion (the 40th percentile threshold). This
allows flexibility for a temporary downturn and also removes small economies that do not qualify for regional
influence.
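The labelling rule can be written as a small function. The sketch below is one possible reading of the definition, with assumptions stated plainly: “9 out of 10” is interpreted as at least 9 years, the GDP threshold is checked against the first year of the window (as the abstract suggests), and the function name, arguments, and sample numbers are hypothetical.

```python
def is_high_growth(country_growth, world_growth, initial_gdp_usd,
                   window=10, min_years_above=9, gdp_threshold=10e9):
    """Label one country for one window of consecutive years.

    country_growth, world_growth: annual GDP growth rates (%) for the same
    `window` consecutive years. initial_gdp_usd: GDP in the first year.
    """
    if len(country_growth) != window or len(world_growth) != window:
        raise ValueError("expected exactly `window` years of growth data")
    # Count the years in which the country outgrew the world average.
    years_above = sum(c > w for c, w in zip(country_growth, world_growth))
    return years_above >= min_years_above and initial_gdp_usd > gdp_threshold


world = [3.0] * 10  # hypothetical world-average growth, in %

# One temporary downturn is tolerated: 9 of 10 years above the world average.
label_large = is_high_growth([5.0] * 9 + [1.0], world, initial_gdp_usd=50e9)

# A fast-growing but tiny economy is excluded by the GDP threshold.
label_small = is_high_growth([8.0] * 10, world, initial_gdp_usd=0.1e9)
```

Sliding this function over every start year for every country would yield the binary target used in the models.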
3 Methods and Results
The team used logistic regression and CART methods to identify significant features for high growth
and to predict the countries that will experience high growth between 2018 and 2024. The models were built
with the pandas, numpy, scipy, and sklearn libraries in Python.
3.1 Logistic Regression
Logistic regression was used to identify the most significant and impactful features for high growth. The team
cross-validated the model by running it on 10 random train-test splits and printing the resulting confusion
matrices. Afterwards, the classification threshold was adjusted using the ROC curve (Appendix C) to improve
the model’s true positive rate. The resulting model was then used to predict the countries that will experience
high growth between 2018 and 2024.
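The repeated-split validation can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the team's code: the data are synthetic (generated to have 28 features and roughly 88% negatives), and the 25% test size, stratified splitting, and feature scaling are choices made here for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 28-feature data set (about 88% negatives).
X, y = make_classification(n_samples=2000, n_features=28, n_informative=8,
                           weights=[0.88, 0.12], random_state=0)

total_cm = np.zeros((2, 2), dtype=int)
for seed in range(10):  # 10 random train-test splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    # Accumulate the confusion matrix over all splits.
    total_cm += confusion_matrix(y_te, model.predict(X_te), labels=[0, 1])

tn, fp, fn, tp = total_cm.ravel()
accuracy = (tp + tn) / total_cm.sum()
true_positive_rate = tp / (tp + fn)
```

Reporting accuracy alongside the true positive rate matters here because a model that rarely predicts the minority class can still score a high accuracy.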
The first model had an accuracy of 91.675%. The model is shared in Appendix B. According to the model the
most impactful features were: Total fisheries production (metric tons); Net official development assistance
and official aid received (current US$); Urban population growth (annual %); and Life expectancy at birth,
total (years). The team utilized the formula below to approximate the impact of one unit change in the
features mentioned to the probability of being high-growth:
$$\text{impact} = \left| e^{\,\text{unit of change} \times \text{coefficient}} - 1 \right|$$
The most impactful features and their impacts are visualized in Figure 1. However, the confusion matrix
showed a strong bias towards predicting not high-growth. While the accuracy was 91.675%,
the true positive rate was only 44%. This is a result of class imbalance: 88% of all data points were
labelled not high-growth. The confusion matrix is shown in Figure 2a.
To overcome this imbalance, we plotted the ROC curve (Appendix C) to choose a better threshold for
our prediction. The AUC of the ROC curve is 0.867. The second iteration of the model used a threshold
of 0.3, which increased the true positive rate from 44% to 62.7%. The resulting confusion matrix is
shown in Figure 2b.
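The effect of lowering the decision threshold can be illustrated with a few lines of numpy. The labels and predicted probabilities below are made up for the example, and `rates_at_threshold` is a hypothetical helper; only the two thresholds, 0.5 and 0.3, come from the text.

```python
import numpy as np


def rates_at_threshold(y_true, proba, threshold):
    """Return (true positive rate, false positive rate) at a given threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(proba) >= threshold
    tp = np.sum(y_pred & y_true)    # positives correctly flagged
    fp = np.sum(y_pred & ~y_true)   # negatives wrongly flagged
    return tp / y_true.sum(), fp / (~y_true).sum()


# Hypothetical labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
proba = [0.90, 0.60, 0.40, 0.20, 0.35, 0.10, 0.05, 0.25, 0.15, 0.02]

tpr_05, fpr_05 = rates_at_threshold(y_true, proba, 0.5)
tpr_03, fpr_03 = rates_at_threshold(y_true, proba, 0.3)
```

In this toy case the lower threshold catches one more true positive at the cost of one false positive, which is the same trade-off reported for the second model. With a fitted scikit-learn classifier, the probabilities would come from `predict_proba(X)[:, 1]`.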
The accuracy stayed at a similar level, while the true positive rate improved to 62.7%. The false positive
rate also increased, by 4.5%. Visualizations of sample predictions from the logistic regression models
are available in Appendix G.
Figure 1: Impact of significant features
(a) Logistic Model with 0.5 threshold (b) Logistic Model with 0.3 threshold
Figure 2: Confusion Matrices for Logistic Regression Models
3.2 CART
The CART method was chosen to further identify the features and their importance, and to cross-validate the
results obtained from the logistic regression. The team also preferred CART because it often
leads to more interpretable results. Two iterations of the CART model were built.
The first, the “unbalanced model”, applied equal weights to data classified as high-growth and
non-high-growth. The second, the “balanced model”, weighted the two classes differently to ensure they had
equal representation in the data.
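The two CART variants can be approximated in scikit-learn with `DecisionTreeClassifier`, whose `class_weight="balanced"` option reweights classes inversely to their frequency. Whether the team used exactly this option is an assumption; the synthetic data, the depth limit of 5, and the single train-test split are also choices made only for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 28-feature data set (about 88% negatives).
X, y = make_classification(n_samples=2000, n_features=28, n_informative=8,
                           weights=[0.88, 0.12], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    # Every observation carries the same weight.
    "unbalanced": DecisionTreeClassifier(max_depth=5, random_state=0),
    # Classes are reweighted so both contribute equally to the splits.
    "balanced": DecisionTreeClassifier(max_depth=5, class_weight="balanced",
                                       random_state=0),
}

recall = {}
for name, tree in models.items():
    tree.fit(X_tr, y_tr)
    # Recall of the positive class is the true positive rate.
    recall[name] = recall_score(y_te, tree.predict(X_te))

# Relative importance of each feature in the unbalanced tree (sums to 1).
importances = models["unbalanced"].feature_importances_
```

Comparing `recall["unbalanced"]` with `recall["balanced"]` shows how reweighting shifts the tree towards the minority class, and `feature_importances_` is one way to rank features for comparison with the regression coefficients.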
The results for both models were cross-validated with 10 random train-test splits. The resulting confusion
matrices for both models are plotted below. Figure 3a is for the unbalanced model and Figure 3b is
for the balanced model.
(a) Unbalanced CART Model (b) Balanced CART Model
Figure 3: Confusion Matrices for CART Models
While the unbalanced model had a true positive rate of 70%, the balanced model had a true positive rate of
88.59%. The false positive rate also increased, from 3.5% to 10.3%.
The first three levels of the unbalanced model are presented in Figure 4, and the first three levels of the
balanced model are presented in Figure 5.
Figure 4: Unbalanced CART Model
Both models tagged similar features as important: Total fisheries production (metric tons); Fertility rate,
total (births per woman); Population, female (% of total); Rural population (% of total population) or Rural
population growth (annual %). Some of these features were also identified by the logistic regression model
as statistically significant: Total fisheries production (metric tons) and Net official development assistance
and official aid received (current US$). One of the surprising findings in both CART models was Fertility
Rate, which was identified as not statistically significant by the logistic regression model but appears as a
second-level split in the CART.
Figure 5: Balanced CART Model
Our predictions from the unbalanced CART model are visualized in Figure 6 and animated at
https://goo.gl/7SbpD1. Results of the predictions from the CART models are available in Appendix H.
Figure 6: Prediction from Unbalanced CART model
4 Discussion
In this section we will compare the key features from the different models and evaluate which conclusions
make sense and which do not. We will also explore the limitations of the models and discuss features that
were thought of as critical that turned out to be insignificant.
4.1 Total Fisheries Production
Although this indicator was very significant in both the logistic regression and the CART models, it
surprised us at first. Out of all the indicators, we expected those involving urbanization, trade, debt,
and even cell phone usage to rank high. Qualitatively, however, the result makes sense: commercial fishing
signifies an industrialized economy scaling up its use of ocean resources beyond classic fishing villages,
much like an agricultural economy industrializing.
A map of global fisheries production (Appendix D) shows heavy production in the Southeast China Sea,
western South America, and Scandinavia, which relates well to their rapid development status [2].
Proximity to the world’s oceans facilitates trade, provides a natural border, and offers an abundance
of resources including fishing, oil, and alternative energy [3]. In fact, as seen in the concluding predictions,
many of our predicted high-growth countries have long coastlines.
In Figure 1, the impact of a one-unit increase in fisheries production is 51%. If country A originally had a 30%
probability of being high growth, increasing its annual fisheries production by 1M tons would push that
probability to about 45%. For a country like Vietnam, whose consistently growing fisheries produced 6M
metric tons in 2016, a 1M ton increase would be very significant: it would represent growth of about 17% in
the Vietnamese fisheries industry [4]. This is consistent with the high impact shown in the logistic regression.
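The arithmetic behind this example can be checked in a few lines. In a logistic model the exponentiated coefficient scales the odds rather than the probability, so multiplying the probability by 1.51, as done above, is an approximation that works best for small probabilities. The sketch below compares the two; the coefficient is hypothetical and chosen only so that the impact equals the 51% read from Figure 1.

```python
import math


def impact(coefficient, unit_of_change=1.0):
    """Relative change given by the report's formula |e^(unit * coef) - 1|."""
    return abs(math.exp(unit_of_change * coefficient) - 1)


coef = math.log(1.51)  # hypothetical coefficient giving a 51% impact
p0 = 0.30              # starting probability of being high growth

# Shortcut used in the text: scale the probability itself.
p_scaled = p0 * (1 + impact(coef))  # 0.453

# Exact logistic update: scale the odds, then convert back to a probability.
odds_after = p0 / (1 - p0) * math.exp(coef)
p_exact = odds_after / (1 + odds_after)  # about 0.393
```

Under the exact update the same 1M ton increase would move country A from 30% to roughly 39% rather than 45%, which is still a substantial shift.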
4.2 Net official development assistance and official aid received
This was one of the features our team most anticipated, as foreign aid ties directly to stimulating
development and thus economic activity. It was a high-impact feature in the logistic regression and
a smaller variable in the bottom portion of the CART tree. Looking at the OECD Development Assistance
Committee’s list of aid recipients for 2014-2017, many of the low- and middle-income countries qualify as
high growth in our model, while the “least developed countries” do not [5]. Thus, being in the upper
tail of development assistance, like Ethiopia and Pakistan, is a good indicator of future high growth, as
these are economies that the “West” has invested in to stimulate growth. Being in the lower tail,
like Syria, is not, because aid there has mostly gone towards resolving humanitarian, war, and health crises
[6]. Ultimately, development aid is a good binary indicator of high growth but not a good numerical indicator,
because aid amounts vary with current events and assistance objectives.
4.3 Urban population growth
This was another highly anticipated feature, ranking high in the logistic regression and appearing at the 3rd
level of the CART tree. According to the World Bank’s Commission of Urbanization and Growth, the agglomeration
economy has been a major cause of growth in the last three decades, especially in the case of China’s tripling
urban population percentage [7]. Although this is probably one of the most direct drivers of economic growth
in our list, further research suggests that some types of urbanization work better than others. Turok and
McGranahan’s 2013 journal article suggests that removing barriers to rural-urban movement and having the
right supportive market policies are key to enabling urban economic growth [8].
4.4 Life expectancy at birth
Life expectancy is significant in both our logistic and CART models. In the CART, it sets a cap of around
69 years, above which a country belongs to the developed world and is thus no longer high growth. Since life
expectancy is such a close proxy for economic development (low indicates humanitarian crises or war,
middle indicates developing, and high indicates developed), it is a good indicator for the model to use in
filtering out the lower and upper tails.
4.5 Fertility Rate
Fertility rate is interesting because it is not significant in the logistic model, yet it is the 2nd-level feature
in the CART tree. Fertility rate reflects a country’s healthcare and career opportunities. In an
agricultural system with high rates of disease and little opportunity, a family will want many children to ensure
that a few are successful. As both factors improve, the number of children per household decreases. We
therefore consider it a significant feature that represents a change in quality of life.
4.6 Features with negative coefficients
Several significant indicators returned negative coefficients, meaning that an increase in the indicator
lowers the likelihood of high growth. For example, the negative logistic regression coefficient for
adolescent fertility rate is sensible: adolescent pregnancies can indicate poor sexual education
or a low age of maturity, and starting families young can be a sign of a large rural, agriculture-based
population. On the other hand, the negative coefficient on arable land makes less sense: it means that the more
arable land a country has (% of land mass), the lower its likelihood of high growth. Perhaps a lack of arable
land forces a nation to industrialize more quickly and rely less on agriculture.
4.7 Insignificant features
There were a few indicators that we hypothesized to be very significant that did not do well in our logistic
regression. Education (school enrollment %) and trade were insignificant, although the literature and research
treat them as important. Many intergovernmental organizations such as the WEF and the World Bank have run
campaigns around education, and primary completion is a key component of the Sustainable Development
Goals [9]. We have a few hypotheses for the discrepancy. First, one of our largest challenges was missing data,
and education completion rates were especially incomplete, which could explain why some indicators
were worse predictors. Second, school completion does not directly relate to economic activity. In the short term,
education participation is a result of government policy and can be a competing priority for governments in
allocating budget. A case in point is Cuba, where healthcare and education coverage are almost 100% because
they are the main focus of the Communist government, but trade sanctions and collective ownership have stifled
economic opportunity in the traditional sense.
5 Conclusion
In summary, our project produced four models: two iterations each of logistic regression and CART.
The logistic regression coefficients were especially useful in prioritizing impactful indicators, and lowering
our threshold to identify more positives was a necessary adjustment while tuning the model. The
interpretability of the CART tree was helpful in grasping how features interacted in the first few levels,
but as the tree got taller and variables showed up on multiple levels, we lost interpretability. Sample
prediction maps of all four models are in Appendices G and H. Overall, the unbalanced CART tree produced the
best accuracy score and best matched economists’ conclusions and current economic growth groupings. Using
this tree, we predicted the high-growth economies for 2018-2024 in Appendix H.1. We can see that parts of
Latin America and Southeast Asia appear consistently, representing great investment opportunities. Africa also
shows up frequently but not consistently, which is in line with the political instability in the region.
The predictions from the unbalanced CART tree show on average 31% of countries per year experiencing high
growth between 2018 and 2024. Comparing our predictions to known economic development groupings, our
model’s high-growth predictions matched 8/11 of the NEXT11 countries, 3/4 of the MINT countries,
and 11/15 of the EAGLES emerging growth countries [10][11][12]. Nigeria, Turkey, and Iran are the standout
countries that are consistently in these groupings but were not highlighted in our predictions.
While the models showed positive results, there are still outstanding issues and limitations with our process:
• Involving expert opinion - our lack of economic expertise meant we had to rely heavily on mathematical
and technical methods. For example, it would have been better to begin with a stronger hypothesis and
conduct feature selection based on expertise rather than on the availability of data or feature selection
algorithms.
• Better data - our World Bank dataset was only 60% complete, which meant we had to impute
certain missing data and also eliminate indicators based on incomplete census data. Supplementing
the data with additional datasets and computing the similarities between countries through tensor
factorization techniques such as CANDECOMP/PARAFAC [13] could have produced more complete results.
Due to these limitations, the team would not be confident in basing any detailed government resource
allocation decisions on our results. However, the exercise did a good job of showing which fields were
important drivers of growth and achieved our primary goal of predicting future high-growth countries.
Given a revised model with improved data and expert hypotheses, resource allocation optimization is a
logical next step. Based on a country’s growth target and its available resources, an optimization model can
be developed to identify how to distribute the available capital, human, and natural resources to
achieve high growth. The team is excited about the results and learnings from this project and looks forward
to future opportunities to implement and revise these results in the global development space.
References
[1] The World Bank. World Bank indicators, [Online]. Available: https://data.worldbank.org/indicator.
[2] F. Carré. Ressources menacées de l’océan mondial, [Online]. Available: https://www.monde-diplomatique.fr/publications/l_atlas_geopolitique/a53308.
[3] The World Bank. Oceans, fisheries and coastal economies, [Online]. Available: http://www.worldbank.org/en/topic/environment/brief/oceans.
[4] The World Bank. Total fisheries production (metric tons), [Online]. Available: https://data.worldbank.org/indicator/ER.FSH.PROD.MT.
[5] OECD. DAC list of ODA recipients, [Online]. Available: http://www.oecd.org/dac/financing-sustainable-development/development-finance-standards/DAC_List_ODA_Recipients2014to2017_flows_En.pdf.
[6] The World Bank. Net official development assistance received, [Online]. Available: https://data.worldbank.org/indicator/DT.ODA.ODAT.CD?year_high_desc=true.
[7] The World Bank. Urbanization and growth, [Online]. Available: https://siteresources.worldbank.org/EXTPREMNET/Resources/489960-1338997241035/Growth_Commission_Vol1_Urbanization_Growth.pdf.
[8] I. Turok and G. McGranahan. Urbanization and economic growth: The arguments and evidence for Africa and Asia, [Online]. Available: http://journals.sagepub.com/doi/full/10.1177/0956247813490908.
[9] United Nations. Sustainable development goal 4, [Online]. Available: https://sustainabledevelopment.un.org/sdg4.
[10] Investopedia. Eagles, [Online]. Available: https://www.investopedia.com/terms/e/eagles.asp.
[11] BBC. The MINT countries: Next economic giants?, [Online]. Available: http://www.bbc.com/news/magazine-25548060.
[12] Goldman Sachs. Beyond the BRICs: A look at the next 11, [Online]. Available: http://www.goldmansachs.com/our-thinking/archive/archive-pdfs/brics-book/brics-chap-13.pdf.
[13] E. Acar, D. M. Dunlavy, and T. G. Kolda. Fitting a tensor decomposition is a nonlinear optimization problem, [Online]. Available: http://www.cs.cornell.edu/cv/tenwork/Slides/Kolda.pdf.
A Histogram Plot of Feature Completeness
Figure 7: Histogram of feature completeness
C ROC for Logistic Regression Model
Figure 9: ROC for Logistic Regression Model
D Global Fisheries Production
Figure 10: Global fisheries production [3]
E Full Unbalanced CART Model
Figure 11: Full Unbalanced CART Model
F Full Balanced CART Model
Figure 12: Full Balanced CART Model
G Predictions from Logistic Regression Models
G.1 Predictions from Logistic Regression Model with 0.5 threshold
Figure 13: Prediction from Logistic Regression model with 0.5 threshold
G.2 Predictions from Logistic Regression Model with 0.3 threshold
Figure 14: Prediction from Logistic Regression model with 0.3 threshold
H Predictions from CART Models
H.1 Predictions from unbalanced CART model
An animated version of the predictions is available here: https://goo.gl/7SbpD1
Figure 15: Prediction from Unbalanced CART model
H.2 Predictions from balanced CART model
Figure 16: Prediction from Balanced CART model