In this project, I have used various big data and analytics technologies, including SQL, Excel, Apache Pig, Apache Hive, Tableau, and H2O, to analyse datasets and explore the relationship between the number of students studying abroad from a country and a number of factors describing that country.
Big data technologies were used to transform the data and gain insights, and effective visualisations were created using Tableau. Predictive analysis was also performed using various machine learning algorithms in H2O, so that the resulting models can be used in the future to predict the number of students studying abroad from all or some of the given factors.
1. ABSTRACT
As we all know, the number of students studying abroad is increasing every year on a global scale. This flow of students across different cultures contributes to the growth of the world economy. For countries like the USA, which are considered hotspots of foreign education, foreign students also bring a major financial advantage. As a project for my course CS-GY 9223 Big Data Analytics, I decided to identify a few factors which might affect the number of students studying abroad from a country, and then, using various technologies taught throughout this course, to gain insights into the datasets obtained.
2. INTRODUCTION
In this project I have obtained data on the number of students studying abroad, the gross domestic product (GDP) of various countries, government expenditure on education, the rate of youth unemployment within each country, and the number of internet users within the country as a percentage of the total population.
I considered these factors to be influential in the number of students going abroad for education. The GDP is an economic indicator which shows us the total monetary value of all the goods and services produced within a country in a given time frame, and it can be useful in determining the economic health of a country. The second factor I chose is the country's expenditure on education. This can be used to determine whether the government is devoting enough resources to education and its development; naturally, if the quality of education is poor, we should expect a greater number of students to study abroad. These values are represented as expenditure on education as a percentage of total government expenditure. The third indicator I chose is the unemployment rate amongst the youth of that country, i.e. the population aged 15-24. I felt that this factor was also important, as it helps describe whether the youth, which primarily consists of students, are able to obtain jobs in their own country or have to search for better opportunities abroad, which can be a motivation for studying abroad. Finally, I have also used the dataset for internet users as a percentage of the population, because I feel that the greater the share of the population with access to the internet, the more aware the population will be of opportunities abroad, and hence the greater the chances of students studying abroad. I obtained all four datasets from the United Nations Data portal.
3. DATA SOURCES
The links from which the datasets were obtained are as follows:
o Dataset for students studying abroad from a country: http://data.un.org/Data.aspx?q=student&d=UNESCO&f=series%3aED_FSOABS
o Dataset for GDP of a country: http://data.un.org/Data.aspx?q=GDP&d=WDI&f=Indicator_Code%3aNY.GDP.MKTP.CD
o Dataset for Youth unemployment rates (ages 15-24): http://data.un.org/Data.aspx?q=unemployment&d=MDG&f=seriesRowID%3a630
o Dataset for Expenditure by Government on Education in home country: http://data.un.org/Data.aspx?d=UNESCO&f=series%3aXGDP_FSGOV
o Dataset for Percentage of Internet Users in home country: http://data.un.org/Data.aspx?d=ITU&f=ind1Code%3aI99H
4. SQL PRE-PROCESSING
All the datasets that I downloaded were in .csv format, using a comma delimiter. First, the data was loaded into SQL, where I processed it to ensure that data integrity was maintained. In this pre-processing stage, I used SQL and Excel to clean the data and transform it into a suitable format.
To do this, I created the relevant tables in SQL, with the appropriate data type for each column of the .csv files. While pre-processing the data, I observed that the country name column was creating problems: some country names contain commas, which were also being used as the delimiter, causing confusion when loading the data. Whenever a column was split incorrectly, SQL raised an error, as a value of the wrong data type would be placed in the next column. Some files also had comments and footnote values appended at the end, creating rows of unequal width. Using the errors reported by SQL, I corrected the data and then reloaded it. Once the data was successfully loaded into SQL, it was ready to be loaded into other applications like Pig and Hive. For the table of GDP of a country, I noticed that loading the data into Hadoop technologies like Pig and Hive was problematic, as the values were of extremely large magnitudes and were not always detected correctly. As a workaround, I first loaded the data into SQL and created the ID column, which is explained below. Once the new column was created and populated, I exported the new table and used it in Hive along with the other datasets.
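As an illustration of this kind of cleaning, here is a minimal sketch in Python with pandas. The file name students_raw.csv, the column names, and the footnote handling are assumptions for illustration, not the exact contents of the UN files:

    import pandas as pd

    # Read the raw UN .csv; pandas honours quoted fields, so country
    # names such as "Korea, Republic of" are not split on their commas.
    raw = pd.read_csv("students_raw.csv", quotechar='"', skip_blank_lines=True)

    # Drop trailing comment/footnote rows, which have no numeric year.
    raw = raw[pd.to_numeric(raw["Year"], errors="coerce").notna()]

    # Enforce the column types the SQL tables expect.
    raw["Year"] = raw["Year"].astype(int)
    raw["Value"] = pd.to_numeric(raw["Value"], errors="coerce")

    raw.to_csv("students_clean.csv", index=False)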
Once the datasets were pre-processed and cleaned as described in the previous steps, the data was loaded into HDFS using the Hue UI. With all the datasets in HDFS, the data was then processed in Hive. Here I had to create a key within each table so that individual records could be matched and identified uniquely. To achieve this, I created a new column called ID, derived from two pre-existing columns, Country Name and Year. By concatenating the two fields, I created a new column that is unique for each record. The benefit of this is that whenever we are required to perform joins, we now have a unique column to reference.
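The actual key was built in SQL and Hive; purely as an illustration, here is a minimal sketch of the same idea in Python with pandas, with hypothetical file and column names:

    import pandas as pd

    students = pd.read_csv("students_clean.csv")  # hypothetical cleaned files
    gdp = pd.read_csv("gdp_clean.csv")

    # Derive a unique key by concatenating country name and year,
    # mirroring the ID column created in SQL.
    for df in (students, gdp):
        df["ID"] = df["Country"].str.strip() + "_" + df["Year"].astype(str)

    # With a unique key in every table, joins become straightforward.
    merged = students.merge(gdp, on="ID", suffixes=("_students", "_gdp"))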
The data for the other countries can be seen to fall within a single range, indicating that these two countries contribute a heavy majority of the students studying abroad throughout the world. We also see that the countries for which data isn't available have been greyed out. The total number of students studying abroad across all years has been color coded, and can be decoded using the key given above.
The next visualization presents the number of students studying abroad compared with the GDP of that country for that particular year. This visualization has been color coded according to the number of students studying abroad in the year for which the GDP has been plotted. Color coding this figure is especially important, as there are a lot of plot points at the start of the figure, which can cause confusion; with the color coding, we can identify the variation amongst those points by the change in color.
The above visualization of GDP and students shows that, for a majority of the countries and plot points, the graph can be summarized using a polynomial trend line of order three. However, countries like India and China, which have an abnormally high number of students studying abroad, do not fall on this trend line and create an anomaly.
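Tableau's trend line of order three corresponds to a least-squares fit of a degree-3 polynomial. As an illustrative sketch of the same fit in Python, assuming a hypothetical merged file gdp_students.csv with the column names shown:

    import numpy as np
    import pandas as pd

    # Hypothetical merged table with GDP and student counts per country-year.
    data = pd.read_csv("gdp_students.csv")
    x = data["GDP"].to_numpy(dtype=float)
    y = data["Students_Abroad"].to_numpy(dtype=float)

    # Least-squares fit of a degree-3 polynomial, matching Tableau's
    # polynomial trend line of order three.
    coeffs = np.polyfit(x, y, deg=3)
    trend = np.poly1d(coeffs)

    print(trend(x[:5]))  # trend values for the first few GDP points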
The next visualization represents the variation between the number of internet users in the population of a country and the number of students going abroad from that country. Yet again, the number of students has been color coded, to ensure that we can distinguish the different levels of students studying abroad at closely located levels of internet usage across countries.
Again, a polynomial trend line of degree three has been used to estimate the data. However, the data that falls outside the trend line belongs to countries in which the number of students studying abroad is abnormally high, like India and China. These countries can be identified as the high orange-colored peaks in the above figure.
The following visualization, showing the number of students studying abroad against the rate of unemployment amongst youth aged 15-24 in a country, follows a similar pattern to the last graph of internet users vs. students. In this plot too, we have estimated the data points using a polynomial trend line; however, the exceptions for countries with extremely high numbers of students abroad are again present.
7. PREDICTIVE ANALYTICS IN H2O
Once the data had been analyzed using the visualizations created previously, predictive analytics was performed on it, so that we can predict and emulate various scenarios which might affect the number of students studying abroad.
For this purpose, the software H2O has been used, which can perform predictive analytics on the local machine or on top of R, Tableau, or Hadoop. Two different models have been used while preparing the analysis, from which the model whose predictions had the minimum error was selected. Two different datasets were also used, in order to improve the efficiency of the models and identify the most relevant factors, with which the best results were obtained.
The first dataset used had all four factors that have been previously discussed, that is, unemployment, educational expenditure, internet users, and GDP. In the second dataset, the column for unemployment has been omitted. This was done because, for a large number of countries, the unemployment percentage wasn't available, and trying this dataset against the same models might give a better result due to the greater coverage of the dataset. At the same time, however, there is a trade-off between a greater number of records and an increased number of factors which can affect the result.
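As a small illustrative sketch of preparing the two variants (the column name is an assumption; all_fields.csv is the file used later in H2O, while three_fields.csv is an illustrative name, not one used in the report):

    import pandas as pd

    # all_fields.csv holds all four factors per country-year.
    all_fields = pd.read_csv("all_fields.csv")

    # Second variant: drop the sparsely populated unemployment column
    # and keep the remaining complete rows.
    three_fields = all_fields.drop(columns=["Unemployment"]).dropna()
    three_fields.to_csv("three_fields.csv", index=False)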
The two models which have been used are the Gradient Boosting Model and the Deep Learning Model. Both models can be used for regression, and both report the importance of the variables which we specify should be tested for predicting the values of the target variable.
A Gradient Boosting Machine (GBM) is an ensemble of tree models (either regression or classification trees). Both are forward-learning ensemble methods that obtain predictive results through gradually improved estimates. Boosting is a flexible nonlinear regression procedure that helps improve the accuracy of trees: by sequentially applying weak classification algorithms to incrementally changed data, a series of decision trees is created, producing an ensemble of weak prediction models.
GBM is among the most accurate general-purpose algorithms. It can be applied to numerous types of problems and will usually produce relatively accurate results. Additionally, Gradient Boosting Machines are extremely robust, meaning that the user does not have to impute values or scale the data (the algorithm can disregard the distribution). This makes GBM the go-to choice for many users, as little tweaking is required in order to get accurate results.
In the figures below, the Gradient Boosting Model has been applied to the dataset containing all four fields. First, the data from the file all_fields.csv was loaded into H2O as a data frame. This frame was then split 25:75 in order to create a validation frame, which ensures that the model has converged. While specifying the model parameters, the value of n-folds was set to 8, which determines the number of folds used for cross-validation.
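The equivalent steps in H2O's Python API would look roughly like the sketch below. The report's models were built through H2O's own interface, so this is only an illustration; the predictor and response column names are assumptions:

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()

    # Load the dataset with all four factors as an H2O frame.
    frame = h2o.import_file("all_fields.csv")

    # 75:25 split to create a validation frame.
    train, valid = frame.split_frame(ratios=[0.75], seed=42)

    predictors = ["GDP", "Edu_Expenditure", "Internet_Users", "Unemployment"]
    response = "Students_Abroad"

    # nfolds=8 folds for cross-validation, matching the report.
    gbm = H2OGradientBoostingEstimator(nfolds=8, seed=42)
    gbm.train(x=predictors, y=response,
              training_frame=train, validation_frame=valid)

    print(gbm.varimp(use_pandas=True))  # variable importances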
The second model created for the dataset containing all the fields was Deep Learning. Deep Learning is another popular family of models under active development. Its algorithms are based on distributed representations, with the underlying assumption that the observed data are generated by the interactions of factors organized in layers.
Deep Learning with H2O features automatic adaptive weight initialization, automatic data
standardization, expansion of categorical data, automatic handling of missing values,
automatic adaptive learning rates, various regularization techniques, automatic performance
tuning, load balancing, grid-search, N-fold cross-validation, checkpointing and different
distributed training modes on clusters for large data sets. The technology does not require
complicated configuration files and H2O Deep Learning is highly optimized for maximum
performance.
As in the last model, we have used the same learning frame and validation frame, and the n-folds value has been kept the same. The response column has again been set to the students column, and the columns to be ignored have been selected. The option to report the importance of the specified variables has also been enabled, to see how the two models differ in the importance they assign to the various variables.
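A corresponding sketch in H2O's Python API, continuing from the GBM sketch above and reusing its frames, predictor list, and response column (all assumptions):

    from h2o.estimators import H2ODeepLearningEstimator

    # Same training/validation frames and nfolds as the GBM run;
    # variable_importances=True mirrors the option enabled here.
    dl = H2ODeepLearningEstimator(nfolds=8, variable_importances=True, seed=42)
    dl.train(x=predictors, y=response,
             training_frame=train, validation_frame=valid)

    print(dl.varimp(use_pandas=True))  # compare with the GBM importances

Comparing the two variable-importance tables then shows how differently the two models weight the same factors.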
[Figure: comparison of real vs. predicted values from the Gradient Boosted Learning Model for sample records (Brazil, Albania, Denmark, South Korea, India, Malaysia); the vertical axis shows the number of students, from 0 to 200,000.]