In this project, I have used various big data and analytics technologies, including SQL, Excel, Apache Pig, Apache Hive, Tableau, and H2O, to analyse datasets and explore the relationship between the number of students studying abroad from a country and a number of factors describing that country.
Big data technologies were used to transform the data and gain insights, and effective visualisations were created using Tableau. Predictive analysis was also performed using various machine learning algorithms in H2O, so that the resulting models can be used in the future to predict the number of students studying abroad from all or some of the given factors.
1. ABSTRACT
As we all know, the number of students studying abroad is increasing every year on a global scale. This flow of students across different cultures contributes to the growth of the world economy. For countries like the USA, which are considered hotspots of foreign education, foreign students also bring a major financial advantage. As a project for my course CS-GY 9223 Big Data Analytics, I decided to identify a few factors which might affect the number of students studying abroad from a country, and then, using various technologies taught throughout this course, to gain insights into the datasets obtained.
2. INTRODUCTION
In this project I have obtained data on the number of students studying abroad, the gross domestic product (GDP) of various countries, government expenditure on education, the rate of youth unemployment within each country, and the number of internet users within the country as a percentage of the total population.
I considered these factors to be influential in the number of students going abroad for education. The GDP is an economic indicator which shows us the total monetary value of all the goods and services produced within a country in a given time frame, and it can be useful in determining the economic health of a country. The second factor I chose is the country's expenditure on education. This can be used to determine whether the government is devoting enough resources to education and its development; naturally, if the quality of education is poor, we should expect a greater number of students to study abroad. These values are represented as expenditure on education as a percentage of total government expenditure. The third indicator I chose is the unemployment rate amongst the youth of that country, i.e. the population aged 15-24. I felt that this factor was also important, as it helps describe whether the youth, which primarily consists of students, are able to obtain jobs in their own country or have to search for better opportunities abroad, which can be a motivation for studying abroad. Finally, I have also used the dataset for internet users as a percentage of the population, because I feel that the greater the share of the population with access to the internet, the more aware the population will be of opportunities abroad, and hence the greater the chances of students studying abroad. I obtained all four datasets from the United Nations Data portal.
3. DATA SOURCES
The links from which the datasets were obtained are as follows:
o Dataset for students studying abroad from a country: http://data.un.org/Data.aspx?q=student&d=UNESCO&f=series%3aED_FSOABS
o Dataset for GDP of a country: http://data.un.org/Data.aspx?q=GDP&d=WDI&f=Indicator_Code%3aNY.GDP.MKTP.CD
o Dataset for Youth unemployment rates (ages 15-24): http://data.un.org/Data.aspx?q=unemployment&d=MDG&f=seriesRowID%3a630
o Dataset for Expenditure by Government on Education in home country: http://data.un.org/Data.aspx?d=UNESCO&f=series%3aXGDP_FSGOV
o Dataset for Percentage of Internet Users in home country: http://data.un.org/Data.aspx?d=ITU&f=ind1Code%3aI99H
4. SQL PRE-PROCESSING
All the datasets that I downloaded were in .csv format, using a comma delimiter. First, the data was loaded into SQL, where I processed it to ensure that data integrity was maintained. In this pre-processing stage, I used SQL and Excel to clean the data and transform it into a suitable format.
To do this, I created the relevant tables in SQL, with the appropriate data type for each column of the .csv files. While pre-processing the data, I observed that the country name column was creating problems: some country names contain commas, which were also being used as the delimiter, causing confusion when loading the data. Whenever a column was split incorrectly, SQL raised an error, as a value of the wrong data type would be placed in the next column. Some files also had comments and footnote values appended at the end, creating rows of unequal width. Using the errors reported by SQL, I corrected the data and then reloaded it. Once the data was successfully loaded into SQL, it was ready to be loaded into other applications like Pig and Hive. For the table of GDP of a country, I noticed that loading the data into Hadoop technologies like Pig and Hive was problematic, as the values were of extremely large magnitudes and were not always detected correctly. As a workaround, I first loaded the data into SQL and created the ID column, which is explained below. Once the new column was created and populated, I exported the new table and used it in Hive along with the other datasets.
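As an illustration of this kind of cleaning, here is a minimal sketch in Python with pandas. The file name students_raw.csv, the column names, and the footnote handling are assumptions for illustration, not the exact contents of the UN files:

    import pandas as pd

    # Read the raw UN .csv; pandas honours quoted fields, so country
    # names such as "Korea, Republic of" are not split on their commas.
    raw = pd.read_csv("students_raw.csv", quotechar='"', skip_blank_lines=True)

    # Drop trailing comment/footnote rows, which have no numeric year.
    raw = raw[pd.to_numeric(raw["Year"], errors="coerce").notna()]

    # Enforce the column types the SQL tables expect.
    raw["Year"] = raw["Year"].astype(int)
    raw["Value"] = pd.to_numeric(raw["Value"], errors="coerce")

    raw.to_csv("students_clean.csv", index=False)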
Once the datasets were pre-processed and cleaned as described in the previous steps, the data was loaded into HDFS using the Hue UI. With all the datasets in HDFS, the data was then processed in Hive. Here I had to create a key within each table so that individual records could be matched and identified uniquely. To achieve this, I created a new column called ID, derived from two pre-existing columns, Country Name and Year. By concatenating the two fields, I created a new column that is unique for each record. The benefit of this is that whenever we are required to perform joins, we now have a unique column to reference.
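The actual key was built in SQL and Hive; purely as an illustration, here is a minimal sketch of the same idea in Python with pandas, with hypothetical file and column names:

    import pandas as pd

    students = pd.read_csv("students_clean.csv")  # hypothetical cleaned files
    gdp = pd.read_csv("gdp_clean.csv")

    # Derive a unique key by concatenating country name and year,
    # mirroring the ID column created in SQL.
    for df in (students, gdp):
        df["ID"] = df["Country"].str.strip() + "_" + df["Year"].astype(str)

    # With a unique key in every table, joins become straightforward.
    merged = students.merge(gdp, on="ID", suffixes=("_students", "_gdp"))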
The data for the other countries can be seen to fall within a single range, indicating that these two countries contribute a heavy majority of the students studying abroad throughout the world. We also see that the countries for which data isn't available have been greyed out. The total number of students studying abroad across all years has been color coded, and can be decoded using the key given above.
The next visualization presents the number of students studying abroad compared with the GDP of that country for that particular year. This visualization has been color coded according to the number of students studying abroad in the year for which the GDP has been plotted. Color coding this figure is especially important, as there are a lot of plot points at the start of the figure, which can cause confusion; with the color coding, we can identify the variation amongst those points by the change in color.
The above visualization of GDP and students shows that, for a majority of the countries and plot points, the graph can be summarized using a polynomial trend line of order three. However, countries like India and China, which have an abnormally high number of students studying abroad, do not fall on this trend line and create an anomaly.
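Tableau's trend line of order three corresponds to a least-squares fit of a degree-3 polynomial. As an illustrative sketch of the same fit in Python, assuming a hypothetical merged file gdp_students.csv with the column names shown:

    import numpy as np
    import pandas as pd

    # Hypothetical merged table with GDP and student counts per country-year.
    data = pd.read_csv("gdp_students.csv")
    x = data["GDP"].to_numpy(dtype=float)
    y = data["Students_Abroad"].to_numpy(dtype=float)

    # Least-squares fit of a degree-3 polynomial, matching Tableau's
    # polynomial trend line of order three.
    coeffs = np.polyfit(x, y, deg=3)
    trend = np.poly1d(coeffs)

    print(trend(x[:5]))  # trend values for the first few GDP points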
The next visualization represents the variation between the number of internet users in the population of a country and the number of students going abroad from that country. Yet again, the number of students has been color coded, to ensure that we can distinguish the different levels of students studying abroad at closely located levels of internet usage across countries.
Again, a polynomial trend line of degree three has been used to estimate the data. However, the data that falls outside the trend line belongs to countries in which the number of students studying abroad is abnormally high, like India and China. These countries can be identified as the high orange-colored peaks in the above figure.
The following visualization, showing the number of students studying abroad against the rate of unemployment amongst youth aged 15-24 in a country, follows a similar pattern to the last graph of internet users vs. students. In this plot too, we have estimated the data points using a polynomial trend line; however, the exceptions for countries with extremely high numbers of students abroad are again present.
7. PREDICTIVE ANALYTICS IN H2O
Once the data had been analyzed using the visualizations created previously, predictive analytics was performed on it, so that we can predict and emulate various scenarios which might affect the number of students studying abroad.
For this purpose, the software H2O has been used, which can perform predictive analytics on the local machine or on top of R, Tableau, or Hadoop. Two different models have been used while preparing the analysis, from which the model whose predictions had the minimum error was selected. Two different datasets were also used, in order to improve the efficiency of the models and identify the most relevant factors, with which the best results were obtained.
The first dataset used had all four factors that have been previously discussed, that is, unemployment, educational expenditure, internet users, and GDP. In the second dataset, the column for unemployment has been omitted. This was done because, for a large number of countries, the unemployment percentage wasn't available, and trying this dataset against the same models might give a better result due to the greater coverage of the dataset. At the same time, however, there is a trade-off between a greater number of records and an increased number of factors which can affect the result.
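As a small illustrative sketch of preparing the two variants (the column name is an assumption; all_fields.csv is the file used later in H2O, while three_fields.csv is an illustrative name, not one used in the report):

    import pandas as pd

    # all_fields.csv holds all four factors per country-year.
    all_fields = pd.read_csv("all_fields.csv")

    # Second variant: drop the sparsely populated unemployment column
    # and keep the remaining complete rows.
    three_fields = all_fields.drop(columns=["Unemployment"]).dropna()
    three_fields.to_csv("three_fields.csv", index=False)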
The two models which have been used are the Gradient Boosting Model and the Deep Learning Model. Both models can be used for regression, and both report the importance of the variables which we specify should be tested for predicting the values of the target variable.
A Gradient Boosting Machine (GBM) is an ensemble of tree models (either regression or classification trees). Both are forward-learning ensemble methods that obtain predictive results through gradually improved estimates. Boosting is a flexible nonlinear regression procedure that helps improve the accuracy of trees: by sequentially applying weak classification algorithms to incrementally changed data, a series of decision trees is created, producing an ensemble of weak prediction models.
GBM is among the most accurate general-purpose algorithms. It can be applied to numerous types of problems and will usually produce relatively accurate results. Additionally, Gradient Boosting Machines are extremely robust, meaning that the user does not have to impute values or scale the data (the algorithm can disregard the distribution). This makes GBM the go-to choice for many users, as little tweaking is required in order to get accurate results.
In the figures below, the Gradient Boosting Model has been applied to the dataset containing all four fields. First, the data from the file all_fields.csv was loaded into H2O as a data frame. This frame was then split 25:75 in order to create a validation frame, which ensures that the model has converged. While specifying the model parameters, the value of n-folds was set to 8, which determines the number of folds used for cross-validation.
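The equivalent steps in H2O's Python API would look roughly like the sketch below. The report's models were built through H2O's own interface, so this is only an illustration; the predictor and response column names are assumptions:

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()

    # Load the dataset with all four factors as an H2O frame.
    frame = h2o.import_file("all_fields.csv")

    # 75:25 split to create a validation frame.
    train, valid = frame.split_frame(ratios=[0.75], seed=42)

    predictors = ["GDP", "Edu_Expenditure", "Internet_Users", "Unemployment"]
    response = "Students_Abroad"

    # nfolds=8 folds for cross-validation, matching the report.
    gbm = H2OGradientBoostingEstimator(nfolds=8, seed=42)
    gbm.train(x=predictors, y=response,
              training_frame=train, validation_frame=valid)

    print(gbm.varimp(use_pandas=True))  # variable importances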
The second model created for the dataset containing all the fields was Deep Learning. Deep Learning is another popular family of models under active development. Its algorithms are based on distributed representations, with the underlying assumption that the observed data are generated by the interactions of factors organized in layers.
Deep Learning with H2O features automatic adaptive weight initialization, automatic data
standardization, expansion of categorical data, automatic handling of missing values,
automatic adaptive learning rates, various regularization techniques, automatic performance
tuning, load balancing, grid-search, N-fold cross-validation, checkpointing and different
distributed training modes on clusters for large data sets. The technology does not require
complicated configuration files and H2O Deep Learning is highly optimized for maximum
performance.
As in the last model, we have used the same learning frame and validation frame, and the n-folds value has been kept the same. The response column has again been set to the students column, and the columns to be ignored have been selected. The option to report the importance of the specified variables has also been enabled, to see how the two models differ in the importance they assign to the various variables.
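A corresponding sketch in H2O's Python API, continuing from the GBM sketch above and reusing its frames, predictor list, and response column (all assumptions):

    from h2o.estimators import H2ODeepLearningEstimator

    # Same training/validation frames and nfolds as the GBM run;
    # variable_importances=True mirrors the option enabled here.
    dl = H2ODeepLearningEstimator(nfolds=8, variable_importances=True, seed=42)
    dl.train(x=predictors, y=response,
             training_frame=train, validation_frame=valid)

    print(dl.varimp(use_pandas=True))  # compare with the GBM importances

Comparing the two variable-importance tables then shows how differently the two models weight the same factors.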
[Figure: comparison of real vs. predicted values from the Gradient Boosted Learning Model for sample records (Brazil, Albania, Denmark, South Korea, India, Malaysia); the vertical axis shows the number of students, from 0 to 200,000.]