This project explored trends in inflows of foreign population to the U.S., measured as the number of green card recipients from 2000 to 2016, and identified an appropriate predictive method to forecast future figures.
1. United States Immigration Analysis
from 2000-2016
ITEC-620
Fall 18
Group 8: Aneeshinder Antal, Anh Do,
Ann-Sophie Kouadio IV, Jennifer Wong
2. Agenda
Identifying the Problem
Collecting Data
Cleaning Data
Analyzing Data
Descriptive Analytics
Predictive Analytics
Report Results
3. Identifying the Problem
1. Descriptive Analytics
• Countries with similar trends and patterns
• Is there enough diversity?
2. Predictive Analytics
• Predict top countries’ inflow of immigration for 2017
4. Data Collection
Source: Organization for Economic Co-operation and Development (OECD)
https://stats.oecd.org/Index.aspx?DataSetCode=MIG
The original dataset: 422,276 rows, 8 variables
6. Data Cleaning – Missing Values
1. Completely at random
• < 100 immigrants
• Political reasons
o North Korea
o Congo
2. At random
• Double counting for countries with “Former” in their names
o USSR
o Yugoslavia
o Czechoslovakia
• Serbia & Montenegro
7. K-Means Clustering
Cluster by groups of countries with similar migration trends and patterns over the years
Treated each year as a variable, 17 total variables
10 clusters
Resulting cluster centroids = avg. inflow # for the countries in that specific cluster for each year
13. Models Comparison
Country | Regression RMSE | SES RMSE | DES RMSE
Mexico | 22,661 | 28,247 | 18,341
China | 16,476 | 5,630 | 9,148
Philippines | 14,477 | 3,667 | 4,394
India | 6,054 | 6,270 | 9,206
Cuba | 10,486 | 12,900 | 13,476
Dominican Republic | 10,336 | 6,597 | 8,821
Vietnam | 4,931 | 5,477 | 5,907
El Salvador | 5,081 | 2,461 | 1,918
Haiti | 6,013 | 3,975 | 4,301
Jamaica | 3,779 | 2,479 | 2,922
Korea | 7,559 | 2,875 | 3,493
Colombia | 15,329 | 1,679 | 3,347
14. How Accurate Were the Predictions?
Country | Our Prediction | Actual [1] | Variance
Mexico | 172,716 | 170,581 | 2,135
China | 80,046 | 71,565 | 8,481
Philippines | 53,722 | 49,147 | 4,575
India | 66,628 | 60,394 | 6,234
Cuba | 55,054 | 65,028 | (9,974)
Dominican Republic | 59,385 | 58,520 | 865
Vietnam | 32,560 | 38,231 | (5,671)
El Salvador | 24,734 | 25,109 | (375)
Haiti | 23,583 | 21,824 | 1,759
Jamaica | 21,412 | 21,905 | (493)
Korea | 20,576 | 19,194 | 1,382
Colombia | 18,605 | 17,956 | 649
Total Over-Prediction: 9,567
RMSE: 4,783
[1] https://www.dhs.gov/sites/default/files/publications/Lawful_Permanent_Residents_2017.pdf
15. Key Takeaways
2017 predictions close to actuals
Prediction model depends on country
SES: More appropriate
Regression: Less appropriate
16. Future Studies
Consider other immigration parameters
E.g. Total foreign born population by nationality
Add more independent variables for prediction
Socio-economic factors
Political factors
22. Appendix 5 : Regression Model for All Countries
[Chart: Regression for All Countries. x-axis: Year (2000 to 2016); y-axis: # of People (650,000 to 1,350,000)]
23. Appendix 6 : Regression Models for Top Countries
Editor's Notes
US’s immigration policy is usually based on origin-country diversity and
Note: We will only consider the clusters found in question 1.
Entered zero for every year that had a blank, assuming nobody came from those countries that year.
1. Congo: Civil conflicts until 2002, so no immigration numbers until 2002. We entered 0s for 2000 and 2001.
2. North Korea: North Korea has strict emigration policies for people leaving the country. The United States passed the North Korean Human Rights Act in 2004 to relax its policy for refugees or asylum seekers from North Korea.
Removed countries with “Former” in their names: Former USSR, Former Yugoslavia, Former Czechoslovakia
Kept individual countries as they are known today
Exception: “Serbia and Montenegro” (combined their numbers from 2010-2016 into the single “Serbia and Montenegro” row to consolidate everything)
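The two main cleaning rules above (zero-fill blanks, drop the “Former” rows) can be sketched in a few lines of Python. This is only an illustration under assumptions: the rows and numbers are made up rather than taken from the OECD data, `clean_inflows` is a hypothetical helper name, blanks are modeled as `None`, and the “Serbia and Montenegro” consolidation is not covered.

```python
def clean_inflows(raw):
    """Drop 'Former ...' rows and replace blank yearly values with 0."""
    cleaned = {}
    for country, values in raw.items():
        # Rows such as "Former USSR" double count today's countries, so skip them.
        if country.startswith("Former"):
            continue
        # A blank (None) is assumed to mean nobody came that year.
        cleaned[country] = [0 if v is None else v for v in values]
    return cleaned


# Made-up example rows (NOT the OECD figures): three years per country.
raw = {
    "Congo": [None, None, 120],
    "Former USSR": [500, 450, 400],
    "Mexico": [1000, 1100, 1200],
}
cleaned = clean_inflows(raw)
print(cleaned)
```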
Clustering on one variable – Inflows of Foreign Population by Nationality
We wanted to see if there were any clusters of countries that had similar migration trends and patterns over the years. We treated each year as a separate variable, so 17 total variables, and we set k = 10. The resulting cluster centroids represented the average inflow number for the countries in that specific cluster for each year. We exported the centroid information into Excel, and this is a snapshot of what the data looked like. We then plotted the numbers over time.
We also tried clustering by each year separately, which is essentially sorting the countries from highest to lowest, since we were only clustering by one variable.
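To make the "each year is a variable" idea concrete, here is a minimal k-means sketch in plain Python. It is not the tool we used, and everything in it is an assumption for illustration: the five countries A to E and their numbers are made up, there are 3 "years" and k = 3 instead of 17 years and k = 10, and the starting centroids are picked deterministically so the run is repeatable.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(rows, k, max_iter=100):
    """Plain Lloyd's algorithm on equal-length numeric rows (one value per year)."""
    # Deterministic start (an assumption): spread the initial centroids
    # across the rows ordered by total inflow.
    ordered = sorted(rows, key=sum)
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centroids = [list(ordered[round(i * step)]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each country joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(row, centroids[c]))
                      for row in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: a centroid is the per-year average of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(col) for col in zip(*members)]
    return labels, centroids


# Made-up inflows (in thousands) for hypothetical countries A to E.
countries = ["A", "B", "C", "D", "E"]
rows = [[160, 165, 170], [70, 72, 75], [68, 71, 74], [1, 2, 1], [2, 1, 2]]
labels, centroids = kmeans(rows, k=3)
for name, label in zip(countries, labels):
    print(name, "-> cluster", label)
```

In this toy run the very large country A sits alone in its own cluster, in the same way one very large country can end up as a single-member cluster in real data.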
Mexico ended up as its own cluster because it has the highest inflow every year, averaging about 162K per year.
The second highest average is China at about 70K.
Since Mexico’s high numbers distorted the scale of the y-axis on this graph, we removed Mexico to be able to better see details about the other clusters bunched together at the bottom.
This is the same graph, but with Mexico removed. You can see the dips and peaks more clearly for the clusters.
These clusters show that most of the immigration numbers either increase or remain consistently stable.
Most of the clusters at the bottom are consistently below 1000.
One cluster, the yellow one, shows a decreasing trend over the years. Bosnia and Herzegovina and Nicaragua in particular had numbers in the mid-10,000s in the early 2000s and have since decreased to the low 1,000s.
Grey group = first countries from Europe (UK) and Africa (Nigeria)
Our focus was on the countries in these top 3 clusters.
After Mexico, the countries consistently in the top group are China, India, and the Philippines. They are similar in that they all have large populations and high-skilled workers who want to come to the US, especially with the growth of the tech industry.
The purple cluster is Cuba and Dominican Republic and these numbers have drastically increased over the years. Cuba overtook India/Philippines in the top 3 in 2016.
The blue group has more variety in terms of which regions the countries are from. This is a group of 6 countries. We have countries from east Asia, and then countries closer to the US from the Caribbean and Central/South America.
We decided to focus on these top 12 countries, including Mexico, and predict the 2017 numbers for each country, which Aneesh will now explain.
Time series models are used for forecasting.
Compared the models based on RMSEs.
Used single exponential smoothing (SES) and double exponential smoothing (DES) to generate predictions, choosing between them based on the RMSE values.
Values of alpha and beta were calculated from the 12-year training period.
No trends were seen in these countries when training the model, hence the better RMSE.
Mexico and El Salvador show some trend during the training period.
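The two smoothing models can be sketched as follows. This is a minimal illustration, not our Excel workbook: DES is written in Holt's linear-trend form (an assumption that matches having both an alpha and a beta), the initial level and trend use a common simple choice, and the series and parameter values are made up.

```python
from math import sqrt


def ses(series, alpha):
    """Single exponential smoothing, one-step-ahead forecasts.

    forecasts[t] is the forecast for series[t]; the last element is the
    forecast for the period right after the series ends.
    """
    forecasts = [series[0]]  # common initialization: first forecast = first value
    for y in series:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts


def des(series, alpha, beta):
    """Double exponential smoothing (Holt's linear trend form)."""
    level, trend = series[0], series[1] - series[0]
    forecasts = [series[0], level + trend]  # forecasts for periods 0 and 1
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        forecasts.append(level + trend)
    return forecasts


def rmse(actual, predicted):
    """Root mean squared error over the periods covered by `actual`."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up series: one flat, one with a steady upward trend.
flat = [50, 50, 50, 50]
trending = [10, 20, 30, 40, 50]

next_flat = ses(flat, alpha=0.3)[-1]
next_trending = des(trending, alpha=0.5, beta=0.5)[-1]
ses_error = rmse(trending, ses(trending, alpha=0.5))
des_error = rmse(trending, des(trending, alpha=0.5, beta=0.5))
print(next_flat, next_trending, ses_error, des_error)
```

On the trending toy series SES lags behind the data while DES follows it, which is the same reason DES fits better when a series shows a trend.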
Simple Linear Regression:
Independent variable: Year, dependent variable: Total population per year.
Creates a straight line with an intercept and a coefficient (the slope of the line).
Used the Data Analysis tool in Excel to generate the simple linear regression models.
Used regression as a comparison for the predictions from the time series models.
Regression had the lowest RMSE for India, Cuba, and Vietnam, compared with the single and double exponential models.
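For reference, the straight-line fit can be reproduced outside Excel with the ordinary least-squares formulas. The sketch below is an illustration only: `fit_line` is a hypothetical helper name and the yearly values are made up so that the fitted line is easy to check by hand.

```python
def fit_line(years, values):
    """Ordinary least squares for values = intercept + slope * year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: inflow grows by exactly 5 per year.
years = [2000, 2001, 2002, 2003, 2004]
values = [100, 105, 110, 115, 120]

intercept, slope = fit_line(years, values)
forecast_2005 = intercept + slope * 2005
print(slope, forecast_2005)
```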
Comparison and results
For most countries, SES was better at predicting the numbers from 2012-2016, since most of the time series did not show any type of trend, only random fluctuations, which SES does a better job of adjusting for. DES was better for only two countries, Mexico and El Salvador.
Surprisingly, Regression was slightly better for 3 countries, but that assumes the migration numbers would continue along the regression line that we developed. Also, the RMSEs for Regression and SES for those countries weren’t all that different.
But since Regression is better used when there is a possible correlation between the variables, and we only had time as a variable, we decided to use SES to make the predictions for 2017 for all the countries except Mexico and El Salvador, for which we used DES.
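The comparison can be replayed from the RMSE figures on the Models Comparison slide. The sketch below copies those figures, finds the lowest-RMSE model per country, and then applies the rule we settled on (DES where it had the lowest RMSE, SES everywhere else). The variable names are just for illustration.

```python
MODELS = ("Regression", "SES", "DES")

# RMSE per country, in the order (Regression, SES, DES), from the slide.
rmse_table = {
    "Mexico": (22661, 28247, 18341),
    "China": (16476, 5630, 9148),
    "Philippines": (14477, 3667, 4394),
    "India": (6054, 6270, 9206),
    "Cuba": (10486, 12900, 13476),
    "Dominican Republic": (10336, 6597, 8821),
    "Vietnam": (4931, 5477, 5907),
    "El Salvador": (5081, 2461, 1918),
    "Haiti": (6013, 3975, 4301),
    "Jamaica": (3779, 2479, 2922),
    "Korea": (7559, 2875, 3493),
    "Colombia": (15329, 1679, 3347),
}

# Model with the lowest RMSE for each country.
best = {c: MODELS[vals.index(min(vals))] for c, vals in rmse_table.items()}
counts = {m: sum(1 for b in best.values() if b == m) for m in MODELS}

# Final rule: DES where it had the lowest RMSE, SES everywhere else.
chosen = {c: ("DES" if best[c] == "DES" else "SES") for c in best}

print(counts)
print(chosen)
```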
https://www.migrationpolicy.org/article/mexican-immigrants-united-states
However, in recent years, migration patterns have changed due to factors including the improving Mexican economy, stepped-up U.S. immigration enforcement, and the long-term drop in Mexico’s birth rates.
Comparison and results
We were able to find the actual numbers for 2017 from the Department of Homeland Security website.
Overall, our models over-predicted the numbers by 9,567 and the total RMSE is 4,783.
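Both summary figures can be recomputed from the prediction and actual columns on the accuracy slide. The sketch below copies those numbers; variance is taken as prediction minus actual, and RMSE is the square root of the mean squared variance across the 12 countries.

```python
from math import sqrt

# (our 2017 prediction, actual 2017 figure) per country, from the slide.
results = {
    "Mexico": (172716, 170581),
    "China": (80046, 71565),
    "Philippines": (53722, 49147),
    "India": (66628, 60394),
    "Cuba": (55054, 65028),
    "Dominican Republic": (59385, 58520),
    "Vietnam": (32560, 38231),
    "El Salvador": (24734, 25109),
    "Haiti": (23583, 21824),
    "Jamaica": (21412, 21905),
    "Korea": (20576, 19194),
    "Colombia": (18605, 17956),
}

# Positive variance = over-prediction, negative = under-prediction.
variances = {c: pred - actual for c, (pred, actual) in results.items()}
total_over_prediction = sum(variances.values())
overall_rmse = sqrt(sum(v ** 2 for v in variances.values()) / len(variances))

print(total_over_prediction)
print(round(overall_rmse))
```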
And now Anh will close our presentation and discuss any takeaways we gained.