Using Machine Learning to Identify
the Factors of People's Mobility
INFORMS2020
Alexander Gilgur
Jose Emmanuel Ramirez-Marquez
The research performed by Jose E. Ramirez Marquez leading to these results has received funding from the National Science Foundation, CRISP Type 2 /
Collaborative Research: Resilience Analytics: A Data-Driven Approach for Enhanced Interdependent Network Resilience, Award number 1541165.
Motivation and Problem Statement
Migration is one of the key factors of population growth in US counties.
Modeling it enables planning local infrastructure and its supply chains.
Why do people pick up and leave and
go to new places?
● Opportunities & Employment
● Infrastructure
● Affordability
● Stability
● Happiness & Unhappiness
● ... ... ...
Modeling the effects of migration factors is complicated
● Confounded Causality
● Multicollinearity
● Nonlinear relationships
● Differences between Domestic & International migration
We present a way to identify the factors that have
an effect on domestic, international, and overall
migration as a first step to modeling migration
Workflow
[Diagram: Mobility Factors -> E.D.A. -> P.C.A. (Variance Contribution, %) -> R.F.R. (P.C. Importance)]
Data Collection
1. Community Commons Map
2. Bureau of Labor Statistics Employment Data
3. Census Bureau Income Inequality Data
4. Census Population Data
5. CDC Cause of Death Data: Suicide Mortality
6. USDA Economic Research Service Education Data
Government APIs are not very easy to use
Data were downloaded as Excel or CSV files and saved to Dropbox
They were later preprocessed for analysis
EDA: Target Variables’ Correlation Matrix
Target variables:
0: NETMIG
1: DOMESTICMIG
2: INTERNATIONALMIG
Features:
0: year
1: BIRTHS
2: CENSUSPOP
3: DEATHS
4: EMPLOYED
5: ESTIMATESBASE
6: GDP
7: GDP_LOG
8: LABOR_FORCE
9: NATURALINC
10: NPOPCHG_
11: POPESTIMATE
12: POP_CDC
13: RBIRTH
14: RDEATH
15: RNATURALINC
16: SCDEATHS
17: SCRATE
18: SC_R_DEATH
19: UNEMPLOYED
20: UNEMPLOYMENT_RATE_PCT
21: bachelors_degree_plus
22: high_school_diploma
23: lt_high_school_diploma
24: mean_income__dollars
25: mean_to_median_income_ratio
26: median_income__dollars
27: pct__bachelors_degree_plus
28: pct__high_school_diploma
29: pct__lt_high_school_diploma
30: pct__some_college_or_assoc_degree
31: some_college_or_assoc_degree
The US Census Bureau (USCB) models population from migration, births, and deaths
=> collinearity & data leakage
Factors positively correlated with international migration are negatively correlated with domestic migration
=> low correlation with overall migration
These variables are not all independent; we need to make them so.
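The cancellation effect noted above can be reproduced on synthetic data. This is a minimal sketch, not the authors' analysis: the target names follow the slide's variable list, but every number is randomly generated, and the assumption that one factor moves the two migration types by equal and opposite amounts is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for a single explanatory factor (NOT the county data).
factor = rng.normal(size=n)

# Assumption: the factor raises international migration and lowers
# domestic migration by the same amount, plus independent noise.
international = factor + rng.normal(scale=0.5, size=n)
domestic = -factor + rng.normal(scale=0.5, size=n)
net = international + domestic


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


correlations = {
    "INTERNATIONALMIG": corr(factor, international),
    "DOMESTICMIG": corr(factor, domestic),
    "NETMIG": corr(factor, net),
}
for name, value in correlations.items():
    print(f"{name:>17}: {value:+.2f}")
```

In this toy setup the factor correlates strongly with each component and almost not at all with their sum, which is why the three targets are worth examining separately.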
Making Variables Independent: Principal Component Analysis
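PCA finds linear combinations of the original variables (principal components, or PCs): each PC captures the
maximum remaining variance while staying orthogonal to the previous ones.
As a result, the covariance between any two PCs is 0 => the principal components are uncorrelated (fully independent only if the data are jointly Gaussian).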
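The zero-covariance property can be checked numerically. The sketch below runs PCA by eigendecomposition of the covariance matrix using NumPy only; the three collinear features are synthetic and stand in for the county variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic, deliberately collinear features (not the county data).
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.1 * rng.normal(size=n),
    2.0 * base + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardize, then diagonalize the covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns ascending eigenvalues; reorder PCs by descending variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the principal axes to obtain the PC scores.
scores = Z @ eigvecs
pc_cov = np.cov(scores, rowvar=False)

# Off-diagonal entries are the covariances between different PCs.
off_diagonal = pc_cov - np.diag(np.diag(pc_cov))
print("largest covariance between PCs:", np.abs(off_diagonal).max())
print("explained variance ratio:", (eigvals / eigvals.sum()).round(3))
```

The off-diagonal covariances vanish up to floating-point error, while the diagonal holds the eigenvalues, that is, the variance carried by each PC.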
Principal Component Analysis and Dimensionality Reduction
Problem: thresholds are arbitrary
[Figure: explained variance (EV) of the principal components; labels on the plots: EV Threshold = 0.01, EV Threshold = 0.05, 25 PCs, 13 PCs, 5 PCs]
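To see how sensitive the retained dimensionality is to the cutoff, the sketch below counts the PCs whose individual explained-variance ratio meets a threshold. The helper `count_pcs_above` and the geometrically decaying spectrum are hypothetical; they do not reproduce the slide's numbers.

```python
import numpy as np


def count_pcs_above(explained_variance_ratio, threshold):
    """Number of PCs whose individual explained-variance ratio meets the threshold."""
    ratios = np.asarray(explained_variance_ratio, dtype=float)
    return int(np.sum(ratios >= threshold))


# Hypothetical, geometrically decaying spectrum for 25 PCs (illustration only).
raw = 0.7 ** np.arange(25)
evr = raw / raw.sum()

for threshold in (0.0, 0.01, 0.05):
    kept = count_pcs_above(evr, threshold)
    print(f"EV threshold {threshold:.2f}: keep {kept} PCs")
```

Moving the threshold by a few percentage points changes the number of retained PCs substantially, which is the arbitrariness the slide points to.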
Dimensionality Reduction: Random Forest Regression
Random Forest Regression (RFR) is one of the most robust methods for modeling multidimensional data. It builds a forest of
regression trees and combines them by averaging the predictions of all trees (majority voting is the classification counterpart).
It is non-parametric => it does not rely on knowledge of an underlying model.
A fitted forest yields an importance score for each input feature, enabling dimensionality reduction by identifying the critical
features for further analysis.
Because RFR assumes no functional form, its feature importances do not depend on the model chosen later: if further modeling
fails to identify a feature as important, RFR-based feature importance can serve as a check for model validation.
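A minimal sketch of RFR-based feature importance, assuming scikit-learn's `RandomForestRegressor` (the slides do not name a library). The data are synthetic: the target depends nonlinearly on `x0`, weakly on `x1`, and not at all on `x2`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 600

# Synthetic data (illustration only).
X = rng.normal(size=(n, 3))
y = np.sin(2.0 * X[:, 0]) + 0.3 * X[:, 1] + 0.1 * rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# For regression, the forest prediction is the mean of the per-tree predictions.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
matches = np.allclose(per_tree.mean(axis=0), forest.predict(X[:5]))
print("forest prediction equals mean of trees:", matches)

# Impurity-based importances; they sum to 1 across features.
importances = forest.feature_importances_
for i, value in enumerate(importances):
    print(f"x{i}: {value:.3f}")
```

No functional form was supplied, yet the forest assigns the largest importance to the nonlinear driver `x0`. Impurity-based importances are one of several options; permutation importance is a common alternative.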
The EDA to PCA to RFR Daisy Chain
EDA:
● Identify Correlated Features
PCA:
● Combine Features into PCs
RFR:
● Identify Important PCs
[Diagram: Features -> Principal Components -> Important Features, with the check "R² stabilized?"]
[Plot: Prediction on X_test vs. Y_test; R² (test) = 0.78]
[Plot: PC Importances; axes: Principal Component, PC Importance]
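The chain can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' code: it uses scikit-learn, synthetic data driven by three latent factors, and a hypothetical stopping rule (`tolerance`) for deciding when R² has stabilized.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 800, 8

# Synthetic collinear features driven by three latent factors (illustration only).
latent = rng.normal(size=(n, 3))
mixing = rng.normal(size=(3, p))
X = latent @ mixing + 0.1 * rng.normal(size=(n, p))
y = latent[:, 0] ** 2 + latent[:, 1] + 0.1 * rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit scaling and PCA on the training split only, to avoid leaking test data.
scaler = StandardScaler().fit(X_train)
pca = PCA().fit(scaler.transform(X_train))
pc_train = pca.transform(scaler.transform(X_train))
pc_test = pca.transform(scaler.transform(X_test))


def fit_and_score(columns):
    """Fit an RFR on the chosen PCs; return the model and its test R^2."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(pc_train[:, columns], y_train)
    return model, r2_score(y_test, model.predict(pc_test[:, columns]))


# Rank all PCs by importance, then add them one at a time until R^2 stabilizes.
full_model, r2_full = fit_and_score(list(range(p)))
ranked = np.argsort(full_model.feature_importances_)[::-1]

tolerance = 0.01  # hypothetical stopping rule, not taken from the slides
selected, previous_r2 = [], -np.inf
for pc in ranked:
    candidate = selected + [int(pc)]
    _, r2 = fit_and_score(candidate)
    if r2 - previous_r2 < tolerance:
        break
    selected, previous_r2 = candidate, r2

print("R^2 with all PCs:", round(r2_full, 3))
print("selected PCs:", selected, "R^2:", round(previous_r2, 3))
```

Selecting PCs by RFR importance replaces the arbitrary explained-variance cutoff with a criterion tied to predictive performance on held-out data.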
Going Back to Original Features
We have come full circle:
1) Identified the correlated variables
2) Applied PCA to rotate the data into a form where principal components are orthogonal.
3) Used Random-Forest Regression (RFR) to perform final PC selection based on their
importance.
Now we need to
4) Transform the selected PCs back to the original features.
5) Use these features in modeling.
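The slides do not say how step 4 is carried out. One possible approach, shown here purely as an assumption, scores each original feature by its absolute PCA loadings weighted by the importance of the selected PCs. The function `rank_original_features`, the loadings, and the importances below are all hypothetical.

```python
import numpy as np


def rank_original_features(components, pc_importances, feature_names):
    """Score original features by importance-weighted absolute PCA loadings.

    components: shape (n_pcs, n_features); row k holds the loadings of PC k.
    pc_importances: shape (n_pcs,), e.g. RFR importances of the selected PCs.
    Returns (name, score) pairs sorted from most to least influential.
    """
    components = np.asarray(components, dtype=float)
    weights = np.asarray(pc_importances, dtype=float)
    scores = weights @ np.abs(components)
    scores = scores / scores.sum()  # normalize so the scores sum to 1
    order = np.argsort(scores)[::-1]
    return [(feature_names[i], float(scores[i])) for i in order]


# Made-up loadings for two selected PCs over three features named in the slides.
feature_names = ["GDP_LOG", "UNEMPLOYMENT_RATE_PCT", "median_income__dollars"]
components = np.array([
    [0.8, 0.1, 0.6],  # first PC loads mostly on GDP_LOG and median income
    [0.1, 0.9, 0.4],  # second PC loads mostly on the unemployment rate
])
pc_importances = np.array([0.7, 0.3])  # made-up RFR importances

ranking = rank_original_features(components, pc_importances, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With a fitted scikit-learn `PCA`, the `components_` attribute has the shape expected by the `components` argument.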
Further Work
● With the features identified by the PCA->RFR process, build a model of migration in the form of
Poiseuille's Equation (flow rate proportional to pressure difference and inversely proportional to resistance)
● Infer the pressure and the resistance terms and model them based on the features:
○ Discontent
○ Socioeconomic data
○ GDP
○ Education level
○ Population volume
● Build a model predicting movements of large masses of population
Conclusions
● Mass migration is a high-complexity multivariate problem
● Its features interact, overlap, and exhibit multicollinearity in many ways
● Properly modeling mass migration requires understanding the underlying processes, which is
impossible to achieve without dimensionality reduction
● We have described how this can be achieved with publicly available data and
standard machine-learning techniques applied in a creative manner.
Thank You!
Appendix
Emigration: Mass Transfer Out
[Diagram nodes: Information about other places; Possibility of “Flight”; Affordability of “Flight”; Discontent; High Stress Levels; “Flight”]
Navier-Stokes Equation: [equation image not preserved]
Poiseuille’s Equation: [equation image not preserved]
Effluent rate is proportional to pressure difference and inversely proportional to resistance
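The proportionality stated above matches the usual hydraulic-resistance form of Poiseuille's law for laminar flow in a cylindrical pipe. The rendering below is a reconstruction from that sentence, not a copy of the slide's equation; the symbols are labels chosen here.

```latex
Q = \frac{\Delta P}{R}, \qquad R = \frac{8 \mu L}{\pi r^{4}}
```

Here Q is the volumetric flow rate, ΔP the pressure difference, and R the resistance of a pipe of length L and radius r carrying a fluid of dynamic viscosity μ. In the migration analogy, Q would stand for the flow of people, with the pressure and resistance terms inferred from the selected features.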
Immigration: Diffusion into New Host Community
[Diagram nodes: Information about other places; Possibility of “Flight”; Affordability of “Flight”; “Flight”; Affordability of Living; Media Sentiment; New Host Community Acceptance of Strangers]
Absorption of Community Characteristics
Newcomers bring with them features that can be new to the host community. As time passes, they retain fewer of
these characteristics, while members of the community that has adopted them absorb the new
features. Likewise, features of the host community become characteristics of its new members.
[Figure, panels (a) and (b): Saturation vs. Time curves. (a) Characteristics of the incoming community: in the new members of the community and in the members of the host community. (b) Characteristics of the host community: in the members of the host community and in the new members of the community.]

More Related Content

Similar to Informs2020 using machine learning to identify the factors of people's mobility

Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
DataWorks Summit
 
Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...
Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...
Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...
Data Con LA
 
Data analysis
Data analysisData analysis
Data analysis
AnandDesshpande
 
Interpretable Machine Learning
Interpretable Machine LearningInterpretable Machine Learning
Interpretable Machine Learning
inovex GmbH
 
DSDT Meetup April 2021
DSDT Meetup April 2021DSDT Meetup April 2021
DSDT Meetup April 2021
DSDT_MTL
 
Ml conference slides
Ml conference slidesMl conference slides
Ml conference slides
QuantUniversity
 
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j
 
Responsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedResponsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons Learned
Krishnaram Kenthapadi
 
Artificial intelligence use cases for International Dating Apps. iDate 2018. ...
Artificial intelligence use cases for International Dating Apps. iDate 2018. ...Artificial intelligence use cases for International Dating Apps. iDate 2018. ...
Artificial intelligence use cases for International Dating Apps. iDate 2018. ...
Lluis Carreras
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
Adam Doyle
 
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Gloria Re Calegari
 
C2_W1---.pdf
C2_W1---.pdfC2_W1---.pdf
C2_W1---.pdf
Humayun Kabir
 
Assessing M&E Systems For Data Quality
Assessing M&E Systems For Data QualityAssessing M&E Systems For Data Quality
Assessing M&E Systems For Data Quality
MEASURE Evaluation
 
QCon conference 2019
QCon conference 2019QCon conference 2019
QCon conference 2019
QuantUniversity
 
Colombo+ronzoni+fontana
Colombo+ronzoni+fontanaColombo+ronzoni+fontana
Colombo+ronzoni+fontana
Ajay Ohri
 
KPCA and Eigen Face Based Dimension Reduction Face Recognition Method
KPCA and Eigen Face Based Dimension Reduction Face Recognition MethodKPCA and Eigen Face Based Dimension Reduction Face Recognition Method
KPCA and Eigen Face Based Dimension Reduction Face Recognition Method
ijtsrd
 
Unveiling Citywide Data to Generate Artificial Intelligent Solutions
Unveiling Citywide Data to Generate Artificial Intelligent SolutionsUnveiling Citywide Data to Generate Artificial Intelligent Solutions
Unveiling Citywide Data to Generate Artificial Intelligent Solutions
RPO America
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
VishalLabde
 
Vivarana literature survey
Vivarana literature surveyVivarana literature survey
Vivarana literature survey
Tharindu Ranasinghe
 
Data Warehouse techniques on Intermediate Census and Demographic Statistics W...
Data Warehouse techniques on Intermediate Census and Demographic Statistics W...Data Warehouse techniques on Intermediate Census and Demographic Statistics W...
Data Warehouse techniques on Intermediate Census and Demographic Statistics W...
Vincenzo Patruno
 

Similar to Informs2020 using machine learning to identify the factors of people's mobility (20)

Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...
Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...
Data Con LA 2022 - Building Field-level Lineage from Scratch for Modern Data ...
 
Data analysis
Data analysisData analysis
Data analysis
 
Interpretable Machine Learning
Interpretable Machine LearningInterpretable Machine Learning
Interpretable Machine Learning
 
DSDT Meetup April 2021
DSDT Meetup April 2021DSDT Meetup April 2021
DSDT Meetup April 2021
 
Ml conference slides
Ml conference slidesMl conference slides
Ml conference slides
 
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
 
Responsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedResponsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons Learned
 
Artificial intelligence use cases for International Dating Apps. iDate 2018. ...
Artificial intelligence use cases for International Dating Apps. iDate 2018. ...Artificial intelligence use cases for International Dating Apps. iDate 2018. ...
Artificial intelligence use cases for International Dating Apps. iDate 2018. ...
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
Smart Urban Planning Support through Web Data Science on Open and Enterprise ...
 
C2_W1---.pdf
C2_W1---.pdfC2_W1---.pdf
C2_W1---.pdf
 
Assessing M&E Systems For Data Quality
Assessing M&E Systems For Data QualityAssessing M&E Systems For Data Quality
Assessing M&E Systems For Data Quality
 
QCon conference 2019
QCon conference 2019QCon conference 2019
QCon conference 2019
 
Colombo+ronzoni+fontana
Colombo+ronzoni+fontanaColombo+ronzoni+fontana
Colombo+ronzoni+fontana
 
KPCA and Eigen Face Based Dimension Reduction Face Recognition Method
KPCA and Eigen Face Based Dimension Reduction Face Recognition MethodKPCA and Eigen Face Based Dimension Reduction Face Recognition Method
KPCA and Eigen Face Based Dimension Reduction Face Recognition Method
 
Unveiling Citywide Data to Generate Artificial Intelligent Solutions
Unveiling Citywide Data to Generate Artificial Intelligent SolutionsUnveiling Citywide Data to Generate Artificial Intelligent Solutions
Unveiling Citywide Data to Generate Artificial Intelligent Solutions
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 
Vivarana literature survey
Vivarana literature surveyVivarana literature survey
Vivarana literature survey
 
Data Warehouse techniques on Intermediate Census and Demographic Statistics W...
Data Warehouse techniques on Intermediate Census and Demographic Statistics W...Data Warehouse techniques on Intermediate Census and Demographic Statistics W...
Data Warehouse techniques on Intermediate Census and Demographic Statistics W...
 

More from Alex Gilgur

INFORMS 2021 Social cohesion and emotion analysis of media during 2020 wildfi...
INFORMS 2021 Social cohesion and emotion analysis of media during 2020 wildfi...INFORMS 2021 Social cohesion and emotion analysis of media during 2020 wildfi...
INFORMS 2021 Social cohesion and emotion analysis of media during 2020 wildfi...
Alex Gilgur
 
Informs2019 machine learning and data mining in identification of unhappy c...
Informs2019   machine learning and data mining in identification of unhappy c...Informs2019   machine learning and data mining in identification of unhappy c...
Informs2019 machine learning and data mining in identification of unhappy c...
Alex Gilgur
 
Erlang capacity for_connections_cmg_1907
Erlang capacity for_connections_cmg_1907Erlang capacity for_connections_cmg_1907
Erlang capacity for_connections_cmg_1907
Alex Gilgur
 
Measuring Community Resilience: a Bayesian Approach CESUN2018
Measuring Community Resilience: a Bayesian Approach CESUN2018Measuring Community Resilience: a Bayesian Approach CESUN2018
Measuring Community Resilience: a Bayesian Approach CESUN2018
Alex Gilgur
 
The Curse of P90
The Curse of P90The Curse of P90
The Curse of P90
Alex Gilgur
 
Performance OR Capacity #CMGimPACt2016
Performance OR Capacity #CMGimPACt2016 Performance OR Capacity #CMGimPACt2016
Performance OR Capacity #CMGimPACt2016
Alex Gilgur
 
Data Science and Predictive SPC
Data Science and Predictive SPCData Science and Predictive SPC
Data Science and Predictive SPC
Alex Gilgur
 
Time Series Forecasting Modeling CMG12
Time Series Forecasting Modeling CMG12Time Series Forecasting Modeling CMG12
Time Series Forecasting Modeling CMG12
Alex Gilgur
 
CMG15 Session 525
CMG15 Session 525 CMG15 Session 525
CMG15 Session 525
Alex Gilgur
 
CSP2014 Predictive SPC
CSP2014 Predictive SPCCSP2014 Predictive SPC
CSP2014 Predictive SPC
Alex Gilgur
 
Monte carlo and network cmg'14
Monte carlo and network cmg'14Monte carlo and network cmg'14
Monte carlo and network cmg'14
Alex Gilgur
 

More from Alex Gilgur (11)

INFORMS 2021 Social cohesion and emotion analysis of media during 2020 wildfi...
INFORMS 2021 Social cohesion and emotion analysis of media during 2020 wildfi...INFORMS 2021 Social cohesion and emotion analysis of media during 2020 wildfi...
INFORMS 2021 Social cohesion and emotion analysis of media during 2020 wildfi...
 
Informs2019 machine learning and data mining in identification of unhappy c...
Informs2019   machine learning and data mining in identification of unhappy c...Informs2019   machine learning and data mining in identification of unhappy c...
Informs2019 machine learning and data mining in identification of unhappy c...
 
Erlang capacity for_connections_cmg_1907
Erlang capacity for_connections_cmg_1907Erlang capacity for_connections_cmg_1907
Erlang capacity for_connections_cmg_1907
 
Measuring Community Resilience: a Bayesian Approach CESUN2018
Measuring Community Resilience: a Bayesian Approach CESUN2018Measuring Community Resilience: a Bayesian Approach CESUN2018
Measuring Community Resilience: a Bayesian Approach CESUN2018
 
The Curse of P90
The Curse of P90The Curse of P90
The Curse of P90
 
Performance OR Capacity #CMGimPACt2016
Performance OR Capacity #CMGimPACt2016 Performance OR Capacity #CMGimPACt2016
Performance OR Capacity #CMGimPACt2016
 
Data Science and Predictive SPC
Data Science and Predictive SPCData Science and Predictive SPC
Data Science and Predictive SPC
 
Time Series Forecasting Modeling CMG12
Time Series Forecasting Modeling CMG12Time Series Forecasting Modeling CMG12
Time Series Forecasting Modeling CMG12
 
CMG15 Session 525
CMG15 Session 525 CMG15 Session 525
CMG15 Session 525
 
CSP2014 Predictive SPC
CSP2014 Predictive SPCCSP2014 Predictive SPC
CSP2014 Predictive SPC
 
Monte carlo and network cmg'14
Monte carlo and network cmg'14Monte carlo and network cmg'14
Monte carlo and network cmg'14
 

Recently uploaded

Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 

Recently uploaded (20)

Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 

Informs2020 using machine learning to identify the factors of people's mobility

  • 1. Using Machine Learning to Identify the Factors of People's Mobility INFORMS2020 Alexander Gilgur Jose Emmanuel Ramirez-Marquez The research performed by Jose E. Ramirez Marquez leading to these results has received funding from the National Science Foundation, CRISP Type 2 / Collaborative Research: Resilience Analytics: A Data-Driven Approach for Enhanced Interdependent Network Resilience, Award number 1541165.
  • 2. Motivation and Problem Statement Migration is one of the key factors of population growth in US counties. Modeling it enables planning local infrastructure and its supply chains. 2
  • 3. Why do people pick up and leave and go to the new places? ● Opportunities & Employment ● Infrastructure ● Affordability ● Stability ● Happiness & Unhappiness ● ... ... ... Migration is one of the key factors of population growth in US counties. Modeling it enables planning local infrastructure and its supply chains. Motivation and Problem Statement 3
  • 4. Modeling the effects of migration factors is complicated ● Confounded Causality ● Multicollinearity ● Nonlinear relationships ● Differences between Domestic & International migration Migration is one of the key factors of population growth in US counties. Modeling it enables planning local infrastructure and its supply chains. Motivation and Problem Statement 4 Why do people pick up and leave and go to the new places? ● Opportunities & Employment ● Infrastructure ● Affordability ● Stability ● Happiness & Unhappiness ● ... ... ...
  • 5. Modeling the effects of migration factors is complicated ● Confounded Causality ● Multicollinearity ● Nonlinear relationships ● Differences between Domestic & International migration We present a way to identify the factors that have an effect on domestic, international, and overall migration as first step to modeling migration Migration is one of the key factors of population growth in US counties. Modeling it enables planning local infrastructure and its supply chains. Motivation and Problem Statement 5 Why do people pick up and leave and go to the new places? ● Opportunities & Employment ● Infrastructure ● Affordability ● Stability ● Happiness & Unhappiness ● ... ... ...
  • 7. Data Collection 1. Community Commons Map 2. Bureau of Labor Statistics Employment Data 3. Census Bureau Income Inequality Data 4. Census Population Data 5. CDC Cause of Death Data: Suicide Mortality 6. USDA Economic Research Service Education Data Government APIs are not very easy to use Data were downloaded as Excel or CSV files and saved to Dropbox They were later preprocessed for analysis 7
  • 8. EDA: Target Variables’ Correlation Matrix. Targets: 0: NETMIG; 1: DOMESTICMIG; 2: INTERNATIONALMIG. Features: 0: year; 1: BIRTHS; 2: CENSUSPOP; 3: DEATHS; 4: EMPLOYED; 5: ESTIMATESBASE; 6: GDP; 7: GDP_LOG; 8: LABOR_FORCE; 9: NATURALINC; 10: NPOPCHG_; 11: POPESTIMATE; 12: POP_CDC; 13: RBIRTH; 14: RDEATH; 15: RNATURALINC; 16: SCDEATHS; 17: SCRATE; 18: SC_R_DEATH; 19: UNEMPLOYED; 20: UNEMPLOYMENT_RATE_PCT; 21: bachelors_degree_plus; 22: high_school_diploma; 23: lt_high_school_diploma; 24: mean_income__dollars; 25: mean_to_median_income_ratio; 26: median_income__dollars; 27: pct__bachelors_degree_plus; 28: pct__high_school_diploma; 29: pct__lt_high_school_diploma; 30: pct__some_college_or_assoc_degree; 31: some_college_or_assoc_degree.
  • 9. EDA: Target Variables’ Correlation Matrix. Targets: 0: NETMIG; 1: DOMESTICMIG; 2: INTERNATIONALMIG. Features: 0: year; 1: BIRTHS; 2: CENSUSPOP; 3: DEATHS; 4: EMPLOYED; 5: ESTIMATESBASE; 6: GDP; 7: GDP_LOG; 8: LABOR_FORCE; 9: NATURALINC; 10: NPOPCHG_; 11: POPESTIMATE; 12: POP_CDC; 13: RBIRTH; 14: RDEATH; 15: RNATURALINC; 16: SCDEATHS; 17: SCRATE; 18: SC_R_DEATH; 19: UNEMPLOYED; 20: UNEMPLOYMENT_RATE_PCT; 21: bachelors_degree_plus; 22: high_school_diploma; 23: lt_high_school_diploma; 24: mean_income__dollars; 25: mean_to_median_income_ratio; 26: median_income__dollars; 27: pct__bachelors_degree_plus; 28: pct__high_school_diploma; 29: pct__lt_high_school_diploma; 30: pct__some_college_or_assoc_degree; 31: some_college_or_assoc_degree. USCB (US Census Bureau) models population by migrations, births, and deaths => collinearity & data leakage. Factors positively correlated with international migration are negatively correlated with domestic migration => low correlation with overall migration. These variables are not all independent; we need to make them such.
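The cancellation effect noted on this slide can be illustrated with a small synthetic sketch. It assumes that net migration is the sum of domestic and international migration; the variable names and the data below are hypothetical and are not taken from the authors' dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical factor that attracts international migrants
# and pushes domestic migrants away.
factor = rng.normal(size=n)
international = factor + 0.5 * rng.normal(size=n)
domestic = -factor + 0.5 * rng.normal(size=n)

# Assumption: net migration is the sum of the two components.
net = international + domestic

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

r_international = corr(factor, international)
r_domestic = corr(factor, domestic)
r_net = corr(factor, net)
print(r_international, r_domestic, r_net)
```

In this toy setup the factor correlates strongly, with opposite signs, with each component, while its correlation with the sum is close to zero. This is why the three targets are examined separately.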
  • 10. Making Variables Independent: Principal Component Analysis. PCA finds orthogonal linear combinations (principal components, or PCs) of the original variables such that each successive PC captures the maximum remaining variance. The covariance between any two PCs is therefore 0 => the principal components are uncorrelated (linearly independent).
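A minimal sketch of this property, using NumPy on synthetic data (the data and variable names are hypothetical): after projecting standardized features onto the eigenvectors of their covariance matrix, the covariance matrix of the resulting components is diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 rows, 3 features, two of them strongly correlated.
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Standardize, then eigendecompose the covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; put the largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the principal axes to get the PC scores.
scores = Z @ eigvecs

# The covariance matrix of the scores is diagonal: PCs are uncorrelated.
pc_cov = np.cov(scores, rowvar=False)
off_diag = pc_cov - np.diag(np.diag(pc_cov))
print(np.abs(off_diag).max())
```

Zero covariance means the components are uncorrelated; it implies full statistical independence only under additional assumptions such as joint normality.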
  • 11. Principal Component Analysis and Dimensionality Reduction. Problem: thresholds are arbitrary. Plot labels: EV Threshold = 0.01; 25 PCs; 13 PCs; EV Threshold = 0.05; 5 PCs.
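To show why the threshold matters, here is a small sketch that counts how many PCs survive different cutoffs. It assumes that "EV" stands for the explained-variance ratio of each PC; the data and the helper `n_components_above` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 300 rows, 10 features driven by 3 latent factors plus noise.
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.3 * rng.normal(size=(300, 10))

# Explained-variance ratio of each PC, largest first.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
ev_ratio = eigvals / eigvals.sum()

def n_components_above(ev_ratio, threshold):
    """Count the PCs whose explained-variance ratio meets the threshold."""
    return int((ev_ratio >= threshold).sum())

for threshold in (0.01, 0.05):
    print(threshold, n_components_above(ev_ratio, threshold))
```

The number of retained PCs depends entirely on the chosen cutoff, which is the motivation for replacing a fixed threshold with an importance-based selection.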
  • 12. Dimensionality Reduction: Random Forest Regression. Random Forest Regression (RFR) is one of the most robust methods for modeling multidimensional data. It builds a forest of regression trees and ensembles them by averaging the predictions of all trees (majority voting is the classification analogue). It is non-parametric => it does not rely on knowledge of an underlying model. A fitted forest yields an importance score for every input feature, enabling dimensionality reduction by identifying the critical features for further analysis. Because RFR assumes no functional form, RFR-based feature importance does not depend on the model chosen for later analysis, so if further modeling fails to identify a feature as important, we can rely on RFR-based feature importance for model validation.
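A minimal sketch of RFR-based feature ranking with scikit-learn follows. The data are synthetic and the hyperparameters are arbitrary choices, not the authors' settings. The sketch also checks that, in scikit-learn, the forest's regression prediction is the mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Hypothetical data: only the first two of five features drive the target.
X = rng.normal(size=(600, 5))
y = 3.0 * X[:, 0] + np.sin(2.0 * X[:, 1]) + 0.1 * rng.normal(size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rfr = RandomForestRegressor(n_estimators=200, random_state=0)
rfr.fit(X_train, y_train)

# The forest's prediction is the mean of the individual trees' predictions.
tree_mean = np.mean([tree.predict(X_test) for tree in rfr.estimators_], axis=0)

# Impurity-based importances, normalized to sum to 1.
importances = rfr.feature_importances_
ranking = np.argsort(importances)[::-1]
print("R2 on test data:", rfr.score(X_test, y_test))
print("features ranked by importance:", ranking)
```

Impurity-based importances can be biased toward high-cardinality or correlated inputs, which is one reason to decorrelate the inputs with PCA before ranking them.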
  • 13. The EDA to PCA to RFR Daisy Chain. EDA: ● Identify Correlated Features. PCA: ● Combine Features into PCs. RFR: ● Identify Important PCs. Flow: Features -> Principal Components.
  • 14. The EDA to PCA to RFR Daisy Chain. EDA: ● Identify Correlated Features. PCA: ● Combine Features into PCs. RFR: ● Identify Important PCs. Flow: Features -> Principal Components. Plot: prediction on Xtest vs. Ytest, R² (test) = 0.78.
  • 15. The EDA to PCA to RFR Daisy Chain. EDA: ● Identify Correlated Features. PCA: ● Combine Features into PCs. RFR: ● Identify Important PCs. Flow: Features -> Principal Components -> Important Features -> R² stabilized? Plots: prediction on Xtest vs. Ytest, R² (test) = 0.78 (shown twice).
  • 16. The EDA to PCA to RFR Daisy Chain. EDA: ● Identify Correlated Features. PCA: ● Combine Features into PCs. RFR: ● Identify Important PCs. Flow: Features -> Principal Components -> Important Features -> R² stabilized? Plots: prediction on Xtest vs. Ytest, R² (test) = 0.78; PC Importances (PC Importance vs. Principal Component).
  • 17. The EDA to PCA to RFR Daisy Chain. EDA: ● Identify Correlated Features. PCA: ● Combine Features into PCs. RFR: ● Identify Important PCs. Flow: Features -> Principal Components -> Important Features -> R² stabilized? Plots: prediction on Xtest vs. Ytest, R² (test) = 0.78; PC Importances (PC Importance vs. Principal Component).
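One possible reading of the chain on these slides can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, and the stopping tolerance `tol` and the greedy "add PCs until R² stops improving" loop are hypothetical choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical data: 8 correlated features generated from 3 latent factors.
latent = rng.normal(size=(800, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.2 * rng.normal(size=(800, 8))
y = 2.0 * latent[:, 0] - latent[:, 1] + 0.1 * rng.normal(size=800)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit scaling and PCA on the training data only, to avoid leakage.
scaler = StandardScaler().fit(X_train)
pca = PCA().fit(scaler.transform(X_train))
pc_train = pca.transform(scaler.transform(X_train))
pc_test = pca.transform(scaler.transform(X_test))

# Rank the PCs by RFR importance.
full = RandomForestRegressor(n_estimators=100, random_state=0)
full.fit(pc_train, y_train)
ranked = np.argsort(full.feature_importances_)[::-1]

# Add PCs in order of importance until R2 stops improving by at least tol.
tol = 0.01  # hypothetical stabilization tolerance
selected, r2_history = [], []
for pc in ranked:
    trial = selected + [int(pc)]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(pc_train[:, trial], y_train)
    r2 = model.score(pc_test[:, trial], y_test)
    if r2_history and r2 - r2_history[-1] < tol:
        break  # R2 has stabilized
    selected, r2_history = trial, r2_history + [r2]

print("selected PCs:", selected)
print("R2 history:", r2_history)
```

The slides report R² on the test set, so the sketch does the same. In practice a separate validation split would be a safer place to make the stopping decision.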
  • 18. Going Back to Original Features. We have come full circle: 1) Identified the correlated variables. 2) Applied PCA to rotate the data into a form where principal components are orthogonal. 3) Used Random Forest Regression (RFR) to perform final PC selection based on their importance. Now we need to: 4) Transform the selected PCs back to the original features. 5) Use these features in modeling.
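The slides do not say how step 4 is carried out. One common heuristic, sketched below as an assumption, scores each original feature by its squared loadings on the selected PCs, weighted by the importance of each PC. The feature names, the selected PCs, and their importances are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# Hypothetical data in which f1 nearly duplicates f0.
X = rng.normal(size=(400, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=400)

# PCA via eigendecomposition of the covariance of standardized data.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order]  # column j holds the feature weights of PC j

selected_pcs = [0]               # hypothetical: PCs kept by the RFR step
pc_importance = np.array([1.0])  # hypothetical importances of those PCs

# Score each original feature by its importance-weighted squared loadings.
feature_scores = (loadings[:, selected_pcs] ** 2) @ pc_importance
ranking = [feature_names[i] for i in np.argsort(feature_scores)[::-1]]
print(ranking)
```

Because each loading vector has unit length, the scores for a single PC sum to that PC's importance, which keeps the feature scores comparable across PCs.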
  • 19. Further Work ● With the features identified by the PCA->RFR process, build the model of migration in the form of Poiseuille's Equation. ● Infer the pressure and the resistance terms and model them based on the features: ○ Discontent ○ Socioeconomic data ○ GDP ○ Education level ○ Population volume ● Build a model predicting movements of large masses of population
  • 20. Conclusions ● Mass migration is a high-complexity multivariate problem ● Its features interact, overlap, and exhibit multicollinearity in many ways ● Properly modeling mass migrations requires an understanding of the underlying processes, which is impossible to achieve without dimensionality reduction ● We have described how this can be achieved based on publicly available data and using standard machine-learning techniques in a creative manner.
  • 23. Emigration: Mass Transfer Out. Diagram labels: Information about other places; Possibility of “Flight”; Affordability of “Flight”; “Flight”; Discontent; High Stress Levels. Equations referenced: Navier-Stokes Equation; Poiseuille’s Equation. Effluent rate is proportional to pressure difference and inversely proportional to resistance.
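The proportionality stated on this slide corresponds to the standard Hagen-Poiseuille relation from fluid mechanics, written below in conventional notation. The symbols are the textbook ones and are an assumption here, since the slide's own equations did not survive extraction: Q is the volumetric flow rate, ΔP the pressure difference, R the hydraulic resistance, μ the viscosity, L the pipe length, and r the pipe radius.

```latex
Q = \frac{\Delta P}{R}, \qquad R = \frac{8 \mu L}{\pi r^{4}}
```

In the migration analogy suggested by the slides, Q would play the role of the emigration rate, ΔP the "pressure" inferred from features such as discontent, and R the resistance to leaving. How the authors map features to these terms is left to their further work.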
  • 24. Immigration: Diffusion into New Host Community. Diagram labels: Information about other places; Possibility of “Flight”; Affordability of “Flight”; “Flight”; Affordability of Living; Media Sentiment; New Host Community; Acceptance of Strangers.
  • 25. Absorption of Community Characteristics. Newcomers bring with them features that can be new to the host community. As time passes, they retain fewer of these characteristics, while members of the community that has adopted them absorb the new features. Likewise, features of the host community become characteristics of its new members. Figure panels (axes: Time, Saturation): (a) characteristics of the incoming community, in the new members of the community and in the members of the host community; (b) characteristics of the host community, in the members of the host community and in the new members of the community.