Gradient Boosting Trees for Spatial Data with
Application to Public Health
Bo Li
Aug 7, 2019
University of Illinois at Urbana-Champaign
SAMSI GDRR Workshop, Raleigh, NC
Acknowledgement
Collaborators:
• Peng Wang (University of Cincinnati)
• Yunzhang Zhu (Ohio State University)
Support: NSF-AGS-1602845; NSF-DMS-1830312
Spatial data
Figure 1: One example from the google search for spatial data
Introduction
• Spatial and spatiotemporal data are prevalent in many research
areas, including climatology, agriculture, public health, and
various environmental studies.
• Making predictions over a spatial domain or into the future is a
typical problem of interest, for example predicting rainfall at
unobserved locations or predicting county-level HIV rates for the next year.
• The covariance structure of the spatial and spatiotemporal processes
plays a central role in prediction.
• Classical geostatistical methods usually assume some parametric
models for the underlying process.
Introduction
Consider a random field $\{Y(s) : s \in D\}$, where $D$ is the domain of interest in
$d$-dimensional Euclidean space $\mathbb{R}^d$.
A common framework for modeling this stochastic process is
$$Y(s) = \mu(s) + Z(s) + \epsilon(s),$$
where $\mu(s)$ is a function of known covariates $x(s) = (x_1(s), \ldots, x_p(s))^T$,
often taking the form $\mu(s) = x(s)^T \beta$, and $\epsilon(s)$ is white noise.
Much of the literature focuses on modeling the covariance of the zero-mean
random process $Z(s)$.
Introduction
Common assumptions made on Z(s):
• Second order stationarity
• Z(s) follows a Gaussian random process
• The covariance structure of Z(s) follows a parametric model
• Nonparametric and nonstationary models are more robust and
flexible but still based on certain assumptions about the distribution
and covariance structure
We would like to try a completely assumption-free method for making
predictions.
Gradient Boosting Machine
• A machine learning technique that ensembles weak learners, usually
small decision trees, to produce a strong, accurate prediction model.
• Top performer in the benchmark of Olson et al. (2018), which compared
13 machine learning approaches on 165 data sets.
• On the data analytics competition platform Kaggle, gradient boosting is
the winning algorithm in almost every structured-data competition (Usmani,
2017).
• Probably the most widely applied machine learning approach in industry
for structured data.
Gradient Boosting Machine
Given observations $(y_i, x_i)$, $i = 1, \ldots, n$, gradient boosting trees fit a
non-parametric additive model
$$F(x_i) = \sum_{m=1}^{M} \nu h_m(x_i)$$
by minimizing the loss function $\sum_{i=1}^{n} l(y_i, F_m(x_i))$.
• Each $h_m(x_i)$ often comes from a decision tree, and the gradient
boosting algorithm (Friedman, 2001) fits $h_m(x)$ sequentially.
• The constant $\nu$ is a shrinkage factor that controls the learning rate
and helps prevent overfitting.
• The fit can be interpreted as a nonparametric additive model.
Gradient Boosting Machine
At each step $m$, we first generate a pseudo-response $\tilde{y}_i$ by
$$\tilde{y}_i = -\left[\frac{\partial l(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)},$$
where $F_{m-1}(x) = \sum_{m'=1}^{m-1} \nu h_{m'}(x)$.
• Then $h_m(x_i)$ is fit based on $(\tilde{y}_i, x_i)$, $i = 1, \ldots, n$, by finding the tree
$h_m(x)$ that minimizes $\sum_{i=1}^{n} l(\tilde{y}_i, h_m(x_i))$.
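The two steps above (form the pseudo-response, then fit a tree to it) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the talk: the loss is taken to be $l(y, F) = \frac{1}{2}(y - F)^2$, so the pseudo-response is simply the residual $y_i - F_{m-1}(x_i)$; each tree is a one-split stump on a single predictor; and the names `fit_stump`, `gradient_boost` and `predict` are hypothetical.

```python
def fit_stump(x, r):
    """Least-squares stump on one predictor.

    Returns (threshold, left_mean, right_mean). Assumes x has at least
    two distinct values.
    """
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold between adjacent values
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]


def predict_stump(stump, xi):
    t, ml, mr = stump
    return ml if xi <= t else mr


def gradient_boost(x, y, n_trees=100, nu=0.1):
    """Fit h_1, ..., h_M sequentially; nu is the shrinkage factor."""
    stumps = []
    F = [0.0] * len(y)  # F_0(x) = 0
    for _ in range(n_trees):
        # Pseudo-response: negative gradient of 1/2 (y - F)^2 at F_{m-1}
        pseudo = [yi - Fi for yi, Fi in zip(y, F)]
        stump = fit_stump(x, pseudo)
        stumps.append(stump)
        # F_m = F_{m-1} + nu * h_m
        F = [Fi + nu * predict_stump(stump, xi) for Fi, xi in zip(F, x)]
    return stumps


def predict(stumps, xi, nu=0.1):
    """Evaluate F(x) = sum_m nu * h_m(x); nu must match the value used in fitting."""
    return sum(nu * predict_stump(s, xi) for s in stumps)


# Toy data: a step function that the boosted stumps should recover
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
stumps = gradient_boost(x, y, n_trees=100, nu=0.1)
```

With a small shrinkage factor each stump removes only a fraction of the remaining residual, which is why many trees are needed even on this trivial example.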
Decision Tree
• A tree based model is a non-parametric local constant prediction
model estimated by recursively partitioning the predictors.
• The most commonly used algorithm is probably the classification and
regression tree (CART).
• Traditionally, splits are determined by the predictors alone.
• However, spatial data have the distinctive feature of spatial correlation,
which should be taken into account.
• How can spatial information be integrated into a decision tree?
Spatial Gradient Boosting Tree
• First estimate spatial clusters based on the dependency structure of the data
• Take the average of the observations within each cluster and use it as
the cluster-observation
• Treat the cluster-observation as an additional covariate or predictor so that it
contributes to the splitting rule
• Once we generate the cluster-observations, we can adopt any
tree-fitting algorithm that is appropriate
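As a minimal sketch of the averaging step, the snippet below appends each sample's cluster mean to its predictor row. It assumes the cluster labels have already been estimated by some other procedure; the function name `add_cluster_observation` and the toy data are hypothetical.

```python
from collections import defaultdict


def add_cluster_observation(X, y, cluster_id):
    """Append the within-cluster mean of y to each row of X.

    X: list of predictor rows, y: list of responses,
    cluster_id: one (assumed known) cluster label per sample.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for yi, c in zip(y, cluster_id):
        sums[c] += yi
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    # Each augmented row is (x_i, cluster mean for the cluster of i)
    return [list(xi) + [means[c]] for xi, c in zip(X, cluster_id)]


X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 10.0, 20.0]
clusters = [0, 0, 1, 1]
Z = add_cluster_observation(X, y, clusters)
```

The augmented rows can then be passed to any tree-fitting routine in place of the original predictors.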
Spatial Gradient Boosting Tree
• To cluster the observations, we use convex clustering over an
undirected graph (Qian, 2019).
• Consider each location or region as a node of a graph.
• We partition the locations using convex clustering over the given
graph by minimizing the criterion
$$\sum_{i=1}^{n} l(y_i, \mu_i) + \lambda \sum_{(i,j) \in E} |\mu_i - \mu_j|, \qquad (1)$$
where $E$ is the set of all edges in the graph, and
$l(y_i, \mu_i) = (y_i - \mu_i)^2$ is the least squares loss.
• Use a variant of the alternating direction method of multipliers
(ADMM) algorithm for the optimization.
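To make criterion (1) concrete, the sketch below evaluates it for a candidate fit and reads clusters off that fit. It does not implement the ADMM solver. The cluster-extraction rule (locations joined by an edge whose fitted values coincide belong to the same cluster) is an assumption about how the fused solution is turned into a partition, and the names `objective` and `clusters_from_fit` are hypothetical.

```python
def objective(y, mu, edges, lam):
    """Criterion (1): squared-error fit plus lam times the sum of |mu_i - mu_j| over edges."""
    fit = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    penalty = sum(abs(mu[i] - mu[j]) for i, j in edges)
    return fit + lam * penalty


def clusters_from_fit(mu, edges, tol=1e-8):
    """Label connected components of the graph restricted to fused edges.

    An edge (i, j) counts as fused when |mu_i - mu_j| <= tol (assumed rule).
    """
    parent = list(range(len(mu)))

    def find(a):
        # Union-find root lookup with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j in edges:
        if abs(mu[i] - mu[j]) <= tol:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(mu))]


# Four locations on a path graph 0 - 1 - 2 - 3
y = [1.0, 1.2, 5.0, 5.2]
edges = [(0, 1), (1, 2), (2, 3)]
lam = 0.5

unfused = objective(y, y, edges, lam)                 # mu = y: no fit error, full penalty
mu_fused = [1.1, 1.1, 5.1, 5.1]                       # two groups of equal fitted values
fused = objective(y, mu_fused, edges, lam)
labels = clusters_from_fit(mu_fused, edges)
```

On this toy graph the fused candidate has a lower objective than the unfused one, which illustrates how the penalty term rewards neighbouring locations that share a fitted value.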
Spatial clustering
Figure 2: Illustration of clustering spatial observations
Spatial clustering
Figure 3: Illustration of clustering spatial observations.
Decision tree
Let $\hat{\mu}_i$ be the cluster-observation. Grow a decision tree using $y_i$ as the
response and $Z_i = (x_i, \hat{\mu}_i)$ as the predictors with the following recursive
partitioning algorithm:
(a) Start at the root node, which contains all the samples.
(b) At each node, find the best split on each predictor $j$ in $Z_i$ by
minimizing
$$L_j = \sum_{i \in R_1(j)} l(y_i, \hat{y}_{R_1(j)}) + \sum_{i \in R_2(j)} l(y_i, \hat{y}_{R_2(j)}) + \sum_{i \in R_{NA}(j)} l(y_i, \hat{y}_{R_{NA}(j)}).$$
(c) Find $j^* = \arg\min_j L_j$ and use $R_1(j^*)$, $R_2(j^*)$ and $R_{NA}(j^*)$ to split the
node into three child nodes.
(d) Repeat (b) and (c) until a stopping criterion is reached.
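One split of this procedure can be sketched as follows. The sketch rests on several assumptions that the slides do not spell out: the third region is taken to collect the samples whose value of the splitting predictor is missing (coded as `None`), the loss is squared error, each region is predicted by its mean, and the chosen predictor is the one with the smallest loss, in line with the minimization in step b. The function names are hypothetical.

```python
def sse(values):
    """Squared-error loss of predicting a region by its mean (0 for an empty region)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split_for_predictor(z, y):
    """Best threshold for one predictor; None marks a missing value.

    Returns (loss, threshold), where loss sums over the left region,
    the right region and the missing-value region.
    """
    na = [yi for zi, yi in zip(z, y) if zi is None]
    obs = sorted((zi, yi) for zi, yi in zip(z, y) if zi is not None)
    best = (float("inf"), None)
    for k in range(1, len(obs)):
        if obs[k - 1][0] == obs[k][0]:
            continue  # no threshold separates equal values
        t = (obs[k - 1][0] + obs[k][0]) / 2
        loss = (sse([yi for _, yi in obs[:k]])
                + sse([yi for _, yi in obs[k:]])
                + sse(na))
        if loss < best[0]:
            best = (loss, t)
    return best


def best_split(Z, y):
    """Search all predictors; returns (best predictor index, threshold, loss)."""
    results = []
    for j in range(len(Z[0])):
        loss, t = best_split_for_predictor([row[j] for row in Z], y)
        results.append((loss, j, t))
    loss, j, t = min(results)
    return j, t, loss


# Column 0: an ordinary covariate; column 1: the cluster-observation
Z = [[1.0, 3.0], [2.0, 3.0], [3.0, 15.0], [None, 15.0], [5.0, None]]
y = [1.0, 1.0, 9.0, 9.0, 5.0]
j_star, threshold, loss = best_split(Z, y)
```

In this toy example the cluster-observation column separates the responses perfectly, so it is selected over the ordinary covariate.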
Predicting Aedes Aegypti Appearance
Figure 4: Aedes Aegypti, also known as the yellow fever mosquito
A mosquito that can carry and spread multiple viruses, including the
dengue, chikungunya, Zika, Mayaro and yellow fever viruses.
Predicting Aedes Aegypti Appearance
• The yellow fever mosquito is present across all regions of the United States
and can persist in an area for multiple years, in some cases indefinitely
• Once it appears, it tends to stay around until winter
• It is important to predict whether and when it will appear in spring
or early summer
• The spatio-temporal dynamics of its distribution are not well
understood, so predicting its presence remains a
challenging problem.
Predicting Aedes Aegypti Appearance
• Goal: predict Aedes aegypti appearances in March, April and May
for California counties
• Use the 2017 data as the training data
• Response: abundances generated from data
• Covariates: abundances in previous months and mean daily minimum temperature
• Challenges: small sample size and missing values
Predicting Aedes Aegypti Appearance
Table 1: Prediction results for the presence of Aedes aegypti in three months of
2018 for 41 California counties

                    Prediction method
                    Spatial GBM         Regular GBM         Naive*
Month  Truth        Present  Absent     Present  Absent     Present  Absent
Mar    Present         5        0          5        0          4        1
Mar    Absent          2       34          4       32          0       33
Apr    Present         8        1          5        4          5        4
Apr    Absent          1       31          0       32          0       32
May    Present         9        2          9        2          9        2
May    Absent          0       30          0       30          0       30

*Three counties have missing values for February 2018 after imputation
Predicting Aedes Aegypti Appearance
• In all three months, the regular GBM performs almost identically to
the naive predictor, which simply uses the data from the previous
month.
• In both March and April, when mosquitoes start to appear, the
proposed spatial GBM method holds an advantage over the other
approaches.
• By incorporating spatial information, spatial GBM correctly identified
8 of the 9 counties where Aedes aegypti starts to emerge in April,
whereas the other methods identified only 5.
HIV new diagnosis prediction
• New HIV diagnosis rates for all US counties, 2008-2015
• Data available at AIDSVU.org
• Shand et al. (2018) proposed a spatially varying autoregressive
(SVAR) model to make one-year-ahead predictions
• We use the 2008-2014 data to fit the model and predict the HIV
rate for 2015
• Compare our results to
    • Regular gradient boosting, which does not take spatial information into
    account
    • The SVAR model of Shand et al. (2018) for California, Florida and the
    New England states
HIV new diagnosis prediction
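Data are suppressed if below 5 per 100,000 people.
We impute the missing data according to the following rule:
$$\tilde{y}_i = \begin{cases}
0, & \text{if } y_j = \mathrm{NA} \text{ for all } j \in C_i, \text{ and } y_k = \mathrm{NA} \text{ for all } k \in C_j \text{ and } j \in C_i, \\
\mathrm{Rpoisson}(0.5), & \text{if } y_j = \mathrm{NA} \text{ for all } j \in C_i \text{ but } y_k \neq \mathrm{NA} \text{ for some } k \in C_j \text{ and } j \in C_i, \\
\mathrm{Rpoisson}(\bar{y}_{C_i}), & \text{if } y_j \neq \mathrm{NA} \text{ for some } j \in C_i,
\end{cases}$$
where $\mathrm{Rpoisson}(0.5)$ is a random integer between 0 and 4 generated using a
normalized Poisson(0.5),
$$\bar{y}_{C_i} = \frac{\sum_{j \in C_i,\, y_j \neq \mathrm{NA}} y_j}{|\{j \in C_i : y_j \neq \mathrm{NA}\}|},$$
and $\mathrm{Rpoisson}(\bar{y}_{C_i})$ is generated using a truncated Poisson($\bar{y}_{C_i}$).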
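The imputation rule can be sketched in plain Python as below. Several details are assumptions rather than statements from the slides: $C_i$ is read as the set of neighbouring counties of county $i$, missing values are coded as `None`, and both Poisson draws are restricted to the integers 0 to 4 and renormalized (the slides state this range only for the Poisson(0.5) case). The names `truncated_poisson` and `impute` are hypothetical.

```python
import math
import random


def truncated_poisson(rate, upper=4, rng=random):
    """Draw from Poisson(rate) restricted to {0, ..., upper} and renormalized."""
    support = list(range(upper + 1))
    # Weights proportional to rate^k / k!; the factor exp(-rate) cancels
    # when the weights are normalized.
    weights = [rate ** k / math.factorial(k) for k in support]
    return rng.choices(support, weights=weights, k=1)[0]


def impute(i, y, neighbors, rng=random):
    """Impute a suppressed value for county i following the three-case rule."""
    observed = [y[j] for j in neighbors[i] if y[j] is not None]
    if observed:
        # Some neighbour is observed: use the mean of the observed neighbours
        return truncated_poisson(sum(observed) / len(observed), rng=rng)
    second_order = [y[k] for j in neighbors[i] for k in neighbors[j]]
    if any(v is not None for v in second_order):
        # No neighbour observed, but some neighbour of a neighbour is
        return truncated_poisson(0.5, rng=rng)
    # Nothing observed within two steps
    return 0


# Toy chain of counties a - b - c - d, where only d is observed
y = {"a": None, "b": None, "c": None, "d": 12.0}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
rng = random.Random(0)
imputed = {i: impute(i, y, neighbors, rng=rng) for i in y if y[i] is None}
```

In the toy chain, county a falls in the first case and is set to 0, county b uses Poisson(0.5), and county c uses the mean of its observed neighbour.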
Performance comparison over all counties in US
Prediction with regular GBM, MSE = 233

Truth        (0, 5]    (5, 30]    (30, 80]    (80, inf)
(0, 5]        2150       199          0           0
(5, 30]         98       461          0           0
(30, 80]        16        62          4           0
(80, inf)        1         1          1           0

Prediction with spatial GBM, MSE = 232

Truth        (0, 5]    (5, 30]    (30, 80]    (80, inf)
(0, 5]        2113       226         10           0
(5, 30]         86       468          5           0
(30, 80]        15        39         28           0
(80, inf)        1         0          2           0

The spatial model does much better for counties with high HIV diagnosis rates!
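The claim about high-rate counties can be checked directly from the two tables. The sketch below assumes that rows are true rate classes and columns are predicted classes, which is consistent with the row totals being identical in both tables; the helper names are hypothetical.

```python
# Rows: true class; columns: predicted class (assumed orientation).
# Class order: (0, 5], (5, 30], (30, 80], (80, inf)
regular = [
    [2150, 199, 0, 0],
    [98, 461, 0, 0],
    [16, 62, 4, 0],
    [1, 1, 1, 0],
]
spatial = [
    [2113, 226, 10, 0],
    [86, 468, 5, 0],
    [15, 39, 28, 0],
    [1, 0, 2, 0],
]


def per_class_recall(cm):
    """Fraction of each true class that is predicted correctly."""
    return [row[k] / sum(row) for k, row in enumerate(cm)]


def accuracy(cm):
    """Overall fraction of counties on the diagonal."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


recall_regular = per_class_recall(regular)
recall_spatial = per_class_recall(spatial)
```

For the (30, 80] class, the regular GBM recovers 4 of 82 counties and the spatial GBM recovers 28 of 82, while overall accuracy is nearly the same for the two methods (2615 versus 2609 correct out of 2993).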
Comparison to SVAR model for all counties
Table 2: Comparison between three models in terms of misclassification rate
for three states

               SVAR Model    Spatial GBM    Regular GBM
California        1             0.586          0.586
Florida           1             0.895          0.776
New England       0.980         0.592          0.586

Table 3: Comparison between three models in terms of MSE for three states

               SVAR Model    Spatial GBM    Regular GBM
California       76.18          13.17          12.82
Florida         266.26          95.11         116.66
New England      47.94          28.78          27.34
Conclusion
• We propose a spatial gradient boosting tree for making predictions with
spatial data
• It is a statistical machine learning approach that incorporates spatial
information non-parametrically through spatial clusters
• A proof of concept: spatial prediction does not necessarily require modeling the second moments
• Future work: take the temporal correlation in spatio-temporal data
more seriously
25

More Related Content

Similar to GDRR Opening Workshop - Gradient Boosting Trees for Spatial Data Prediction - Bo Li, August 7, 2019

Ground measured data vs meteo data sets:57 locations in India_01.01.2020
Ground measured data vs meteo data sets:57 locations in India_01.01.2020Ground measured data vs meteo data sets:57 locations in India_01.01.2020
Ground measured data vs meteo data sets:57 locations in India_01.01.2020Gensol Engineering Limited
 
Interpolation 2013
Interpolation 2013Interpolation 2013
Interpolation 2013Atiqa Khan
 
Argument to use Both Statistical and Graphical Evaluation Techniques in Groun...
Argument to use Both Statistical and Graphical Evaluation Techniques in Groun...Argument to use Both Statistical and Graphical Evaluation Techniques in Groun...
Argument to use Both Statistical and Graphical Evaluation Techniques in Groun...IRJET Journal
 
G.Arbia, Statistical problems with micro-data geo-masked for confidentiality
G.Arbia,  Statistical problems with micro-data geo-masked for confidentialityG.Arbia,  Statistical problems with micro-data geo-masked for confidentiality
G.Arbia, Statistical problems with micro-data geo-masked for confidentialityIstituto nazionale di statistica
 
COVID-19 (Coronavirus Disease) Outbreak Prediction Using a Susceptible-Expos...
COVID-19 (Coronavirus Disease) Outbreak  Prediction Using a Susceptible-Expos...COVID-19 (Coronavirus Disease) Outbreak  Prediction Using a Susceptible-Expos...
COVID-19 (Coronavirus Disease) Outbreak Prediction Using a Susceptible-Expos...Dr. Amir Mosavi, PhD., P.Eng.
 
FPP 1. Getting started
FPP 1. Getting startedFPP 1. Getting started
FPP 1. Getting startedRob Hyndman
 
Some sampling techniques for big data analysis
Some sampling techniques for big data analysisSome sampling techniques for big data analysis
Some sampling techniques for big data analysisJae-kwang Kim
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine LearningIRJET Journal
 
3. Statistical inference_anesthesia.pptx
3.  Statistical inference_anesthesia.pptx3.  Statistical inference_anesthesia.pptx
3. Statistical inference_anesthesia.pptxAbebe334138
 
Introduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesIntroduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesAdi Handarbeni
 
Forest Change Detection in incomplete satellite images with deep neural networks
Forest Change Detection in incomplete satellite images with deep neural networksForest Change Detection in incomplete satellite images with deep neural networks
Forest Change Detection in incomplete satellite images with deep neural networksAatif Sohail
 
Statistical inference: Estimation
Statistical inference: EstimationStatistical inference: Estimation
Statistical inference: EstimationParag Shah
 

Similar to GDRR Opening Workshop - Gradient Boosting Trees for Spatial Data Prediction - Bo Li, August 7, 2019 (20)

How to Decide the Best Fuzzy Model in ANFIS
How to Decide the Best Fuzzy Model in ANFIS How to Decide the Best Fuzzy Model in ANFIS
How to Decide the Best Fuzzy Model in ANFIS
 
50120130405020
5012013040502050120130405020
50120130405020
 
Ground measured data vs meteo data sets:57 locations in India_01.01.2020
Ground measured data vs meteo data sets:57 locations in India_01.01.2020Ground measured data vs meteo data sets:57 locations in India_01.01.2020
Ground measured data vs meteo data sets:57 locations in India_01.01.2020
 
Interpolation 2013
Interpolation 2013Interpolation 2013
Interpolation 2013
 
Argument to use Both Statistical and Graphical Evaluation Techniques in Groun...
Argument to use Both Statistical and Graphical Evaluation Techniques in Groun...Argument to use Both Statistical and Graphical Evaluation Techniques in Groun...
Argument to use Both Statistical and Graphical Evaluation Techniques in Groun...
 
G.Arbia, Statistical problems with micro-data geo-masked for confidentiality
G.Arbia,  Statistical problems with micro-data geo-masked for confidentialityG.Arbia,  Statistical problems with micro-data geo-masked for confidentiality
G.Arbia, Statistical problems with micro-data geo-masked for confidentiality
 
COVID-19 (Coronavirus Disease) Outbreak Prediction Using a Susceptible-Expos...
COVID-19 (Coronavirus Disease) Outbreak  Prediction Using a Susceptible-Expos...COVID-19 (Coronavirus Disease) Outbreak  Prediction Using a Susceptible-Expos...
COVID-19 (Coronavirus Disease) Outbreak Prediction Using a Susceptible-Expos...
 
GDRR Opening Workshop - Modeling Approaches for High-Frequency Financial Time...
GDRR Opening Workshop - Modeling Approaches for High-Frequency Financial Time...GDRR Opening Workshop - Modeling Approaches for High-Frequency Financial Time...
GDRR Opening Workshop - Modeling Approaches for High-Frequency Financial Time...
 
presentationIDC - 14MAY2015
presentationIDC - 14MAY2015presentationIDC - 14MAY2015
presentationIDC - 14MAY2015
 
FPP 1. Getting started
FPP 1. Getting startedFPP 1. Getting started
FPP 1. Getting started
 
Af03301980202
Af03301980202Af03301980202
Af03301980202
 
Some sampling techniques for big data analysis
Some sampling techniques for big data analysisSome sampling techniques for big data analysis
Some sampling techniques for big data analysis
 
Assessing Normality
Assessing NormalityAssessing Normality
Assessing Normality
 
Data science
Data scienceData science
Data science
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine Learning
 
3. Statistical inference_anesthesia.pptx
3.  Statistical inference_anesthesia.pptx3.  Statistical inference_anesthesia.pptx
3. Statistical inference_anesthesia.pptx
 
Introduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesIntroduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resources
 
Forest Change Detection in incomplete satellite images with deep neural networks
Forest Change Detection in incomplete satellite images with deep neural networksForest Change Detection in incomplete satellite images with deep neural networks
Forest Change Detection in incomplete satellite images with deep neural networks
 
Statistical inference: Estimation
Statistical inference: EstimationStatistical inference: Estimation
Statistical inference: Estimation
 
Descriptive Analytics: Data Reduction
 Descriptive Analytics: Data Reduction Descriptive Analytics: Data Reduction
Descriptive Analytics: Data Reduction
 

More from The Statistical and Applied Mathematical Sciences Institute

More from The Statistical and Applied Mathematical Sciences Institute (20)

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
 
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
 
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
 
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
 
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
 
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
 
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
 
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
 
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
 
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
 
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
 
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
 
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
 
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
 
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 

Recently uploaded

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 

Recently uploaded (20)

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 

GDRR Opening Workshop - Gradient Boosting Trees for Spatial Data Prediction - Bo Li, August 7, 2019

  • 1. Gradient Boosting Trees for Spatial Data with Application to Public Health Bo Li Aug 7, 2019 University of Illinois at Urbana-Champaign SAMSI GDRR Workshop, Raleigh, NC
  • 2. Acknowledgement Collaborators: • Peng Wang (University of Cincinnati) • Yunzhang Zhu (Ohio State University) Support: NSF-AGS-1602845; NSF-DMS-1830312 1
  • 3. Spatial data Figure 1: One example from the google search for spatial data 2
  • 4. Introduction • Spatial and Spatiotemporal data is prevalent in many different research areas including climatology, agriculture, public health and various environmental studies. • Making prediction over a spatial domain or of the future is a typical problem of interest. For example, predict the rainfall at unobserved locations, or predict HIV rate at county level for the next year • The covariance structure of the spatial and spatiotemporal processes plays a central role in prediction. • Classical geostatistical methods usually assume some parametric models for the underlying process. 3
  • 5. Introduction. Consider a random field $\{Y(s) : s \in D\}$, where $D$ is the domain of interest in $d$-dimensional Euclidean space $\mathbb{R}^d$. A common framework for modeling this stochastic process is $Y(s) = \mu(s) + Z(s) + \epsilon(s)$, where $\mu(s)$ is a function of known covariates $x(s) = (x_1(s), \ldots, x_p(s))^T$, often taking the form $\mu(s) = x(s)^T \beta$, and $\epsilon(s)$ is white noise. Much of the literature focuses on modeling the covariance of the zero-mean random process $Z(s)$.
  • 6. Introduction. Common assumptions made on $Z(s)$:
      • Second-order stationarity
      • $Z(s)$ follows a Gaussian random process
      • The covariance structure of $Z(s)$ follows a parametric model
      • Nonparametric and nonstationary models are more robust and flexible, but they still rest on certain assumptions about the distribution and covariance structure
    We would like to try a completely assumption-free method for making predictions.
  • 7. Gradient Boosting Machine
      • A machine learning technique that ensembles weak learners, usually small decision trees, to produce a strong, accurate prediction model.
      • The best performer in Olson et al. (2018), which compared 13 machine learning approaches on 165 data sets.
      • On the data analytics competition platform Kaggle, gradient boosting is the winning algorithm for almost every structured-data competition (Usmani, 2017).
      • Probably the most widely applied machine learning approach in industry for structured data.
  • 8. Gradient Boosting Machine. Given observations $(y_i, x_i)$, $i = 1, \ldots, n$, gradient boosting trees fit a nonparametric additive model $F(x_i) = \sum_{m=1}^{M} \nu h_m(x_i)$ by minimizing the loss function $\sum_{i=1}^{n} l(y_i, F_m(x_i))$.
      • Each $h_m(x_i)$ often comes from a decision tree, and the gradient boosting algorithm (Friedman, 2001) fits $h_m(x)$ sequentially.
      • The constant $\nu$ is a shrinkage factor that controls the learning rate and prevents overfitting.
      • The fitted model can be interpreted as a nonparametric additive model.
  • 9. Gradient Boosting Machine. At each step $m$, we first generate a pseudo-response $\tilde{y}_i$ by $\tilde{y}_i = -\left[ \frac{\partial l(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}$, where $F_{m-1}(x) = \sum_{m'=1}^{m-1} \nu h_{m'}(x)$.
      • Then $h_m(x_i)$ is fit based on $(\tilde{y}_i, x_i)$, $i = 1, \ldots, n$, by finding the tree $h_m(x)$ that minimizes $\sum_{i=1}^{n} l(\tilde{y}_i, h_m(x_i))$.
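The boosting loop on slides 8 and 9 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the speaker's implementation: it assumes the half squared-error loss $l(y, F) = (y - F)^2 / 2$, so the pseudo-response is simply the residual $y_i - F_{m-1}(x_i)$, and it uses a one-split regression stump on a single predictor as the weak learner. The names `fit_stump` and `gradient_boost` are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on one predictor by least squares."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: fall back to a constant fit
        const = mean(ys)
        return lambda x: const
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, nu=0.1):
    """Return F(x) = sum_m nu * h_m(x), fitted sequentially (assumed loss: (y - F)^2 / 2)."""
    learners = []
    current = [0.0] * len(xs)  # F_0(x) = 0
    for _ in range(n_rounds):
        # Pseudo-response: negative gradient of the assumed loss at F_{m-1}, i.e. the residual.
        pseudo = [y - f for y, f in zip(ys, current)]
        h = fit_stump(xs, pseudo)
        learners.append(h)
        current = [f + nu * h(x) for f, x in zip(current, xs)]
    return lambda x: sum(nu * h(x) for h in learners)


# Toy data with a single jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(xs, ys, n_rounds=50, nu=0.1)
print(round(model(2), 2), round(model(5), 2))
```

With a small shrinkage factor each round removes only a fraction of the remaining residual, which is why many rounds are needed before the fit approaches the two plateau values.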
  • 10. Decision Tree
      • A tree-based model is a nonparametric, locally constant prediction model estimated by recursively partitioning the predictors.
      • The most commonly used algorithm is probably the classification and regression tree (CART).
      • Traditionally, the split is determined mainly by the predictors.
      • However, spatial data have the distinctive feature of spatial correlation, which should be taken into account.
      • How can spatial information be integrated into a decision tree?
  • 11. Spatial Gradient Boosting Tree
      • First estimate spatial clusters based on the dependency structure of the data.
      • Take the average of the observations within each cluster and use it as the cluster-observation.
      • Treat the cluster-observations as another covariate or predictor, so that they contribute to the splitting rule.
      • Once the cluster-observations are generated, any appropriate tree-fitting algorithm can be adopted.
  • 12. Spatial Gradient Boosting Tree
      • To cluster the observations, we use convex clustering over an undirected graph (Qian, 2019).
      • Consider each location or region as a node of a graph.
      • We partition the locations using convex clustering over the given graph by considering the optimization criterion $\sum_{i=1}^{n} l(y_i, \mu_i) + \lambda \sum_{(i,j) \in E} |\mu_i - \mu_j|$ (1), where $E$ is the set of all edges in the graph and $l(y_i, \mu_i) = (y_i - \mu_i)^2$ is the least squares loss.
      • Use a variant of the alternating direction method of multipliers (ADMM) algorithm for the optimization.
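To make criterion (1) and the cluster-observations of slide 11 concrete, the sketch below evaluates the objective on a toy chain graph, reads clusters off a candidate solution, and averages the observations within each cluster. It does not implement the ADMM solver mentioned on the slide. The candidate vector `fused`, the tolerance `tol`, and all function names are assumptions made for illustration, and `fused` is not claimed to be the exact minimizer.

```python
def objective(y, mu, edges, lam):
    """Criterion (1): squared-error fit plus lam times the graph fusion penalty."""
    fit = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    penalty = sum(abs(mu[i] - mu[j]) for i, j in edges)
    return fit + lam * penalty


def clusters_from_mu(mu, edges, tol=1e-8):
    """Label nodes by connected components of edges whose endpoints share the same mu."""
    parent = list(range(len(mu)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in edges:
        if abs(mu[i] - mu[j]) <= tol:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(mu))]


def cluster_observations(y, labels):
    """Replace each observation by the average of its cluster (the cluster-observation)."""
    sums, counts = {}, {}
    for yi, lab in zip(y, labels):
        sums[lab] = sums.get(lab, 0.0) + yi
        counts[lab] = counts.get(lab, 0) + 1
    return [sums[lab] / counts[lab] for lab in labels]


# Five locations on a chain graph; values are made up for illustration.
y = [1.0, 1.2, 0.8, 5.0, 5.2]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
fused = [1.0, 1.0, 1.0, 5.1, 5.1]  # hypothetical candidate with two fused groups

print(objective(y, y, edges, lam=1.0), objective(y, fused, edges, lam=1.0))
labels = clusters_from_mu(fused, edges)
mu_hat = cluster_observations(y, labels)
print(labels, mu_hat)
```

For this value of the penalty weight, the fused candidate has a lower objective than leaving every location in its own cluster, which is the behaviour the fusion penalty is meant to encourage.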
  • 13. Spatial clustering. Figure 2: Illustration of clustering spatial observations.
  • 14. Spatial clustering. Figure 3: Illustration of clustering spatial observations.
  • 15. Decision tree. Let $\hat{\mu}_i$ be the cluster-observation. Grow a decision tree using $y_i$ as the response and $Z_i = (x_i, \hat{\mu}_i)$ as the predictors, with the following recursive partitioning algorithm:
      a. Start at the root node, which contains all the samples.
      b. At each node, find the best split for each predictor $j$ in $Z_i$ by minimizing $L_j = \sum_{i \in R_1(j)} l(y_i, \hat{y}_{R_1(j)}) + \sum_{i \in R_2(j)} l(y_i, \hat{y}_{R_2(j)}) + \sum_{i \in R_{NA}(j)} l(y_i, \hat{y}_{R_{NA}(j)})$.
      c. Find $j^* = \arg\min_j L_j$ and use $R_1(j^*)$, $R_2(j^*)$ and $R_{NA}(j^*)$ to split the node into three child nodes.
      d. Repeat b and c until a stopping criterion is reached.
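Steps b and c of the partitioning algorithm can be sketched as follows. The sketch assumes squared-error loss, assumes that $R_{NA}(j)$ collects the observations whose predictor $j$ is missing (encoded here as `None`), and takes the prediction in each region to be the region mean. None of these details is stated on the slide, and the function names are hypothetical.

```python
def sse(values):
    """Squared-error loss of a region around its own mean (0 for an empty region)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split_for_predictor(column, y):
    """Step b: return (L_j, threshold) for one predictor; None marks a missing value."""
    na_loss = sse([yi for z, yi in zip(column, y) if z is None])  # region R_NA(j)
    observed = sorted({z for z in column if z is not None})
    best = (float("inf"), None)
    for lo, hi in zip(observed, observed[1:]):
        t = (lo + hi) / 2
        r1 = [yi for z, yi in zip(column, y) if z is not None and z <= t]
        r2 = [yi for z, yi in zip(column, y) if z is not None and z > t]
        loss = sse(r1) + sse(r2) + na_loss
        if loss < best[0]:
            best = (loss, t)
    return best


def best_split(Z, y):
    """Step c: pick the predictor j* with the smallest L_j; returns (j*, threshold, L_j*)."""
    results = []
    for j in range(len(Z[0])):
        column = [row[j] for row in Z]
        loss, t = best_split_for_predictor(column, y)
        results.append((loss, j, t))
    loss, j_star, t = min(results, key=lambda r: (r[0], r[1]))
    return j_star, t, loss


# Each row is Z_i = (x_i, mu_hat_i); x is uninformative here and has one missing value.
Z = [(0.3, 1.0), (0.9, 1.0), (0.5, 1.0), (0.4, 5.0), (None, 5.0), (0.8, 5.0)]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
j_star, threshold, loss = best_split(Z, y)
print(j_star, threshold, round(loss, 4))
```

In this toy example the cluster-observation column separates the two groups cleanly, so it is selected as the splitting variable, which is how the spatial information enters the tree.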
  • 16. Predicting Aedes aegypti Appearance. Figure 4: Aedes aegypti, also known as the yellow fever mosquito. A mosquito that can carry and spread multiple viruses, including dengue fever, chikungunya, Zika fever, Mayaro and yellow fever viruses.
  • 17. Predicting Aedes aegypti Appearance
      • The yellow fever mosquito is present across all regions of the United States and can persist for multiple years, in some cases indefinitely.
      • Once it appears, it tends to stay around until winter.
      • It is important to predict whether and when it will appear in spring or early summer.
      • The spatial-temporal dynamics of its distribution are not well understood, so predicting its presence remains a challenging problem.
  • 18. Predicting Aedes aegypti Appearance
      • Goal: predict Aedes aegypti appearances in March, April and May for California counties.
      • Use the 2017 data as the training data.
      • Response: abundances generated from data.
      • Covariates: abundances of previous months, mean daily minimum temperature.
      • Challenges: small sample size, missing values.
  • 19. Predicting Aedes aegypti Appearance. Table 1: Prediction results for presence of Aedes aegypti in three months of 2018 for 41 California counties. Columns give the predicted status under each method.

                        Spatial GBM         Regular GBM         Naive*
      Month  Truth      Present  Absent     Present  Absent     Present  Absent
      Mar    Present    5        0          5        0          4        1
      Mar    Absent     2        34         4        32         0        33
      Apr    Present    8        1          5        4          5        4
      Apr    Absent     1        31         0        32         0        32
      May    Present    9        2          9        2          9        2
      May    Absent     0        30         0        30         0        30

    *Three counties have missing values for February 2018 after imputation.
  • 20. Predicting Aedes aegypti Appearance
      • In all three months, the regular GBM performs almost identically to the naive predictor, which simply uses the data from the previous month.
      • For both March and April, when mosquitoes start to appear, the proposed spatial GBM method holds an advantage over the other approaches.
      • Spatial GBM successfully identified 8 of the 9 counties where Aedes aegypti starts to emerge in April by incorporating spatial information into the model, whereas the other methods identified only 5.
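The comparison above can be checked directly from the counts in Table 1. The short script below encodes each cell as (true positives, false negatives, false positives, true negatives), taking "present" as the positive class, and derives sensitivity, specificity and accuracy. The dictionary layout and the function name `rates` are illustrative choices, not part of the talk.

```python
# (TP, FN, FP, TN) per month and method, copied from Table 1 ("present" = positive).
table1 = {
    "Mar": {"Spatial GBM": (5, 0, 2, 34), "Regular GBM": (5, 0, 4, 32), "Naive": (4, 1, 0, 33)},
    "Apr": {"Spatial GBM": (8, 1, 1, 31), "Regular GBM": (5, 4, 0, 32), "Naive": (5, 4, 0, 32)},
    "May": {"Spatial GBM": (9, 2, 0, 30), "Regular GBM": (9, 2, 0, 30), "Naive": (9, 2, 0, 30)},
}


def rates(tp, fn, fp, tn):
    """Summary rates for one 2x2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


for month, methods in table1.items():
    for method, counts in methods.items():
        r = rates(*counts)
        print(month, method, {name: round(value, 3) for name, value in r.items()})
```

The April sensitivities computed this way are 8/9 for spatial GBM and 5/9 for the other two methods, matching the statement on the slide.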
  • 21. HIV new diagnosis prediction
      • New HIV diagnosis rates for all US counties from 2008 to 2015.
      • Data available at AIDSVU.org.
      • Shand et al. (2018) proposed a spatially varying autoregressive (SVAR) model to make one-year-ahead predictions.
      • We use the 2008-2014 data to fit the model and predict the 2015 HIV rate.
      • Compare our results to:
          • Regular gradient boosting, which does not take spatial information into account
          • The SVAR model in Shand et al. (2018) for California, Florida and the New England states
  • 22. HIV new diagnosis prediction. Data are suppressed if below 5 per 100,000 people. We impute the missing data according to the following rule:

    $\tilde{y}_i = \begin{cases} 0, & \text{if } y_j = \text{NA for every } j \in C_i \text{ and } y_k = \text{NA for every } k \in C_j,\ j \in C_i, \\ \text{Rpoisson}(0.5), & \text{if } y_j = \text{NA for every } j \in C_i \text{ but } y_k \neq \text{NA for some } k \in C_j,\ j \in C_i, \\ \text{Rpoisson}(\bar{y}_{C_i}), & \text{if } y_j \neq \text{NA for some } j \in C_i, \end{cases}$

    where Rpoisson(0.5) is a random integer between 0 and 4 generated using a normalized Poisson(0.5), $\bar{y}_{C_i} = \frac{\sum_{j \in C_i,\ y_j \neq \text{NA}} y_j}{|\{j \in C_i : y_j \neq \text{NA}\}|}$, and Rpoisson($\bar{y}_{C_i}$) is generated using a truncated Poisson($\bar{y}_{C_i}$).
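A minimal sketch of this imputation rule follows, under two assumptions that the slide does not spell out: $C_i$ is taken to be the set of counties adjacent to county $i$, and the truncated Poisson draw for the third case is restricted to the same 0 to 4 range as the first Poisson draw. The function names, the toy adjacency structure and the rate values are all hypothetical.

```python
import math
import random


def truncated_poisson(rate, rng, upper=4):
    """Draw an integer in 0..upper with probabilities proportional to Poisson(rate)."""
    support = list(range(upper + 1))
    # The factor exp(-rate) cancels after normalization, so it is omitted.
    weights = [rate ** k / math.factorial(k) for k in support]
    return rng.choices(support, weights=weights, k=1)[0]


def impute(i, y, neighbors, rng):
    """Impute the suppressed value of county i; None marks a suppressed (NA) value."""
    first = neighbors[i]  # assumed meaning of C_i: counties adjacent to i
    observed = [y[j] for j in first if y[j] is not None]
    if observed:  # some neighbor is observed
        return truncated_poisson(sum(observed) / len(observed), rng)
    second = [k for j in first for k in neighbors[j]]
    if any(y[k] is not None for k in second):  # some neighbor of a neighbor is observed
        return truncated_poisson(0.5, rng)
    return 0  # everything within two steps is suppressed


# Toy chain of four counties a - b - c - d; only d has an observed rate.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
y = {"a": None, "b": None, "c": None, "d": 12.0}
rng = random.Random(0)

imputed = {i: impute(i, y, neighbors, rng) for i in y if y[i] is None}
print(imputed)
```

County a falls in the first case of the rule, b in the second and c in the third, so the three branches are each exercised once.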
  • 23. Performance comparison over all counties in the US. Rows give the true rate class and columns the predicted class.

    Prediction with regular GBM, MSE = 233

      Truth        (0, 5]   (5, 30]   (30, 80]   (80, inf)
      (0, 5]       2150     199       0          0
      (5, 30]      98       461       0          0
      (30, 80]     16       62        4          0
      (80, inf)    1        1         1          0

    Prediction with spatial GBM, MSE = 232

      Truth        (0, 5]   (5, 30]   (30, 80]   (80, inf)
      (0, 5]       2113     226       10         0
      (5, 30]      86       468       5          0
      (30, 80]     15       39        28         0
      (80, inf)    1        0         2          0

    The spatial model does much better on the high rates of HIV diagnosis!
  • 25. Comparison to SVAR model for all counties

    Table 2: Comparison between three models in terms of misclassification rate for three states

      State          SVAR Model   Spatial GBM   Regular GBM
      California     1            0.586         0.586
      Florida        1            0.895         0.776
      New England    0.980        0.592         0.586

    Table 3: Comparison between three models in terms of MSE for three states

      State          SVAR Model   Spatial GBM   Regular GBM
      California     76.18        13.17         12.82
      Florida        266.26       95.11         116.66
      New England    47.94        28.78         27.34
  • 26. Conclusion
      • We propose a spatial gradient boosting tree to make predictions for spatial data.
      • It is a statistical machine learning approach that can nonparametrically incorporate spatial information through spatial clusters.
      • A proof of concept: it is not necessary to look at the second moments.
      • Future work: take the temporal correlation in spatial-temporal data more seriously.