In recent years, statistical machine learning approaches have become extremely popular, largely due to their superior predictive performance. Among commonly used machine learning tools, gradient boosting trees are often the favored vehicle for many practitioners. On the popular data analytics competition platform Kaggle, gradient boosting is the winning algorithm for almost every structured-data competition. Besides superior prediction performance, gradient boosting trees also enjoy the interpretability of a non-parametric additive model, and their fitting algorithm can be parallelized. In this project, we extend this powerful machine learning technique to the realm of spatial data analysis. The proposed approach does not require any parametric assumption on spatial correlations and enjoys all the advantages of gradient boosting. We illustrate the potential of the method with an application to predicting new HIV diagnosis rates for all counties of the United States.
GDRR Opening Workshop - Gradient Boosting Trees for Spatial Data Prediction - Bo Li, August 7, 2019
1. Gradient Boosting Trees for Spatial Data with
Application to Public Health
Bo Li
Aug 7, 2019
University of Illinois at Urbana-Champaign
SAMSI GDRR Workshop, Raleigh, NC
4. Introduction
• Spatial and spatiotemporal data are prevalent in many research
areas, including climatology, agriculture, public health, and various
environmental studies.
• Making predictions over a spatial domain or into the future is a typical
problem of interest: for example, predicting rainfall at unobserved
locations, or predicting the county-level HIV rate for the next year.
• The covariance structure of the spatial and spatiotemporal processes
plays a central role in prediction.
• Classical geostatistical methods usually assume a parametric
model for the underlying process.
5. Introduction
Consider a random field $\{Y(s) : s \in D\}$, where $D$ is the domain of interest in
$d$-dimensional Euclidean space $\mathbb{R}^d$.
A common framework for modeling this stochastic process is
$$Y(s) = \mu(s) + Z(s) + \epsilon(s),$$
where $\mu(s)$ is a function of known covariates $x(s) = (x_1(s), \dots, x_p(s))^T$,
often taking the form $\mu(s) = x(s)^T \beta$, and $\epsilon(s)$ is white noise.
Much of the literature focuses on modeling the covariance of the zero-mean
random process $Z(s)$.
6. Introduction
Common assumptions made on Z(s):
• Second order stationarity
• Z(s) follows a Gaussian random process
• The covariance structure of Z(s) follows a parametric model
• Nonparametric and nonstationary models are more robust and
flexible, but are still based on certain assumptions about the
distribution and covariance structure
We would like to try a completely assumption-free method for making
predictions
7. Gradient Boosting Machine
• A machine learning technique that ensembles weak learners, usually
small decision trees, to produce a strong, accurate prediction model.
• The winner of the benchmark in Olsen et al. (2018), which compared 13
machine learning approaches on 165 data sets.
• On the data analytics competition platform Kaggle, gradient boosting is
the winning algorithm for almost every structured-data competition
(Usmani, 2017).
• Probably the most widely applied machine learning approach in industry
for structured data.
8. Gradient Boosting Machine
Given observations $(y_i, x_i)$, $i = 1, \dots, n$, gradient boosting trees fit a
non-parametric additive model
$$F(x_i) = \sum_{m=1}^{M} \nu h_m(x_i)$$
by minimizing the loss function $\sum_{i=1}^{n} l(y_i, F(x_i))$.
• Each $h_m(x_i)$ often comes from a decision tree, and the gradient
boosting algorithm (Friedman, 2001) fits $h_m(x)$ sequentially.
• The constant $\nu$ is a shrinkage factor that controls the learning rate
and prevents overfitting.
• The fit can be interpreted as a nonparametric additive model
9. Gradient Boosting Machine
At each step $m$, we first generate a pseudo-response $\tilde{y}_i$ by
$$\tilde{y}_i = -\left.\frac{\partial l(y_i, F(x_i))}{\partial F(x_i)}\right|_{F(x) = F_{m-1}(x)},$$
where $F_{m-1}(x) = \sum_{m'=1}^{m-1} \nu h_{m'}(x)$.
• Then $h_m(x_i)$ is fit based on $(\tilde{y}_i, x_i)$, $i = 1, \dots, n$, by finding the tree
$h_m(x)$ that minimizes $\sum_{i=1}^{n} l(\tilde{y}_i, h_m(x_i))$.
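Under squared-error loss, the pseudo-response reduces to the ordinary residual $y_i - F_{m-1}(x_i)$, so the algorithm can be sketched in a few lines. The sketch below is illustrative only (depth-1 trees as the weak learner $h_m$; all function names are our own, not from the talk):

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to pseudo-responses r."""
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, c_left, c_right = best
    return lambda z: np.where(z <= s, c_left, c_right)

def gradient_boost(x, y, M=100, nu=0.1):
    """Gradient boosting under squared-error loss l(y, F) = (y - F)^2 / 2,
    for which the pseudo-response is simply the residual y - F_{m-1}(x)."""
    F = np.zeros_like(y, dtype=float)
    trees = []
    for _ in range(M):
        r = y - F                # pseudo-response = negative gradient
        h = fit_stump(x, r)      # fit weak learner h_m to (r_i, x_i)
        F = F + nu * h(x)        # F_m = F_{m-1} + nu * h_m
        trees.append(h)
    return lambda z: nu * sum(h(z) for h in trees)
```

Replacing the stump with a deeper CART tree gives the usual gradient boosting machine; the shrinkage factor $\nu$ trades off per-step progress against overfitting.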
10. Decision Tree
• A tree based model is a non-parametric local constant prediction
model estimated by recursively partitioning the predictors.
• The most commonly used algorithm might be the classification and
regression tree (CART)
• Traditionally, the split was mainly determined by predictors.
• However, spatial data has its unique feature of spatial correlation
which should be taken into account
• How to integrate spatial information into decision tree?
11. Spatial Gradient Boosting Tree
• First estimate spatial clusters based on the data dependency structure
• Take the average of observations within each cluster and use it as the
cluster-observation
• Treat the cluster-observations as another covariate or predictor so they
contribute to the splitting rule
• Once we generate the cluster-observations, we can adopt any
appropriate tree-fitting algorithm
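As a sketch, the cluster-observation step amounts to appending one extra column to the design matrix before any off-the-shelf boosting is run. The helper below is our own illustration and assumes cluster labels have already been estimated:

```python
import numpy as np

def add_cluster_covariate(X, y, labels):
    """Append each location's cluster-observation (the mean response of its
    estimated spatial cluster) as one extra predictor column."""
    cluster_mean = {c: y[labels == c].mean() for c in np.unique(labels)}
    mu_hat = np.array([cluster_mean[c] for c in labels])
    return np.column_stack([X, mu_hat])
```

The augmented matrix can then be handed to any tree-fitting or boosting routine, so the cluster-observation competes with the ordinary covariates in every splitting rule.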
12. Spatial Gradient Boosting Tree
• To cluster the observations, we use convex clustering over an
undirected graph (Qian, 2019).
• Consider each location or region a node of a graph.
• We partition the locations using convex clustering over the given
graph by considering the following optimization criterion:
$$\sum_{i=1}^{n} l(y_i, \mu_i) + \lambda \sum_{(i,j) \in E} |\mu_i - \mu_j|, \qquad (1)$$
where $E$ is the set of all edges in the graph, and
$l(y_i, \mu_i) = (y_i - \mu_i)^2$ is the least-squares loss
• Use a variant of the alternating direction method of multipliers
(ADMM) algorithm for the optimization
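For concreteness, the convex clustering criterion with least-squares loss can be evaluated as follows; this is only an illustrative helper with hypothetical names (the actual ADMM solver is considerably more involved):

```python
import numpy as np

def convex_clustering_objective(y, mu, edges, lam):
    """Least-squares fit plus an l1 fusion penalty over the edges of the
    spatial graph; a larger lam fuses more of the mu_i together."""
    fit = ((y - mu) ** 2).sum()
    fusion = sum(abs(mu[i] - mu[j]) for (i, j) in edges)
    return fit + lam * fusion
```

Because the penalty is an l1 norm on pairwise differences, the minimizer sets many $\mu_i - \mu_j$ exactly to zero; the locations sharing a common $\mu$ value form the estimated spatial clusters.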
15. Decision tree
Let $\hat{\mu}_i$ be the cluster-observation. Grow a decision tree using $y_i$ as the
response and $Z_i = (x_i, \hat{\mu}_i)$ as the predictors, with the following recursive
partitioning algorithm:
a Start at the root node that contains all the samples.
b At each child node, find the best split based on each predictor $Z_i$ by
minimizing
$$L_j = \sum_{i \in R_1(j)} l(y_i, \hat{y}_{R_1(j)}) + \sum_{i \in R_2(j)} l(y_i, \hat{y}_{R_2(j)}) + \sum_{i \in R_{NA}(j)} l(y_i, \hat{y}_{R_{NA}(j)}).$$
c Find $j^* = \arg\min_j L_j$ and use $R_1(j^*)$, $R_2(j^*)$ and $R_{NA}(j^*)$ to split the
node into three child nodes.
d Repeat b and c until a stopping criterion is reached.
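A minimal sketch of step b under squared-error loss, with a third child collecting missing values; the names and the exhaustive threshold search are illustrative simplifications:

```python
import numpy as np

def split_loss(y, groups):
    """Sum of squared errors when each child predicts its own mean."""
    return sum(((y[g] - y[g].mean()) ** 2).sum() for g in groups if g.any())

def best_split(z, y, thresholds):
    """For one predictor z, choose the threshold t minimizing the three-way
    loss over R1 = {z <= t}, R2 = {z > t}, and RNA = {z missing}."""
    na = np.isnan(z)
    best_t, best_L = None, np.inf
    for t in thresholds:
        r1 = (~na) & (z <= t)
        r2 = (~na) & (z > t)
        L = split_loss(y, [r1, r2, na])
        if L < best_L:
            best_t, best_L = t, L
    return best_t, best_L
```

Routing missing values into their own child lets the tree use incomplete predictors directly, without a separate imputation step at fitting time.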
16. Predicting Aedes Aegypti Appearance
Figure 4: Aedes Aegypti, also known as the yellow fever mosquito
A mosquito that can carry and spread multiple viruses, including the
dengue fever, chikungunya, Zika fever, Mayaro, and yellow fever viruses.
17. Predicting Aedes Aegypti Appearance
• The yellow fever mosquito is present across all regions of the United
States and can persist for multiple years, in some cases indefinitely
• Once it appears, it tends to stay around until winter
• It is important to predict when and whether it will appear in spring
or early summer
• The spatial-temporal dynamics of its distribution are not well
understood; therefore, prediction of its presence remains a
challenging problem.
18. Predicting Aedes Aegypti Appearance
• Goal: predict Aedes Aegypti appearances in March, April, and May
for California counties
• Use data from 2017 as the training data
• Response: abundances generated from the data
• Covariates: abundances of previous months, mean daily minimum temperature
• Challenges: small sample size, missing values.
19. Predicting Aedes Aegypti Appearance
Table 1: Prediction results for presence of Aedes aegypti in three months of
2018 for 41 California counties

                      Spatial GBM       Regular GBM          Naive*
     Truth         Present  Absent    Present  Absent    Present  Absent
Mar  Present          5        0         5        0         4        1
     Absent           2       34         4       32         0       33
Apr  Present          8        1         5        4         5        4
     Absent           1       31         0       32         0       32
May  Present          9        2         9        2         9        2
     Absent           0       30         0       30         0       30

*Three counties have missing values for February 2018 after imputation
20. Predicting Aedes Aegypti Appearance
• In all three cases, the regular GBM has almost identical performance
to the naive predictor, which simply uses the data from the previous
month.
• For both March and April, when mosquitoes start to appear, the
proposed spatial GBM method holds an advantage over the other
approaches.
• By incorporating spatial information, the spatial GBM successfully
identified 8 of the 9 counties where Aedes aegypti started to emerge
in April, whereas the other methods identified only 5.
21. HIV new diagnosis prediction
• New HIV diagnosis rates for all counties in the US from 2008-2015
• Data available at AIDSVU.org
• Shand et al. (2018) proposed a spatially varying autoregressive
(SVAR) model to make one-year-ahead predictions
• We use the 2008-2014 data to fit the model and predict the HIV
rate for 2015
• Compare our results to
• Regular gradient boosting without taking spatial information into
account
• The SVAR model of Shand et al. (2018) for California, Florida, and the
New England states
22. HIV new diagnosis prediction
Data are suppressed if below 5 per 100,000 people.
We impute the missing data according to the following rule:
$$\tilde{y}_i = \begin{cases} 0, & \text{if } y_j = \text{NA for all } j \in C_i \text{ and } y_k = \text{NA for all } k \in C_j,\ j \in C_i, \\ R_{\text{Poisson}(0.5)}, & \text{if } y_j = \text{NA for all } j \in C_i \text{ but } y_k \neq \text{NA for some } k \in C_j,\ j \in C_i, \\ R_{\text{Poisson}(\bar{y}_{C_i})}, & \text{if } y_j \neq \text{NA for some } j \in C_i, \end{cases}$$
where $R_{\text{Poisson}(0.5)}$ is a random integer between 0 and 4 generated using a
normalized Poisson(0.5),
$$\bar{y}_{C_i} = \frac{\sum_{j \in C_i,\, y_j \neq \text{NA}} y_j}{|\{j \in C_i : y_j \neq \text{NA}\}|},$$
and $R_{\text{Poisson}(\bar{y}_{C_i})}$ is generated using a truncated Poisson($\bar{y}_{C_i}$).
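One way a draw restricted to {0, ..., 4} could be implemented is rejection sampling, which is equivalent to normalizing the Poisson mass on that support. This sketch is our own illustration, not the authors' code:

```python
import math
import random

def poisson_draw(lam, rng):
    """One Poisson(lam) draw via Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def truncated_poisson(lam, upper=4, rng=random):
    """Redraw until the value lands in {0, ..., upper}; this is equivalent
    to sampling from a Poisson(lam) normalized to that support."""
    while True:
        k = poisson_draw(lam, rng)
        if k <= upper:
            return k
```

Here upper = 4 mirrors the suppression rule: imputed integers stay strictly below the reporting cutoff of 5.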
23. Performance comparison over all counties in the US

Prediction with regular GBM, MSE = 233

Truth \ Predicted   (0, 5]  (5, 30]  (30, 80]  (80, inf)
(0, 5]                2150      199         0          0
(5, 30]                 98      461         0          0
(30, 80]                16       62         4          0
(80, inf)                1        1         1          0

Prediction with spatial GBM, MSE = 232

Truth \ Predicted   (0, 5]  (5, 30]  (30, 80]  (80, inf)
(0, 5]                2113      226        10          0
(5, 30]                 86      468         5          0
(30, 80]                15       39        28          0
(80, inf)                1        0         2          0

The spatial model does much better on counties with high HIV diagnosis rates!
25. Comparison to the SVAR model for three states
Table 2: Comparison between three models in terms of misclassification rate
for three states

               SVAR Model   Spatial GBM   Regular GBM
California        1            0.586         0.586
Florida           1            0.895         0.776
New England       0.980        0.592         0.586

Table 3: Comparison between three models in terms of MSE for three states

               SVAR Model   Spatial GBM   Regular GBM
California       76.18        13.17         12.82
Florida         266.26        95.11        116.66
New England      47.94        28.78         27.34
26. Conclusion
• We propose a spatial gradient boosting tree to make predictions for
spatial data
• A statistical machine learning approach that can non-parametrically
incorporate spatial information through spatial clusters
• A proof of concept: we need not necessarily look at the second moments
• Future work: take the temporal correlation in spatial-temporal data
more seriously