Ensemble models and
Gradient Boosting, part 1.
Leonardo Auslender
Independent Statistical Consultant
Leonardo ‘dot’ Auslender ‘at’
Gmail ‘dot’ com.
Copyright 2018.
Topics to cover:
1) Why more techniques? Bias-variance tradeoff.
2) Ensembles
   1) Bagging – stacking
   2) Random Forests
3) Gradient Boosting (GB)
4) Gradient-descent optimization method.
5) Innards of GB and example.
6) Overall Ensembles.
7) Model Interpretation: Partial Dependency Plots (PDP)
8) Case Studies: a. GB different parameters, b. raw data vs 50/50.
9) Xgboost
10) On the practice of Ensembles.
11) References.
1) Why more techniques? Bias-variance tradeoff.
(Broken clock is right twice a day, variance of estimation = 0, bias extremely high.
Thermometer is accurate overall, but reports higher/lower temperatures at night. Unbiased,
higher variance. Betting on same horse always has zero variance, possibly extremely biased).
Model error can be broken down into three components mathematically: for squared-error loss, Error = Bias² + Variance + irreducible error. Let f be the estimating function and f-hat the empirically derived function.
(Quadrant illustration: bet on the right horse and win; bet on the wrong horse and lose; bet on many horses and win; bet on many horses and lose.)
Credit : Scott Fortmann-Roe (web)
Let X1, X2, X3, … be i.i.d. random variables with mean μ and variance σ².
Well known that E(X̄) = μ and Var(X̄) = σ²/n.
➔ By just averaging estimates, we lower the variance while leaving the bias aspects unchanged (see the simulation sketch below).
➔ Let us find methods that lower or stabilize the variance (at least) while keeping the bias low. And maybe also lower the bias.
➔ And since this cannot be fully attained, we keep searching for more techniques.
➔ Minimize a general objective function:
Obj(Θ) = L(Θ) + Ω(Θ), where Θ = {w1, …, wp} = set of model parameters,
L(Θ) = loss function, minimized to reduce bias,
Ω(Θ) = regularization term, minimizing model complexity.
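A minimal numpy sketch (my illustration, not from the deck) of Var(X̄) = σ²/n: averaging n independent estimates shrinks the variance by roughly a factor of n while the expected value stays put.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 50        # true mean, true sd, # of estimates averaged
reps = 20_000                      # Monte Carlo replications

x = rng.normal(mu, sigma, size=(reps, n))
single = x[:, 0]                   # one estimate per replication
averaged = x.mean(axis=1)          # average of n estimates per replication

print(single.mean(), single.var())       # ~ mu, ~ sigma**2
print(averaged.mean(), averaged.var())   # ~ mu, ~ sigma**2 / n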
Always use many methods and compare results, or?
Present practice is to use many methods, compare results
and select champion model. Issues: estimation time,
difficulty in comparing and interpreting results, and method
for choosing champion.
In epidemiological studies, Christodoulou (2019) compared
many studies that used Logistic Regression (LR) and
Machine Learning (ML) Methods, and concluded that there is
no superior performance of ML over LR.
ML methods are typically more difficult to interpret.
Some terminology for Model combinations.
Ensembles: general name
Prediction/forecast combination: focusing on just
outcomes
Model combination for parameters:
Bayesian parameter averaging
We focus on ensembles as Prediction/forecast
combinations.
Ensembles.
Bagging (bootstrap aggregation, Breiman, 1996): adding randomness ➔ improves function estimation. A variance-reduction technique, reducing MSE. Let the initial data size be n.
1) Construct bootstrap sample by randomly drawing n times with replacement
(note, some observations repeated).
2) Compute sample estimator (logistic or regression, tree, ANN … Tree in
practice).
3) Redo B times, B large (50 – 100 or more in practice, but unknown).
4) Bagged estimator. For classification, Breiman recommends majority vote of
classification for each observation. Buhlmann (2003) recommends averaging
bootstrapped probabilities. Note that individual obs may not appear B times
each.
NB: Independent sequence of trees. What if …….?
Reduces prediction error by lowering variance of aggregated predictor
while maintaining bias almost constant (variance/bias trade-off).
Friedman (1998) reconsidered boosting and bagging in terms of
gradient descent algorithms, seen later on.
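A minimal sketch of steps 1–4 (my illustration, not the deck's code), assuming a binary 0/1 target with both classes present in every bootstrap sample; trees are the base estimator and the bootstrapped probabilities are averaged, as Buhlmann recommends. The name bagged_probabilities is mine.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_probabilities(X, y, X_new, B=100, seed=0):
    # Steps 1-4: draw B bootstrap samples, fit a tree on each,
    # and average the bootstrapped event probabilities.
    rng = np.random.default_rng(seed)
    n = len(y)
    probs = np.zeros(len(X_new))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                     # 1) n draws with replacement
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])  # 2) sample estimator (a tree)
        probs += tree.predict_proba(X_new)[:, 1]             # probability of the event class
    return probs / B                                         # 4) average instead of majority vote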
From http://dni-institute.in/blogs/bagging-algorithm-concepts-with-example/
Ensembles
Evaluation:
Empirical studies: boosting (seen later) yields smaller misclassification rates compared to bagging, with a reduction of both bias and variance. Different boosting algorithms exist (Breiman’s arc-x4 and arc-gv). In cases with substantial noise, bagging performs better. Especially used in clinical studies.
Why does Bagging work?
Breiman: bagging successful because reduces instability of
prediction method. Unstable: small perturbations in data ➔ large
changes in predictor. Experimental results show variance
reduction. Studies suggest that bagging performs some
smoothing on the estimates. Grandvalet (2004) argues that
bootstrap sampling equalizes effects of highly influential
observations.
Disadvantage: cannot be visualized easily.
Ensembles
Adaptive Bagging (Breiman, 2001): Mixes bias-reduction boosting
with variance-reduction bagging. Uses out-of-bag obs to halt
optimizer.
Stacking:
Previously, same technique used throughout. Stacking (Wolpert 1992)
combines different algorithms on single data set. Voting is then
used for final classification. Ting and Witten (1999) “stack” the
probability distributions (PD) instead.
Stacking is “meta-classifier”: combines methods.
Pros: takes the best from many methods. Cons: uninterpretable; the mixture of methods becomes a black box of predictions.
Stacking very prevalent in WEKA.
5.3) Tree World.
5.3.1) L. Breiman: Bagging.
2.2) L. Breiman: Random Forests
Explanation by way of football example for The Saints. ***
https://gormanalysis.com/random-forest-from-top-to-bottom/
Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin  SaintsWon
1 Falcons 28 TRUE TRUE TRUE TRUE
2 Cowgirls 16 TRUE TRUE TRUE TRUE
3 Eagles 30 FALSE FALSE TRUE TRUE
4 Bucs 6 TRUE FALSE TRUE FALSE
5 Bucs 14 TRUE FALSE FALSE FALSE
6 Panthers 9 FALSE TRUE TRUE FALSE
7 Panthers 18 FALSE FALSE FALSE FALSE
Goal: predict when the Saints will win. 5 predictors: Opponent, opponent rank, home game, Expert1 and Expert2 predictions. If we run a single tree, it makes just one split, on Opponent, because the Saints lost to the Bucs and Panthers, giving perfect separation, but that split is useless for future opponents. Instead, at each step randomly select a subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which, when combined, should form a smart model.
3 example trees (Tree1, Tree2, Tree3), each grown on a random feature subset, with splits such as: OppRank <= 15 ((<=) left, (>) right); Opponent in {Cowgirls, Eagles, Falcons}; Expert2 prediction (F = left, T = right); OppRank <= 12.5 ((<=) left, (>) right).
Table of Opponent by SaintsWon
Opponent SaintsWon
Frequency|FALSE |TRUE | Total
---------+--------+--------+
BUCS | 2 | 0 | 2
---------+--------+--------+
COWGIRLS | 0 | 1 | 1
---------+--------+--------+
EAGLES | 0 | 1 | 1
---------+--------+--------+
FALCONS | 0 | 1 | 1
---------+--------+--------+
PANTHERS | 2 | 0 | 2
---------+--------+--------+
Total 4 3 7
Notice the perfect separation of Opponent and SaintsWon (the target in the model). The model would be a perfect fit with just Opponent, but useless for other opponent teams.
Assume following test data and predictions:
Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin
1 Falcons 1 TRUE TRUE TRUE
2 Falcons 32 TRUE TRUE FALSE
3 Falcons 32 TRUE FALSE TRUE
Obs  Tree1  Tree2  Tree3  MajorityVote
Sample1 FALSE FALSE TRUE FALSE
Sample2 TRUE FALSE TRUE TRUE
Sample3 TRUE TRUE TRUE TRUE
Test data (first table) and predictions (second table).
Note that a probability can be ascribed by counting the # of votes for each predicted target class, which yields a good ranking of probabilities for the different classes. But problem: if “OppRk” (2nd best predictor) is in the initial group of 3 with “Opponent”, it won’t be used as splitter because “Opponent” is perfect. Note that there are 10 ways to choose 3 out of 5, and each predictor appears in 6 of them ➔
“Opponent” dominates 60% of the trees, while OppRk appears without “Opponent” in just 30% of the trees. Could mitigate this effect by also sampling the training obs used to develop the model, giving OppRk a higher chance to be the root (not shown).
Further, assume that Expert2 gives perfect predictions when Saints
lose (not when they win). Right now, Expert2 as predictor is lost, but if
resampling is with replacement, higher chance to use Expert2 as
predictor because more losses might just appear.
Summary:
Data with N rows and p predictors:
1) Determine # of trees to grow.
2) For each tree
Randomly sample n <= N rows with replacement.
Create tree with m <= p predictors selected randomly at each non-
final node.
Combine different tree predictions by majority voting (classification
trees) or averaging (regression trees). Note that voting can be
replaced by average of probabilities, and averaging by medians.
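A minimal sketch of this summary (my illustration, not the deck's code), assuming a binary 0/1 target: rows are bootstrapped and a random subset of m predictors is considered at each node via scikit-learn's max_features; the function name simple_random_forest is mine.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_random_forest(X, y, X_new, n_trees=100, m=3, seed=0):
    # Rows sampled with replacement; at each node only a random subset
    # of m predictors is considered (max_features=m).
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = np.zeros(len(X_new))
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                                # bootstrap the rows
        tree = DecisionTreeClassifier(max_features=m).fit(X[idx], y[idx])
        votes += tree.predict(X_new)                                    # accumulate 0/1 votes
    return (votes / n_trees > 0.5).astype(int)                          # majority vote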
Definition of Random Forests.
Decision Tree Forest: an ensemble (collection) of decision trees whose predictions are combined to make the overall prediction for the forest.
Similar to TreeBoost (Gradient boosting) model because large number of
trees are grown. However, TreeBoost generates series of trees with
output of one tree going into next tree in series. In contrast, decision
tree forest grows number of independent trees in parallel, and they do not
interact until after all of them have been built.
Disadvantage: complex model, cannot be visualized like single tree. More
“black box” like neural network ➔ advisable to create both single-tree and
tree forest model.
Single-tree model can be studied to get intuitive understanding of how
predictor variables relate, and decision tree forest model can be used to
score data and generate highly accurate predictions.
Random Forests
1. Random sample of N observations with replacement (“bagging”).
On average, about 2/3 of rows selected. Remaining 1/3 called “out
of bag (OOB)” obs. New random selection is performed for each
tree constructed.
2. Using obs selected in step 1, construct decision tree. Build tree to
maximum size, without pruning. As tree is built, allow only subset of
total set of predictor variables to be considered as possible splitters
for each node. Select set of predictors to be considered as random
subset of total set of available predictors.
For example, if there are ten predictors, choose five randomly as
candidate splitters. Perform new random selection for each split. Some
predictors (possibly best one) will not be considered for each split, but
predictor excluded from one split may be used for another split in same
tree.
Random Forests
No Overfitting or Pruning.
"Over-fitting“: problem in large, single-tree models where model fits
noise in data ➔ poor generalization power ➔ pruning. In nearly all
cases, decision tree forests do not have problem with over-fitting, and no
need to prune trees in forest. Generally, more trees in forest, better fit.
Internal Measure of Test Set (Generalization) Error.
About 1/3 of observations excluded from each tree in forest, called “out
of bag (OOB)”: each tree has different set of out-of-bag observations ➔
each OOB set constitutes independent test sample.
To measure generalization error of decision tree forest, OOB set for each
tree is run through tree and error rate of prediction is computed.
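A short scikit-learn illustration of the OOB idea (mine, not the deck's): with oob_score=True, each observation is scored only by the trees for which it was out-of-bag, giving a built-in estimate of generalization error.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)    # accuracy estimated only from each tree's out-of-bag observations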
Detour: Found in the Internet: PCA and RF.
https://stats.stackexchange.com/questions/294791/
how-can-preprocessing-with-pca-but-keeping-the-same-dimensionality-improve-rando
?newsletter=1&nlcode=348729%7c8657
Discovery?
“PCA before random forest can be useful not for dimensionality reduction but to give you data a
shape where random forest can perform better.
I am quite sure that in general if you transform your data with PCA keeping the same dimensionality
of the original data you will have a better classification with random forest.”
Answer:
“Random forest struggles when the decision boundary is "diagonal" in the feature space because
RF has to approximate that diagonal with lots of "rectangular" splits. To the extent that PCA re-
orients the data so that splits perpendicular to the rotated & rescaled axes align well with the
decision boundary, PCA will help. But there's no reason to believe that PCA will help in general,
because not all decision boundaries are improved when rotated (e.g. a circle). And even if you do
have a diagonal decision boundary, or a boundary that would be easier to find in a rotated space,
applying PCA will only find that rotation by coincidence, because PCA has no knowledge at all about
the classification component of the task (it is not "y-aware").
Also, the following caveat applies to all projects using PCA for supervised learning: data rotated by PCA
may have little-to-no relevance to the classification objective.”
DO NOT BELIEVE EVERYTHING THAT APPEARS ON THE WEB!!!!! BE CRITICAL!!!
Further Developments.
Paluszynska (2017) focuses on providing better
information on variable importance using RF.
RF is constantly being researched and improved.
Detour: Underlying idea for boosting classification models (NOT yet GB).
(Freund, Schapire, 2012, Boosting: Foundations and Algorithms, MIT)
Start with model M(X) and obtain 80% accuracy, or 60% R2, etc.
Then Y = M(X) + error1. Hypothesize that error1 is still correlated with Y.
➔ Model error1 as error1 = G(X) + error2, and model error2 ….
In general, error(t − 1) = Z(X) + error(t) ➔
Y = M(X) + G(X) + ….. + Z(X) + error(t − k). If we find optimal beta weights to combine the models, then
Y = b1·M(X) + b2·G(X) + …. + bt·Z(X) + error(t − k).
Boosting is “Forward Stagewise Ensemble method” with single data set,
iteratively reweighting observations according to previous error, especially focusing on
wrongly classified observations.
Philosophy: Focus on most difficult points to classify in previous step by
reweighting observations.
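A minimal sketch of that reweighting philosophy (discrete AdaBoost with stumps; my illustration, not the deck's algorithm), assuming y is coded as −1/+1.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    # y in {-1, +1}; wrongly classified points get larger weights every round.
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)                    # up-weight the misclassified obs
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X_new):
    score = sum(a * s.predict(X_new) for s, a in zip(stumps, alphas))
    return np.sign(score)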
Main idea of GB using trees (GBDT).
Let Y be the target and X the predictors, and let f0(X) be a weak model that predicts just the mean value of Y (“weak” to avoid over-fitting).
Improve on f0(X) by creating f1(X) = f0(X) + h(x). If h were a perfect model ➔ f1(X) = y ➔ h(x) = y − f0(X) = residuals = negative gradients of the loss (or cost) function.
Residual fitting: for squared-error loss the gradient is −(y − f(x)), so the negative gradient is the residual; for absolute error it is −1 or +1.
Explanation of GB by way of example.
/blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
Predict age in the following data set by way of trees; continuous target ➔ regression tree.
Predict age, loss function: SSE.
PersonID  Age  LikesGardening  PlaysVideoGames  LikesHats
1 13 FALSE TRUE TRUE
2 14 FALSE TRUE FALSE
3 15 FALSE TRUE FALSE
4 25 TRUE TRUE TRUE
5 35 FALSE TRUE TRUE
6 49 TRUE FALSE FALSE
7 68 TRUE TRUE TRUE
8 71 TRUE FALSE FALSE
9 73 TRUE FALSE TRUE
Only 9 obs in the data, thus allow the tree to have a very small # of obs in its final nodes. We want the Videos variable because we suspect it’s important. But doing so (by allowing few obs in final nodes) also brought in a split on “hats”, which seems irrelevant and is just noise leading to over-fitting, because the tree model searches in smaller and smaller areas of the data as it progresses.
Let’s go in steps and look at the results of Tree1 (before the second splits), stopping at the first split, where the predictions are 19.25 and 57.2, and obtain residuals.
Tree 1: root split on LikesGardening (F: 19.25, T: 57.2), followed by further splits on Hats (F/T) and Videos (F/T).
Run another tree using Tree1 residuals as new target.
PersonID  Age  Tree1 Prediction  Tree1 Residual
1 13 19.25 -6.25
2 14 19.25 -5.25
3 15 19.25 -4.25
4 25 57.2 -32.2
5 35 19.25 15.75
6 49 57.2 -8.2
7 68 57.2 10.8
8 71 57.2 13.8
9 73 57.2 15.8
Tree 2: root split on PlaysVideoGames (F: 7.133, T: −3.567).
Note: Tree2 did not use “LikesHats” because, between Hats and VideoGames, VideoGames is preferred when using all obs, instead of, as in the full Tree1, only a smaller region of the data where Hats appears. And thus noise is avoided.
Tree 1 SSE = 1994 Tree 2 SSE = 1765
PersonID  Age  Tree1 Prediction  Tree1 Residual  Tree2 Prediction  Combined Prediction  Final Residual
1 13 19.25 -6.25 -3.567 15.68 2.683
2 14 19.25 -5.25 -3.567 15.68 1.683
3 15 19.25 -4.25 -3.567 15.68 0.6833
4 25 57.2 -32.2 -3.567 53.63 28.63
5 35 19.25 15.75 -3.567 15.68 -19.32
6 49 57.2 -8.2 7.133 64.33 15.33
7 68 57.2 10.8 -3.567 53.63 -14.37
8 71 57.2 13.8 7.133 64.33 -6.667
9 73 57.2 15.8 7.133 64.33 -8.667
Combined prediction for PersonID 1: 15.68 = 19.25 − 3.567.
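A minimal scikit-learn sketch (mine, not the deck's code) reproducing these two steps on the 9-person table: a depth-1 regression tree on Age, then a depth-1 tree on its residuals, combined by summing.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

age       = np.array([13, 14, 15, 25, 35, 49, 68, 71, 73])
gardening = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1])
videos    = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0])
hats      = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1])
X = np.column_stack([gardening, videos, hats])

tree1 = DecisionTreeRegressor(max_depth=1).fit(X, age)     # first split: LikesGardening
pred1 = tree1.predict(X)                                   # 19.25 / 57.2
resid1 = age - pred1                                       # Tree1 residuals

tree2 = DecisionTreeRegressor(max_depth=1).fit(X, resid1)  # splits on PlaysVideoGames
combined = pred1 + tree2.predict(X)                        # e.g. 15.68 for PersonID 1
print(np.round(combined, 2))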
So Far
1) Started with a ‘weak’ model F0(x) = mean(y).
2) Fitted a second model to the residuals: h1(x) = y – F0(x).
3) Combined the two previous models: F1(x) = F0(x) + h1(x).
Notice that h1(x) could be any type of model (stacking), not just trees. And continue recursing until M.
Initial weak model was “mean” because well known that mean minimizes SSE.
Q: how to choose M, gradient boosting hyper parameter? Usually cross-validation.
4) Alternative to the mean: minimize absolute error instead of SSE as the loss function. More expensive, because the minimizer is the median, which is computationally costlier. In this case, in Tree 1 above, use median(y) = 35, and obtain residuals.
PersonID Age F0 Residual0
1 13 35 -22
2 14 35 -21
3 15 35 -20
4 25 35 -10
5 35 35 0
6 49 35 14
7 68 35 33
8 71 35 36
9 73 35 38
Focus on observations 1 and 4, with residuals of −22 and −10 respectively, to understand the median case. Under the SSE loss function (standard regression tree), a reduction in residuals of 1 unit drops the SSE by 43 and 19 resp. (e.g., 22 * 22 − 21 * 21, 100 − 81), while for absolute loss the reduction is just 1 and 1 (22 − 21, 10 − 9) ➔
SSE reduction will focus more on the first observation because of the 43, while absolute error focuses on all obs equally because they are all 1 ➔
Instead of training subsequent trees on the residuals of F0, train h0 on the gradient of the loss function L(y, F0(x)) w.r.t. the y-hats produced by F0(x). With absolute-error loss, subsequent h trees will consider the sign of every residual, as opposed to SSE loss, which considers the magnitude of the residual.
Gradient of the SSE loss ½(Y − Ŷ)² w.r.t. Ŷ = −(Y − Ŷ), which is “− residual” ➔ this is a gradient descent algorithm. For Absolute Error, AE = |Y − Ŷ|:
Gradient of AE = dAE/dŶ = −(Y − Ŷ) / |Y − Ŷ| = −1 or +1.
Each h tree groups observations into final nodes, and the average gradient can be calculated in each node and scaled by a factor γ, such that Fm + γm hm minimizes the loss function in each node.
Shrinkage: for each gradient step, the magnitude is multiplied by a factor that ranges between 0 and 1, called the learning rate ➔ each gradient step is shrunken, allowing slow convergence toward the observed values ➔ observations close to their target values end up grouped into larger nodes, thus regularizing the method.
Finally, before each new tree step, row and column sampling occur to produce more diverse tree splits (similar to Random Forests).
Results for SSE and Absolute Error: SSE case
Age  F0  PseudoResidual0  h0  gamma0  F1  PseudoResidual1  h1  gamma1  F2
13 40.33 -27.33 -21.08 1 19.25 -6.25 -3.567 1 15.68
14 40.33 -26.33 -21.08 1 19.25 -5.25 -3.567 1 15.68
15 40.33 -25.33 -21.08 1 19.25 -4.25 -3.567 1 15.68
25 40.33 -15.33 16.87 1 57.2 -32.2 -3.567 1 53.63
35 40.33 -5.333 -21.08 1 19.25 15.75 -3.567 1 15.68
49 40.33 8.667 16.87 1 57.2 -8.2 7.133 1 64.33
68 40.33 27.67 16.87 1 57.2 10.8 -3.567 1 53.63
71 40.33 30.67 16.87 1 57.2 13.8 7.133 1 64.33
73 40.33 32.67 16.87 1 57.2 15.8 7.133 1 64.33
h0: root split on Gardening (F: −21.08, T: 16.87). h1: root split on Videos (F: 7.133, T: −3.567).
E.g., for first observation. 40.33 is mean age, -27.33 = 13 – 40.33, -21.08 prediction due to
gardening = F. F1 = 19.25 = 40.33 – 21.08. PseudoRes1 = 13 – 19.25, F2 = 19.25 – 3.567 = 15.68.
Gamma0 = avg (pseudoresidual0 / h0) (by diff. values of h0). Same for gamma1.
Results for SSE and Absolute Error: Absolute Error case.
h0: root split on Gardening (F: −1, T: 0.6). h1: root split on Videos (F: 0.333, T: −0.333).
Age  F0  PseudoResidual0  h0  gamma0  F1  PseudoResidual1  h1  gamma1  F2
13 35 -1 -1 20.5 14.5 -1 -0.3333 0.75 14.25
14 35 -1 -1 20.5 14.5 -1 -0.3333 0.75 14.25
15 35 -1 -1 20.5 14.5 1 -0.3333 0.75 14.25
25 35 -1 0.6 55 68 -1 -0.3333 0.75 67.75
35 35 -1 -1 20.5 14.5 1 -0.3333 0.75 14.25
49 35 1 0.6 55 68 -1 0.3333 9 71
68 35 1 0.6 55 68 -1 -0.3333 0.75 67.75
71 35 1 0.6 55 68 1 0.3333 9 71
73 35 1 0.6 55 68 1 0.3333 9 71
E.g., for the 1st observation: 35 is the median age; the pseudo-residual is the sign of the residual, −1 if the residual < 0 and +1 if the residual > 0.
F1 = 14.5 because 35 + 20.5 * (−1).
F2 = 14.25 = 14.5 + 0.75 * (−0.3333).
Predictions within leaf nodes are computed by the “mean” of the obs therein.
Gamma0 = median((age − F0) / h0) by node: 20.5 = avg((14 − 35)/(−1), (15 − 35)/(−1)); 55 = (68 − 35) / 0.6.
Gamma1 = median((age − F1) / h1) by different values of h1 (and of h0 for gamma0).
Quick description of GB using trees (GBDT).
1) Create a very small tree as initial model, a ‘weak’ learner (e.g., a tree with two terminal nodes ➔ depth = 1). ‘Weak’ avoids over-fitting and local minima; it produces a prediction, F1, for each obs. Tree1.
2) Each tree allocates a probability of event or a mean value in each terminal node, according
to the nature of the dependent variable or target.
3) Compute “residuals” (prediction error) for every observation (if 0-1 target, apply the logistic transformation to linearize them, i.e., odds = p / (1 – p)).
4) Use residuals as new ‘target variable and grow second small tree on them (second stage of
the process, same depth). To ensure against over-fitting, use random sample without
replacement (➔ “stochastic gradient boosting”.) Tree2.
5) New model, once second stage is complete, we obtain concatenation of two trees, Tree1 and
Tree2 and predictions F1 + F2 * gamma, gamma multiplier or shrinkage factor (called step
size in gradient descent).
6) Iterate procedure of computing residuals from most recent tree, which become the target of
the new model, etc.
7) In the case of a binary target variable, each tree produces at least some nodes in which
‘event’ is majority (‘events’ are typically more difficult to identify since most data sets
contain very low proportion of ‘events’ in usual case).
8) Final score for each observation is obtained by summing (with weights) the different scores
(probabilities) of every tree for each observation.
Why does it work? Why “gradient” and “boosting”?
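To make these knobs concrete, a short scikit-learn sketch (my illustration, not the deck's code): depth, number of iterations, shrinkage (learning rate) and the subsample fraction from point 4 (“stochastic gradient boosting”); the simulated ~20%-event data only mimics the case study that follows.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=3000, weights=[0.8], random_state=0)  # ~20% events
gbdt = GradientBoostingClassifier(max_depth=1,        # depth of each weak tree
                                  n_estimators=10,    # number of iterations
                                  learning_rate=0.1,  # shrinkage / step size
                                  subsample=0.6,      # sampling w/o replacement -> stochastic GB
                                  random_state=0).fit(X, y)
print(gbdt.predict_proba(X[:5])[:, 1])                # summed-tree event probabilities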
Comparing GBDT vs Trees in point 4 above (I).
GBDT takes a sample from the training data to create a tree at each iteration; CART does not. Below, notice the differences between GBDT with a 60% sample proportion and the generic tree with no sampling, for the fraud data set; total_spend is the target. Predictions are similar.
IF doctor_visits < 8.5 THEN DO; /* GBDT */
_prediction_ + -1208.458663;
END;
ELSE DO;
_prediction_ + 1360.7910083;
END;
IF 8.5 <= doctor_visits THEN DO; /* GENERIC TREES*/
P_pseudo_res0 = 1378.74081896893;
END;
ELSE DO;
P_pseudo_res0 = -1290.94575707227;
END;
Comparing GBDT vs Trees in point 4 above (II).
Again, GBDT takes sample from training data to create tree at each
iteration, CART does not. If we allow for CART to work with same
proportion sample but different seed, splitting variables may be different at
specific depth of tree creation.
/* GBDT */
IF doctor_visits < 8.5 THEN DO;
   _ARB_F_ + -579.8214325;
END;
ELSE DO;
   _ARB_F_ + 701.49142697;
END;

/* ORIGINAL TREES */
IF 183.5 <= member_duration THEN DO;
   P_pseudo_res0 = 1677.87318718526;
END;
ELSE DO;
   P_pseudo_res0 = -1165.32773940565;
END;

EDA of the two samples would indicate subtle differences that induce differences in the selected splitting variables.
More Details
Friedman’s general 2001 GB algorithm:
1) Data (Y, X), Y (N, 1), X (N, p)
2) Choose # iterations M
3) Choose loss function L(Y, F(x), Error), and corresponding gradient, i.e., 0-1 loss
function, and residuals are corresponding gradient. Function called ‘f’. Loss f
implied by Y.
4) Choose base learner h( X, θ), say shallow trees.
Algorithm:
1: initialize f0 with a constant, usually mean of Y.
2: for t = 1 to M do
3: compute negative gradient gt(x), i.e., residual from Y as next target.
4: fit a new base-learner function h(x, θt), i.e., tree.
5: find the best gradient-descent step-size γ_t > 0 that minimizes the loss:
   γ_t = argmin over γ of Σ (i=1..n) L(y_i, f_{t-1}(x_i) + γ h_t(x_i))
6: update the function estimate:
   f_t(x) = f_{t-1}(x) + γ_t h_t(x, θ_t)
7: end for
(all f functions are function estimates, i.e., ‘hats’).
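A minimal from-scratch sketch of this loop (my illustration, under the assumption of a differentiable loss and shallow regression trees as base learners); the step-size line search simply uses a scalar optimizer, and gb_fit, sse, sse_grad are my names.

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, loss, loss_grad, M=50, max_depth=1):
    f = np.full(len(y), y.mean())                    # 1: initialize f0 with a constant
    trees, gammas = [], []
    for _ in range(M):                               # 2: for t = 1 to M
        g = -loss_grad(y, f)                         # 3: negative gradient = new target
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)   # 4: base learner
        hx = h.predict(X)
        gamma = minimize_scalar(lambda s: loss(y, f + s * hx)).x   # 5: step-size line search
        f = f + gamma * hx                           # 6: update the function estimate
        trees.append(h)
        gammas.append(gamma)
    return trees, gammas

# e.g. squared-error loss L = 0.5*(y - f)^2, whose gradient w.r.t. f is -(y - f)
sse      = lambda y, f: 0.5 * np.sum((y - f) ** 2)
sse_grad = lambda y, f: -(y - f)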
Specifics of Tree Gradient Boosting, called TreeBoost (Friedman).
Friedman’s 2001 GB algorithm for tree methods:
Same as previous one, and
h_t(x) = Σ (j=1..J) p_jt I(x ∈ N_jt),
where p_jt = prediction of tree t in final node N_jt.

In TreeBoost, Friedman proposes to find an optimal γ_jt in each final node N_jt, instead of a unique γ at every iteration. Then

f_t(x) = f_{t-1}(x) + Σ (j=1..J) γ_jt h_t(x) I(x ∈ N_jt),
γ_jt = argmin over γ of Σ (x_i ∈ N_jt) L(y_i, f_{t-1}(x_i) + γ h_t(x_i)).
Parallels with Stepwise (regression) methods.
Stepwise starts from original Y and X, and in later iterations
turns to residuals, and reduced and orthogonalized X matrix,
where ‘entered’ predictors are no longer used and
orthogonalized away from other predictors.
GBDT uses residuals as targets, but does not orthogonalize or
drop any predictors.
Stepwise stops either by statistical inference, or AIC/BIC
search. GBDT has a fixed number of iterations.
Stepwise has no ‘gamma’ (shrinkage factor).
Setting. ***
Hypothesize existence of function Y = f (X, betas, error). Change of
paradigm, no MLE (e.g., logistic, regression, etc) but loss function.
Minimize Loss function itself, its expected value called risk. Many different
loss functions available, gaussian, 0-1, etc.
A loss function describes the loss (or cost) associated with all possible
decisions. Different decision functions or predictor functions will tend
to lead to different types of mistakes. The loss function tells us which
type of mistakes we should be more concerned about.
For instance, estimating demand, decision function could be linear equation
and loss function could be squared or absolute error.
The best decision function is the function that yields the lowest expected
loss, and the expected loss function is itself called risk of an estimator. 0-1
assigns 0 for correct prediction, 1 for incorrect.
Key Details. ***
Friedman’s 2001 GB algorithm: Need
1) Loss function (usually determined by nature of Y (binary,
continuous…)) (NO MLE).
2) Weak learner, typically tree stump or spline, marginally better
classifier than random (but by how much?).
3) Model with T iterations:
   ŷ_i = Σ (t=1..T) tree_t(X_i)
   Objective function: Σ (i=1..n) L(y_i, ŷ_i) + Σ (t=1..T) Ω(Tree_t)
   Ω = { # nodes in each tree; L2 or L1 norm of leaf weights; other }. This function is not directly optimized by GB.
L2 error penalizes symmetrically away from 0, Huber penalizes less than OLS away from [−1, 1], and Bernoulli and Adaboost are very similar. Note that Y ∈ {−1, 1} in the 0-1 case here.
Gradient Descent.
“Gradient” descent method to find minimum of function.
Gradient: multivariate generalization of derivative of function in one
dimension to many dimensions. I.e., gradient is vector of partial
derivatives. In one dimension, gradient is tangent to function.
Easier to work with convex and “smooth” functions.
(Illustration: a convex vs. a non-convex function.)
Gradient Descent.
Let L (x1, x2) = 0.5 * (x1 – 15) **2 + 0.5 * (x2 – 25) ** 2, and solve for X1 and X2 that min L by gradient
descent.
Steps:
Take M = 100. Starting point s0 = (0, 0). Step size γ = 0.1.
Iterate m = 1 to M:
1. Calculate the gradient of L at s_{m-1}.
2. Step in the direction of greatest descent (the negative gradient) with step size γ, i.e., s_m = s_{m-1} − γ ∇L(s_{m-1}).
If γ is small and M large, s_M minimizes L.
Additional considerations:
Instead of M iterations, stop when next improvement small.
Use line search to choose step sizes (Line search chooses search in descent direction of
minimization).
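A tiny numpy sketch of this toy problem (mine, not the deck's): the gradient is (x1 − 15, x2 − 25), step size 0.1, 100 iterations from (0, 0).

import numpy as np

def grad_L(s):
    # gradient of L(x1, x2) = 0.5*(x1 - 15)**2 + 0.5*(x2 - 25)**2
    return np.array([s[0] - 15.0, s[1] - 25.0])

s, step = np.array([0.0, 0.0]), 0.1   # starting point s0, step size
for _ in range(100):                  # M = 100 iterations
    s = s - step * grad_L(s)          # step against the gradient
print(s)                              # close to the minimizer (15, 25)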
How does it work in gradient boosting?
Objective is Min L, starting from F0(x). For m = 1, compute gradient of L w.r.t F0(x). Then fit weak learner
to gradient components ➔ for regression tree, obtain average gradient in each final node. In each node,
step in direction of avg. gradient using line search to determine step magnitude. Outcome is F1, and
repeat. In symbols:
Initialize the model with a constant: F0(x) = mean, median, etc.
For m = 1 to M:
   compute the pseudo-residuals;
   fit a base learner h_m to the residuals;
   compute the step magnitude γ_m (for trees, a different γ for each node);
   update F_m(x) = F_{m-1}(x) + γ_m h_m(x).
“Gradient” descent
Method of gradient descent is a first order optimization algorithm that is based on taking
small steps in direction of the negative gradient at one point in the curve in order to find
the (hopefully global) minimum value (of loss function). If it is desired to search for the
maximum value instead, then the positive gradient is used and the method is then called
gradient ascent.
Second order not searched, solution could be local minimum.
Requires starting point, possibly many to avoid local minima.
Comparing a full tree (depth = 6) to boosted-tree residuals by iteration.
2 GB versions: 1) raw 20% events (M1), 2) a 50/50 mixture of events (M2). The non-GB tree (referred to as maxdepth 6, on the M1 data set) is the most biased. Notice that M2 stabilizes earlier than M1. X axis: iteration #. Y axis: average residual. “Tree Depth 6” is obviously unaffected by iteration since it is a single tree run.
[Figure: average residuals by iteration by model name in gradient boosting (MEAN_RESID_M1_TRN_TREES, MEAN_RESID_M2_TRN_TREES); X axis: iteration (0–10); Y axis: mean residual; a vertical line marks where the mean stabilizes. All mean residuals are on the order of 1E−15 (tree depth 6 ≈ 2.83E−15).]
Comparing a full tree (depth = 6) to boosted-tree residuals by iteration.
Now Y = variance of residuals. M2 has the highest variance, followed by depth 6 (single tree) and then M1. M2 stabilizes earlier as well. In conclusion, M2 has lower bias and higher variance than M1 in this example; the difference lies in the mixture of 0-1 in the target variable.
[Figure: variance of residuals by iteration in gradient boosting (VAR_RESID_M1_TRN_TREES, VAR_RESID_M2_TRN_TREES); X axis: iteration (0–10); Y axis: variance of residuals (M1 ≈ 0.1219, M2 ≈ 0.1781, depth-6 tree ≈ 0.1458); a vertical line marks where the variance stabilizes.]
Basic information on the original data sets:

Data set name ........................ train
. # TRN obs ............... 3595
Validation data set .................. validata
. # VAL obs ............... 2365
Test data set ........................
. # TST obs ............... 0
Dep variable ......................... fraud
Pct Event Prior TRN .................. 20.389
Pct Event Prior VAL .................. 19.281
Pct Event Prior TEST .................
TRN and VAL data sets obtained by random sampling without replacement.
Variable          Label
FRAUD             Fraudulent Activity yes/no
total_spend       Total spent on opticals
doctor_visits     Total visits to a doctor
no_claims         No of claims made recently
member_duration   Membership duration
optom_presc       Number of opticals claimed
num_members       Number of members covered
Fraud data set, original 20% fraudsters.
Study alternatives, changing the number of iterations from 3 to 50 and the depth from 1 to 10, with training and validation data sets.
Original percentage of fraudsters: 20% in both data sets.
Notice just 5 predictors, thus a max of 50 iterations is an exaggeration. In usual large databases, the number of iterations could reach 1000 or higher.
E.g., M5_VAL_GRAD_BOOSTING: M5 case with validation data set and using
gradient boosting as modeling technique. Model # 10 as identifier.
Requested Models: Names & Descriptions.

Overall models:
M1  Raw 20pct, depth 1, iterations 3
M2  Raw 20pct, depth 1, iterations 10
M3  Raw 20pct, depth 5, iterations 3
M4  Raw 20pct, depth 5, iterations 10
M5  Raw 20pct, depth 10, iterations 50

Model #   Full Model Name             Model Description
1         01_M1_TRN_GRAD_BOOSTING     Gradient Boosting
2         02_M1_VAL_GRAD_BOOSTING     Gradient Boosting
3         03_M2_TRN_GRAD_BOOSTING     Gradient Boosting
4         04_M2_VAL_GRAD_BOOSTING     Gradient Boosting
5         05_M3_TRN_GRAD_BOOSTING     Gradient Boosting
6         06_M3_VAL_GRAD_BOOSTING     Gradient Boosting
7         07_M4_TRN_GRAD_BOOSTING     Gradient Boosting
8         08_M4_VAL_GRAD_BOOSTING     Gradient Boosting
9         09_M5_TRN_GRAD_BOOSTING     Gradient Boosting
10        10_M5_VAL_GRAD_BOOSTING     Gradient Boosting
All agree on no_claims as the first split, but not at the same values, and they yield different event probabilities.
Note M2 split
Constrained GB parameters may create undesirable models, but GB parameters with high values (gamma, iterations) may lead to running times that are too long, especially when models have to be re-touched.
Variable importance is model dependent and could lead to misleading conclusions.
Goodness of Fit.
Prediction probability increases with model complexity. Also, better discrimination with M5. Fixed probability bins vs. a lift table with an equal number of observations per bin.
M5 best per AUROC, also when validated.
Specific GOFs, in rank order.
GOF ranks (training data). GOF measures: AUROC, Avg Square Error (ASE), Cum Lift 3rd bin, Cum Resp Rate 3rd bin, Gini, Rsquare Cramer-Tjur, and their unweighted mean.

Model Name                AUROC  ASE  CumLift3rd  CumResp3rd  Gini  R2 Cramer-Tjur  Unw. Mean
01_M1_TRN_GRAD_BOOSTING     5     5       5           5         5        5             5.00
03_M2_TRN_GRAD_BOOSTING     4     4       4           4         4        4             4.00
05_M3_TRN_GRAD_BOOSTING     3     3       3           3         3        3             3.00
07_M4_TRN_GRAD_BOOSTING     2     2       2           2         2        2             2.00
09_M5_TRN_GRAD_BOOSTING     1     1       1           1         1        1             1.00

GOF ranks (validation data), same measures:

Model Name                AUROC  ASE  CumLift3rd  CumResp3rd  Gini  R2 Cramer-Tjur  Unw. Mean
02_M1_VAL_GRAD_BOOSTING     5     5       5           5         5        5             5.00
04_M2_VAL_GRAD_BOOSTING     4     4       4           4         4        4             4.00
06_M3_VAL_GRAD_BOOSTING     3     3       3           3         3        3             3.00
08_M4_VAL_GRAD_BOOSTING     2     2       2           2         2        2             2.00
10_M5_VAL_GRAD_BOOSTING     1     1       1           1         1        1             1.00
M5 winner.
Huge jump in performance per the R-square measure.
Overall conclusion for GB parameters:
While higher values of the number of iterations and depth imply longer (and possibly significantly longer) computer runs, constraining these parameters can have significant negative effects on model results.
In the context of thousands of predictors, computer resource availability might significantly affect model results.
Overall Ensembles.
Given specific classification study and many different modeling techniques,
create logistic regression model with original target variable and the
different predictions from the different models, without variable selection
(this is not critical).
Method also called Platt’s calibration or scaling (Platt, 2000), although used
in context of smoothing posterior probabilities, and not ensembling models.
Evaluate importance of different models either via p-values or partial
dependency plots.
Note: It is not Stacking, because Stacking “votes” to decide on final
classification.
Additional ‘easier’ ways to create ensembles: mean, median, etc. of
predictive probabilities from different models.
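A minimal sketch of this overall ensemble (my illustration, not the deck's code): a logistic regression of the original target on the predicted probabilities of previously fitted models. The simulated data and model choices are mine, and in practice the ensemble should be fit on held-out predictions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, weights=[0.8], random_state=0)

models = [RandomForestClassifier(random_state=0).fit(X, y),
          GradientBoostingClassifier(random_state=0).fit(X, y)]

# logistic regression of the original target on the models' predicted
# probabilities, without variable selection
P = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
ensemble = LogisticRegression().fit(P, y)
print(ensemble.coef_)     # each model's weight in the overall ensemble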
Partial Dependency plots (PDP).
Due to GB’s (and other methods’) black-box nature, and to put them on
similar footing to linear methods (linear and logistic regressions) these plots
show effect of X on modeled response considering all other covariates
measured at their means.
Graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome. Let F be the model of the response on the X vars and F* the model predictions. Then:
The PDP of F on x1 is the average of F(x1, x2, …, xp) over the observations, evaluated at every point of x1. Note that x2, …, xp are not measured at their means; their effects are summarized by the average. PDPs could instead be measured at medians, etc. LR evaluates the coefficient of x1 assuming all other vars are frozen, in fact uncorrelated, which is formally wrong. It is also possible to obtain the PDP of F on pairs of variables (x1, x2), given x3, …, xp.
Since GB, Boosting, Bagging, etc. are BLACK BOX models, use PDP to
obtain model interpretation. Also useful for regression based models.
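A minimal sketch of the PDP computation just described (mine, not the deck's code), assuming a fitted classifier with predict_proba: for each grid value of the chosen variable, freeze that column at the value for every observation and average the predictions.

import numpy as np

def partial_dependence(model, X, feature, grid):
    # For each grid value v: freeze column `feature` at v for all rows,
    # then average the model's predicted event probabilities.
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        pdp.append(model.predict_proba(Xv)[:, 1].mean())
    return np.array(pdp)

# e.g.: grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
#       pdp  = partial_dependence(gbdt, X, feature=0, grid=grid)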
References

Christodoulou E., Ma J., Collins G.S., Steyerberg E.W., Verbakel J.Y., Van Calster B. (2019): A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol, 2019 Feb 11, pii: S0895-4356(18)31081-3, doi:10.1016/j.jclinepi.2019.02.004.

Platt J. (2000): Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3): 61–74.
Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 8110/7/2019

More Related Content

Similar to 4 2 ensemble models and grad boost part 1 2019-10-07

Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docxStrategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
cpatriciarpatricia
 

Similar to 4 2 ensemble models and grad boost part 1 2019-10-07 (20)

4 1 tree world
4 1 tree world4 1 tree world
4 1 tree world
 
Classification methods and assessment
Classification methods and assessmentClassification methods and assessment
Classification methods and assessment
 
Longintro
LongintroLongintro
Longintro
 
4 2 ensemble models and grad boost part 2 2019-10-07
4 2 ensemble models and grad boost part 2 2019-10-074 2 ensemble models and grad boost part 2 2019-10-07
4 2 ensemble models and grad boost part 2 2019-10-07
 
Classification methods and assessment.pdf
Classification methods and assessment.pdfClassification methods and assessment.pdf
Classification methods and assessment.pdf
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
 
EPFL workshop on sparsity
EPFL workshop on sparsityEPFL workshop on sparsity
EPFL workshop on sparsity
 
4 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-074 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-07
 
4 MEDA.pdf
4 MEDA.pdf4 MEDA.pdf
4 MEDA.pdf
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
Intepretable Machine Learning
Intepretable Machine LearningIntepretable Machine Learning
Intepretable Machine Learning
 
M3R.FINAL
M3R.FINALM3R.FINAL
M3R.FINAL
 
4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf
 
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docxStrategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
Strategic Sourcing Quantitative Decision-Making and Analytics Mo.docx
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning Algorithm
 
Cb36469472
Cb36469472Cb36469472
Cb36469472
 
Data Science An Engineering Implementation Perspective
Data Science An Engineering Implementation PerspectiveData Science An Engineering Implementation Perspective
Data Science An Engineering Implementation Perspective
 
4 meda
4 meda4 meda
4 meda
 
Max diff
Max diffMax diff
Max diff
 
Algoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nyaAlgoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nya
 

More from Leonardo Auslender

4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf
Leonardo Auslender
 

More from Leonardo Auslender (17)

1 UMI.pdf
1 UMI.pdf1 UMI.pdf
1 UMI.pdf
 
Suppression Enhancement.pdf
Suppression Enhancement.pdfSuppression Enhancement.pdf
Suppression Enhancement.pdf
 
4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf
 
4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf
 
4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf
 
Linear Regression.pdf
Linear Regression.pdfLinear Regression.pdf
Linear Regression.pdf
 
2 UEDA.pdf
2 UEDA.pdf2 UEDA.pdf
2 UEDA.pdf
 
3 BEDA.pdf
3 BEDA.pdf3 BEDA.pdf
3 BEDA.pdf
 
1 EDA.pdf
1 EDA.pdf1 EDA.pdf
1 EDA.pdf
 
0 Statistics Intro.pdf
0 Statistics Intro.pdf0 Statistics Intro.pdf
0 Statistics Intro.pdf
 
0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf
 
3 beda
3 beda3 beda
3 beda
 
2 ueda
2 ueda2 ueda
2 ueda
 
1 eda
1 eda1 eda
1 eda
 
0 statistics intro
0 statistics intro0 statistics intro
0 statistics intro
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Classification methods and assessment
Classification methods and assessmentClassification methods and assessment
Classification methods and assessment
 

Recently uploaded

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 

Recently uploaded (20)

5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 

4 2 ensemble models and grad boost part 1 2019-10-07

  • 11. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 1110/7/2019 Ensembles. Bagging (bootstrap aggregation, Breiman, 1996): Adding randomness ➔ improves function estimation. Variance reduction technique, reducing MSE. Let initial data size n. 1) Construct bootstrap sample by randomly drawing n times with replacement (note, some observations repeated). 2) Compute sample estimator (logistic or regression, tree, ANN … Tree in practice). 3) Redo B times, B large (50 – 100 or more in practice, but unknown). 4) Bagged estimator. For classification, Breiman recommends majority vote of classification for each observation. Buhlmann (2003) recommends averaging bootstrapped probabilities. Note that individual obs may not appear B times each. NB: Independent sequence of trees. What if …….? Reduces prediction error by lowering variance of aggregated predictor while maintaining bias almost constant (variance/bias trade-off). Friedman (1998) reconsidered boosting and bagging in terms of gradient descent algorithms, seen later on.
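As a rough illustration of the bagging recipe above (not the author's implementation), a minimal sketch in Python, assuming numpy and scikit-learn are available and using a synthetic data set; it shows both Breiman's majority vote and Buhlmann's averaging of bootstrapped probabilities:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

B = 100                      # number of bootstrap replicates (50-100+ in practice)
n = len(y)
trees = []
for b in range(B):
    idx = rng.randint(0, n, size=n)          # draw n rows with replacement
    tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    trees.append(tree)

# Breiman: majority vote of the B classifications for each observation.
votes = np.stack([t.predict(X) for t in trees])            # shape (B, n)
vote_pred = (votes.mean(axis=0) > 0.5).astype(int)

# Buhlmann: average the bootstrapped probabilities instead.
probs = np.stack([t.predict_proba(X)[:, 1] for t in trees])
prob_pred = (probs.mean(axis=0) > 0.5).astype(int)

print("agreement between the two aggregation rules:", np.mean(vote_pred == prob_pred))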
  • 12. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 1210/7/2019 From http://dni-institute.in/blogs/bagging-algorithm-concepts-with-example/
  • 13. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-1310/7/2019 Ensembles Evaluation: Empirical studies: boosting (seen later) yields smaller misclassification rates than bagging, with reduction of both bias and variance. Different boosting algorithms (Breiman’s arc-x4 and arc-gv). In cases with substantial noise, bagging performs better. Especially used in clinical studies. Why does Bagging work? Breiman: bagging is successful because it reduces the instability of the prediction method. Unstable: small perturbations in data ➔ large changes in predictor. Experimental results show variance reduction. Studies suggest that bagging performs some smoothing on the estimates. Grandvalet (2004) argues that bootstrap sampling equalizes effects of highly influential observations. Disadvantage: cannot be visualized easily.
  • 14. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 1410/7/2019 Ensembles Adaptive Bagging (Breiman, 2001): Mixes bias-reduction boosting with variance-reduction bagging. Uses out-of-bag obs to halt optimizer. Stacking: Previously, same technique used throughout. Stacking (Wolpert 1992) combines different algorithms on single data set. Voting is then used for final classification. Ting and Witten (1999) “stack” the probability distributions (PD) instead. Stacking is “meta-classifier”: combines methods. Pros: takes best from many methods. Cons: un-interpretable, mixture of methods become black-box of predictions. Stacking very prevalent in WEKA.
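A minimal stacking sketch (assumption: scikit-learn >= 0.22, which provides StackingClassifier; synthetic data, illustrative only). Different algorithms are combined on a single data set and a meta-classifier is fit on their out-of-fold predictions:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("logit", LogisticRegression(max_iter=1000)),
]
# The final estimator ("meta-classifier") is fit on cross-validated predictions
# of the base learners, then combines them into one prediction.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X_trn, y_trn)
print("validation accuracy:", stack.score(X_val, y_val))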
  • 15. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 1510/7/2019 5.3) Tree World. 5.3.1) L. Breiman: Bagging. 5.3.2) L. Breiman: Random Forests.
  • 16. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 16 Explanation by way of football example for The Saints. *** https://gormanalysis.com/random-forest-from-top-to-bottom/

       Opponent   OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin  SaintsWon
    1  Falcons     28    TRUE          TRUE            TRUE            TRUE
    2  Cowgirls    16    TRUE          TRUE            TRUE            TRUE
    3  Eagles      30    FALSE         FALSE           TRUE            TRUE
    4  Bucs         6    TRUE          FALSE           TRUE            FALSE
    5  Bucs        14    TRUE          FALSE           FALSE           FALSE
    6  Panthers     9    FALSE         TRUE            TRUE            FALSE
    7  Panthers    18    FALSE         FALSE           FALSE           FALSE

  Goal: predict when Saints will win. 5 Predictors: Opponent, opponent rank, home game, expert1 and expert2 predictions. If run tree, just one split on opponent because Saints lost to Bucs and Panthers and perfect separation then, but useless for future opponents. Instead, at each step randomly select subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which when combined, should be a smart model.
  [Figure: three example trees grown on random feature subsets; splits shown include OppRank <= 15, OppRank <= 12.5, Opponent in {Cowgirls, Eagles, Falcons}, and Expert2PredWin (FALSE = left, TRUE = right).]
  • 17. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 1710/7/2019 Table of Opponent by SaintsWon

    Opponent   SaintsWon
    Frequency | FALSE | TRUE | Total
    ----------+-------+------+------
    BUCS      |   2   |  0   |   2
    COWGIRLS  |   0   |  1   |   1
    EAGLES    |   0   |  1   |   1
    FALCONS   |   0   |  1   |   1
    PANTHERS  |   2   |  0   |   2
    ----------+-------+------+------
    Total         4      3       7

  Notice perfect separation of Opponent and SaintsWon (target in model). Model would be a perfect fit with just Opponent, but useless for other opponent teams.
  • 18. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 1810/7/2019
  • 19. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 1910/7/2019 Assume following test data and predictions:

  Test data:
       Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin
    1  Falcons     1    TRUE          TRUE            TRUE
    2  Falcons    32    TRUE          TRUE            FALSE
    3  Falcons    32    TRUE          FALSE           TRUE

  Predictions:
              Tree1  Tree2  Tree3  MajorityVote
    Sample1   FALSE  FALSE  TRUE   FALSE
    Sample2   TRUE   FALSE  TRUE   TRUE
    Sample3   TRUE   TRUE   TRUE   TRUE

  Note that probability can be ascribed by counting # votes for each predicted target class, and this yields a good ranking of probabilities for the different classes. But problem: if “OppRk” (2nd best predictor) is in the initial group of 3 with “Opponent”, it won’t be used as splitter because “Opponent” is perfect. Note that there are 10 ways to choose 3 out of 5, and each predictor appears 6 times ➔ “Opponent” dominates 60% of trees, while OppRk appears without “Opponent” just 30% of the time. Could mitigate this effect by also sampling training obs used to develop the model, giving OppRk a higher chance to be root (not shown).
  • 20. Further, assume that Expert2 gives perfect predictions when Saints lose (not when they win). Right now, Expert2 as predictor is lost, but if resampling is with replacement, there is a higher chance to use Expert2 as predictor because more losses might just appear. Summary: Data with N rows and p predictors: 1) Determine # of trees to grow. 2) For each tree: randomly sample n <= N rows with replacement; create tree with m <= p predictors selected randomly at each non-final node. 3) Combine different tree predictions by majority voting (classification trees) or averaging (regression trees). Note that voting can be replaced by average of probabilities, and averaging by medians.
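A hand-rolled sketch of that summary recipe (assumptions: Python with numpy and scikit-learn, synthetic data; scikit-learn's max_features option stands in for "m of p predictors tried at each node"):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=600, n_features=5, n_informative=3, random_state=0)
N, p = X.shape
n_trees, m = 200, 3            # number of trees; predictors tried per split

forest = []
for b in range(n_trees):
    rows = rng.randint(0, N, size=N)                  # n = N rows, with replacement
    tree = DecisionTreeClassifier(max_features=m,     # random m of p at each node
                                  random_state=b).fit(X[rows], y[rows])
    forest.append(tree)

votes = np.stack([t.predict(X) for t in forest])      # (n_trees, N)
majority = (votes.mean(axis=0) > 0.5).astype(int)     # majority vote
print("training accuracy of the hand-rolled forest:", np.mean(majority == y))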
  • 21. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-2110/7/2019 Definition of Random Forests. Decision Tree Forest: ensemble (collection) of decision trees whose predictions are combined to make the overall prediction for the forest. Similar to TreeBoost (Gradient Boosting) model because a large number of trees are grown. However, TreeBoost generates a series of trees with the output of one tree going into the next tree in the series. In contrast, a decision tree forest grows a number of independent trees in parallel, and they do not interact until after all of them have been built. Disadvantage: complex model, cannot be visualized like a single tree. More “black box”, like a neural network ➔ advisable to create both single-tree and tree forest models. The single-tree model can be studied to get an intuitive understanding of how the predictor variables relate, and the decision tree forest model can be used to score data and generate highly accurate predictions.
  • 22. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-2210/7/2019 Random Forests 1. Random sample of N observations with replacement (“bagging”). On average, about 2/3 of rows selected. Remaining 1/3 called “out of bag (OOB)” obs. New random selection is performed for each tree constructed. 2. Using obs selected in step 1, construct decision tree. Build tree to maximum size, without pruning. As tree is built, allow only subset of total set of predictor variables to be considered as possible splitters for each node. Select set of predictors to be considered as random subset of total set of available predictors. For example, if there are ten predictors, choose five randomly as candidate splitters. Perform new random selection for each split. Some predictors (possibly best one) will not be considered for each split, but predictor excluded from one split may be used for another split in same tree.
  • 23. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-2310/7/2019 Random Forests. No Overfitting or Pruning. "Over-fitting“: problem in large, single-tree models where model fits noise in data ➔ poor generalization power ➔ pruning. In nearly all cases, decision tree forests do not have a problem with over-fitting, and there is no need to prune trees in the forest. Generally, more trees in forest, better fit. Internal Measure of Test Set (Generalization) Error. About 1/3 of observations excluded from each tree in forest, called “out of bag (OOB)”: each tree has a different set of out-of-bag observations ➔ each OOB set constitutes an independent test sample. To measure generalization error of the decision tree forest, the OOB set for each tree is run through the tree and the prediction error rate is computed.
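The OOB generalization estimate is exposed directly by scikit-learn's RandomForestClassifier (an assumption of this sketch; synthetic data, illustrative only):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            bootstrap=True, random_state=1).fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)   # internal test-set measure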
  • 24. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 2410/7/2019 Detour: Found in the Internet: PCA and RF. https://stats.stackexchange.com/questions/294791/ how-can-preprocessing-with-pca-but-keeping-the-same-dimensionality-improve-rando ?newsletter=1&nlcode=348729%7c8657 Discovery? “PCA before random forest can be useful not for dimensionality reduction but to give you data a shape where random forest can perform better. I am quite sure that in general if you transform your data with PCA keeping the same dimensionality of the original data you will have a better classification with random forest.” Answer: “Random forest struggles when the decision boundary is "diagonal" in the feature space because RF has to approximate that diagonal with lots of "rectangular" splits. To the extent that PCA re- orients the data so that splits perpendicular to the rotated & rescaled axes align well with the decision boundary, PCA will help. But there's no reason to believe that PCA will help in general, because not all decision boundaries are improved when rotated (e.g. a circle). And even if you do have a diagonal decision boundary, or a boundary that would be easier to find in a rotated space, applying PCA will only find that rotation by coincidence, because PCA has no knowledge at all about the classification component of the task (it is not "y-aware"). Also, the following caveat applies to all projects using PCA for supervised learning: data rotated by PCA may have little-to-no relevance to the classification objective.” DO NOT BELIEVE EVERYTHING THAT APPEARS ON THE WEB!!!!! BE CRITICAL!!!
  • 25. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 2510/7/2019 Further Developments. Paluszynska (2017) focuses on providing better information on variable importance using RF. RF is constantly being researched and improved.
  • 26. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 2610/7/2019
  • 27. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 2710/7/2019 Detour: Underlying idea for boosting classification models (NOT yet GB). (Freund, Schapire, 2012, Boosting: Foundations and Algorithms, MIT) Start with model M(X) and obtain 80% accuracy, or 60% R2, etc. Then Y = M(X) + error1. Hypothesize that the error is still correlated with Y ➔ model error1 as error1 = G(X) + error2, and model error2 …. In general error(t-1) = Z(X) + error(t) ➔ Y = M(X) + G(X) + ….. + Z(X) + error(t-k). If we find optimal beta weights to combine the models, then Y = b1*M(X) + b2*G(X) + …. + bt*Z(X) + error(t-k). Boosting is a “Forward Stagewise Ensemble method” with a single data set, iteratively reweighting observations according to previous error, especially focusing on wrongly classified observations. Philosophy: Focus on the most difficult points to classify in the previous step by reweighting observations.
  • 28. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 2810/7/2019 Main idea of GB using trees (GBDT). Let Y be target, X predictors such that f0(X) is a weak model to predict Y that just predicts the mean value of Y. “Weak” to avoid over-fitting. Improve on f0(X) by creating f1(X) = f0(X) + h(x). If h is a perfect model ➔ f1(X) = y ➔ h(x) = y - f0(X) = residuals = negative gradients of loss (or cost) function.
  [Figure: residual fitting; the negative gradient -(y – f(x)) takes values -1; 1 under 0-1-style loss.]
  • 29. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 2910/7/2019 Explanation of GB by way of example. (blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/) Predict age in the following data set by way of trees; continuous target ➔ regression tree. Predict age, loss function: SSE.

    PersonID  Age  LikesGardening  PlaysVideoGames  LikesHats
    1         13   FALSE           TRUE             TRUE
    2         14   FALSE           TRUE             FALSE
    3         15   FALSE           TRUE             FALSE
    4         25   TRUE            TRUE             TRUE
    5         35   FALSE           TRUE             TRUE
    6         49   TRUE            FALSE            FALSE
    7         68   TRUE            TRUE             TRUE
    8         71   TRUE            FALSE            FALSE
    9         73   TRUE            FALSE            TRUE
  • 30. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 3010/7/2019 Only 9 obs in data, thus allow Tree to have very small # obs in final nodes. We want the Videos variable because we suspect it’s important. But doing so (by allowing few obs in final nodes) also brought in a split on “hats”, which seems irrelevant and just noise leading to over-fitting, because the tree model searches in smaller and smaller areas of the data as it progresses. Let’s go in steps and look at the results of Tree1 (before second splits), stopping at the first split, where predictions are 19.25 and 57.2, and obtain residuals.
  [Figure: Tree1 — root split on LikesGardening (F ➔ 19.25, T ➔ 57.2), with further candidate splits on Hats and Videos below.]
  • 31. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 3110/7/2019 Run another tree using Tree1 residuals as new target.

    PersonID  Age  Tree1 Prediction  Tree1 Residual
    1         13   19.25             -6.25
    2         14   19.25             -5.25
    3         15   19.25             -4.25
    4         25   57.2              -32.2
    5         35   19.25             15.75
    6         49   57.2              -8.2
    7         68   57.2              10.8
    8         71   57.2              13.8
    9         73   57.2              15.8
  • 32. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 3210/7/2019
  [Figure: Tree2 — root split on PlaysVideoGames (F ➔ 7.133, T ➔ -3.567), fit to Tree1 residuals.]
  Note: Tree2 did not use “Likes Hats” because between Hats and VideoGames, videogames is preferred when using all obs, instead of in full Tree1 in a smaller region of the data where hats appear. And thus noise is avoided. Tree 1 SSE = 1994, Tree 2 SSE = 1765.

    PersonID  Age  Tree1 Pred  Tree1 Resid  Tree2 Pred  Combined Pred  Final Resid
    1         13   19.25       -6.25        -3.567      15.68           2.683
    2         14   19.25       -5.25        -3.567      15.68           1.683
    3         15   19.25       -4.25        -3.567      15.68           0.6833
    4         25   57.2        -32.2        -3.567      53.63          28.63
    5         35   19.25       15.75        -3.567      15.68          -19.32
    6         49   57.2        -8.2          7.133      64.33          15.33
    7         68   57.2        10.8         -3.567      53.63          -14.37
    8         71   57.2        13.8          7.133      64.33          -6.667
    9         73   57.2        15.8          7.133      64.33          -8.667

  Combined pred for PersonID 1: 15.68 = 19.25 – 3.567
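A small sketch reproducing this two-stage example (assumptions: Python with numpy and scikit-learn, data copied from the slides, depth-1 regression trees as weak learners):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

age    = np.array([13, 14, 15, 25, 35, 49, 68, 71, 73], dtype=float)
garden = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)   # LikesGardening
videos = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=float)   # PlaysVideoGames
hats   = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=float)   # LikesHats
X = np.column_stack([garden, videos, hats])

tree1 = DecisionTreeRegressor(max_depth=1).fit(X, age)        # splits on gardening
pred1 = tree1.predict(X)                                      # 19.25 / 57.2
resid1 = age - pred1

tree2 = DecisionTreeRegressor(max_depth=1).fit(X, resid1)     # splits on videos
combined = pred1 + tree2.predict(X)                           # e.g. 15.68 for person 1

print("Tree1 SSE:", round(np.sum(resid1 ** 2)))               # ~1994
print("Combined SSE:", round(np.sum((age - combined) ** 2)))  # ~1765
print("Combined prediction for person 1:", round(combined[0], 2))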
  • 33. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 3310/7/2019 So far: 1) Started with ‘weak’ model F0(x) = ȳ (the mean of y). 2) Fitted a second model to the residuals: h1(x) = y – F0(x). 3) Combined the two previous models: F1(x) = F0(x) + h1(x). Notice that h1(x) could be any type of model (stacking), not just trees. And continue recursing until M. The initial weak model was the “mean” because it is well known that the mean minimizes SSE. Q: how to choose M, the gradient boosting hyper-parameter? Usually cross-validation. 4) Alternative to mean: minimize absolute error instead of SSE as loss function. More expensive because the minimizer is the median, computationally expensive. In this case, in Tree 1 above, use median(y) = 35, and obtain residuals.

    PersonID  Age  F0  Residual0
    1         13   35  -22
    2         14   35  -21
    3         15   35  -20
    4         25   35  -10
    5         35   35    0
    6         49   35   14
    7         68   35   33
    8         71   35   36
    9         73   35   38
  • 34. Focus on observations 1 and 4 with respective residuals of -22 and -10 to understand the median case. Under the SSE loss function (standard tree regression), a reduction in residuals of 1 unit drops SSE by 43 and 19 resp. (e.g., 22 * 22 – 21 * 21, 100 - 81), while for absolute loss, the reduction is just 1 and 1 (22 – 21, 10 – 9) ➔ SSE reduction will focus more on the first observation because of 43, while absolute error focuses on all obs because they are all 1 ➔ Instead of training subsequent trees on residuals of F0, train h0 on the gradient of the loss function L(y, F0(x)) w.r.t. the ŷ produced by F0(x). With absolute error loss, subsequent h trees will consider the sign of every residual, as opposed to SSE loss, which considers the magnitude of the residual. Gradient of SSE loss ½(y – ŷ)² w.r.t. ŷ is –(y – ŷ), which is “– residual” ➔ this is a gradient descent algorithm. For Absolute Error: Each h tree groups observations into final nodes, and the average gradient can be calculated in each and scaled by a factor γ, such that Fm + γm hm minimizes the loss function in each node. Shrinkage: For each gradient step, the magnitude is multiplied by a factor between 0 and 1 called the learning rate ➔ each gradient step is shrunken, allowing slow convergence toward observed values ➔ observations close to target values end up grouped into larger nodes, thus regularizing the method. Finally, before each new tree step, row and column sampling occur to produce more different tree splits (similar to Random Forests).
  Absolute Error: AE = |y – ŷ|. Gradient of AE = dAE/dŷ = –1 if y > ŷ, +1 if y < ŷ (i.e., –sign(y – ŷ)).
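A tiny numpy illustration of these pseudo-residuals and the shrunken update (an illustrative sketch, not the slides' exact computation; the learning rate value is an assumption):

import numpy as np

y  = np.array([13., 14., 15., 25., 35., 49., 68., 71., 73.])
f0 = np.full_like(y, np.median(y))           # weak starting model, median = 35

# Negative gradients of the two loss functions w.r.t. the current predictions.
neg_grad_sse = y - f0                        # -d/df [ 0.5*(y-f)^2 ] = residual
neg_grad_abs = np.sign(y - f0)               # -d/df |y-f|           = sign of residual

learning_rate = 0.1                          # shrinkage factor in (0, 1]
# A weak learner h(x) would be fit to the chosen pseudo-residuals; here the
# pseudo-residuals are plugged in directly just to show the shrunken step.
f1_sse = f0 + learning_rate * neg_grad_sse
f1_abs = f0 + learning_rate * neg_grad_abs

print(neg_grad_sse)   # magnitudes matter under SSE
print(neg_grad_abs)   # only signs matter under absolute error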
  • 35. Results for SSE and Absolute Error: SSE case.
  [Figure: h0 — split on Gardening (F ➔ -21.08, T ➔ 16.87); h1 — split on Videos (F ➔ 7.133, T ➔ -3.567).]

    Age  F0     PseudoResidual0  h0      gamma0  F1     PseudoResidual1  h1      gamma1  F2
    13   40.33  -27.33           -21.08  1       19.25  -6.25            -3.567  1       15.68
    14   40.33  -26.33           -21.08  1       19.25  -5.25            -3.567  1       15.68
    15   40.33  -25.33           -21.08  1       19.25  -4.25            -3.567  1       15.68
    25   40.33  -15.33            16.87  1       57.2   -32.2            -3.567  1       53.63
    35   40.33   -5.333          -21.08  1       19.25  15.75            -3.567  1       15.68
    49   40.33    8.667           16.87  1       57.2   -8.2              7.133  1       64.33
    68   40.33   27.67            16.87  1       57.2   10.8             -3.567  1       53.63
    71   40.33   30.67            16.87  1       57.2   13.8              7.133  1       64.33
    73   40.33   32.67            16.87  1       57.2   15.8              7.133  1       64.33

  E.g., for the first observation: 40.33 is the mean age, -27.33 = 13 – 40.33, -21.08 is the prediction due to Gardening = F. F1 = 19.25 = 40.33 – 21.08. PseudoResidual1 = 13 – 19.25, F2 = 19.25 – 3.567 = 15.68. Gamma0 = avg(pseudoresidual0 / h0) (by different values of h0). Same for gamma1.
  • 36. Results for SSE and Absolute Error: Absolute Error case.
  [Figure: h0 — split on Gardening (F ➔ -1, T ➔ 0.6); h1 — split on Videos (F ➔ 0.333, T ➔ -0.333).]

    Age  F0  PseudoResidual0  h0    gamma0  F1    PseudoResidual1  h1       gamma1  F2
    13   35  -1               -1    20.5    14.5  -1               -0.3333  0.75    14.25
    14   35  -1               -1    20.5    14.5  -1               -0.3333  0.75    14.25
    15   35  -1               -1    20.5    14.5   1               -0.3333  0.75    14.25
    25   35  -1                0.6  55      68    -1               -0.3333  0.75    67.75
    35   35  -1               -1    20.5    14.5   1               -0.3333  0.75    14.25
    49   35   1                0.6  55      68    -1                0.3333  9       71
    68   35   1                0.6  55      68    -1               -0.3333  0.75    67.75
    71   35   1                0.6  55      68     1                0.3333  9       71
    73   35   1                0.6  55      68     1                0.3333  9       71

  E.g., for the 1st observation: 35 is the median age; the pseudo-residual is -1 if the residual < 0, +1 if the residual > 0. F1 = 14.5 because 35 + 20.5 * (-1). F2 = 14.25 = 14.5 + 0.75 * (-0.3333). Predictions within leaf nodes computed by “mean” of obs therein. Gamma0 = median((age – F0) / h0) = avg((14 – 35) / -1; (15 – 35) / -1) = 20.5; 55 = (68 – 35) / 0.6. Gamma1 = median((age – F1) / h1) by different values of h1 (and of h0 for gamma0).
  • 37. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 3710/7/2019 Quick description of GB using trees (GBDT). 1) Create a very small tree as initial model, a ‘weak’ learner (e.g., tree with two terminal nodes ➔ depth = 1). ‘WEAK’ avoids over-fitting and local minima, and predicts, F1, for each obs. Tree1. 2) Each tree allocates a probability of event or a mean value in each terminal node, according to the nature of the dependent variable or target. 3) Compute “residuals” (prediction error) for every observation (if 0-1 target, apply logistic transformation to linearize them, i.e., odds = p / (1 – p)). 4) Use residuals as the new ‘target’ variable and grow a second small tree on them (second stage of the process, same depth). To ensure against over-fitting, use a random sample without replacement (➔ “stochastic gradient boosting”). Tree2. 5) New model: once the second stage is complete, we obtain a concatenation of two trees, Tree1 and Tree2, and predictions F1 + F2 * gamma, gamma a multiplier or shrinkage factor (called step size in gradient descent). 6) Iterate the procedure of computing residuals from the most recent tree, which become the target of the new model, etc. 7) In the case of a binary target variable, each tree produces at least some nodes in which ‘event’ is the majority (‘events’ are typically more difficult to identify since most data sets contain a very low proportion of ‘events’ in the usual case). 8) Final score for each observation is obtained by summing (with weights) the different scores (probabilities) of every tree for each observation. Why does it work? Why “gradient” and “boosting”?
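A compact sketch of this stochastic GBDT loop (assumptions: Python with numpy and scikit-learn, squared-error loss so the pseudo-residuals are plain residuals, synthetic regression data, and arbitrary hyper-parameter values):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)

M, depth, learning_rate, subsample = 100, 2, 0.1, 0.6
F = np.full(len(y), y.mean())            # F0: weak constant model
trees = []

for m in range(M):
    residual = y - F                                    # negative gradient (SSE)
    rows = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
    tree = DecisionTreeRegressor(max_depth=depth, random_state=m)
    tree.fit(X[rows], residual[rows])                   # small tree on residuals
    F = F + learning_rate * tree.predict(X)             # shrunken update
    trees.append(tree)

print("final training MSE:", np.mean((y - F) ** 2))

def predict(Xnew):
    # Score new data: start from the constant and add the shrunken trees.
    out = np.full(len(Xnew), y.mean())
    for t in trees:
        out += learning_rate * t.predict(Xnew)
    return out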
  • 38. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 3810/7/2019 Comparing GBDT vs Trees in point 4 above (I). GBDT takes a sample from the training data to create a tree at each iteration, CART does not. Below, notice the differences between a sample proportion of 60% for GBDT and no sampling for generic trees on the fraud data set; Total_spend is the target. Predictions are similar.

    IF doctor_visits < 8.5 THEN DO;    /* GBDT */
       _prediction_ + -1208.458663;
    END;
    ELSE DO;
       _prediction_ + 1360.7910083;
    END;

    IF 8.5 <= doctor_visits THEN DO;   /* GENERIC TREES */
       P_pseudo_res0 = 1378.74081896893;
    END;
    ELSE DO;
       P_pseudo_res0 = -1290.94575707227;
    END;
  • 39. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 3910/7/2019 Comparing GBDT vs Trees in point 4 above (II). Again, GBDT takes a sample from the training data to create a tree at each iteration, CART does not. If we allow CART to work with the same proportion sample but a different seed, splitting variables may be different at a specific depth of tree creation. EDA of the two samples would indicate subtle differences that induce differences in selected splitting variables.

    /* GBDT */
    IF doctor_visits < 8.5 THEN DO;
       _ARB_F_ + -579.8214325;
    END;
    ELSE DO;
       _ARB_F_ + 701.49142697;
    END;

    /* ORIGINAL TREES */
    IF 183.5 <= member_duration THEN DO;
       P_pseudo_res0 = 1677.87318718526;
    END;
    ELSE DO;
       P_pseudo_res0 = -1165.32773940565;
    END;
  • 40. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 4010/7/2019 More Details. Friedman’s general 2001 GB algorithm: 1) Data (Y, X), Y (N, 1), X (N, p). 2) Choose # iterations M. 3) Choose loss function L(Y, F(x), Error) and corresponding gradient, i.e., 0-1 loss function, and residuals are the corresponding gradient. Function called ‘f’. Loss f implied by Y. 4) Choose base learner h(X, θ), say shallow trees.
  Algorithm:
  1: initialize f0 with a constant, usually the mean of Y.
  2: for t = 1 to M do
  3:   compute negative gradient gt(x), i.e., residual from Y as next target.
  4:   fit a new base-learner function h(x, θt), i.e., tree.
  5:   find the best gradient-descent step size γt > 0 that minimizes the loss:
         γt = argmin over γ of Σ (i = 1..n) L(yi, f(t-1)(xi) + γ ht(xi))
  6:   update the function estimate: ft(x) = f(t-1)(x) + γt ht(x, θt)
  7: end for
  (all f functions are function estimates, i.e., ‘hats’).
  • 41. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 4110/7/2019 Specifics of Tree Gradient Boosting, called TreeBoost (Friedman). Friedman’s 2001 GB algorithm for tree methods: same as the previous one, with
    ht(x) = Σ (j = 1..J) pjt I(x ∈ Njt),
  where pjt is the prediction of tree t in final node Njt. In TreeBoost Friedman proposes to find an optimal γjt in each final node instead of a unique γt at every iteration. Then
    ft(x) = f(t-1)(x) + Σ (j = 1..J) γjt I(x ∈ Njt),
    γjt = argmin over γ of Σ (xi ∈ Njt) L(yi, f(t-1)(xi) + γ ht(xi)).
  • 42. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 4210/7/2019 Parallels with Stepwise (regression) methods. Stepwise starts from original Y and X, and in later iterations turns to residuals, and reduced and orthogonalized X matrix, where ‘entered’ predictors are no longer used and orthogonalized away from other predictors. GBDT uses residuals as targets, but does not orthogonalize or drop any predictors. Stepwise stops either by statistical inference, or AIC/BIC search. GBDT has a fixed number of iterations. Stepwise has no ‘gamma’ (shrinkage factor).
  • 43. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 4310/7/2019 Setting. *** Hypothesize existence of function Y = f (X, betas, error). Change of paradigm, no MLE (e.g., logistic, regression, etc) but loss function. Minimize Loss function itself, its expected value called risk. Many different loss functions available, gaussian, 0-1, etc. A loss function describes the loss (or cost) associated with all possible decisions. Different decision functions or predictor functions will tend to lead to different types of mistakes. The loss function tells us which type of mistakes we should be more concerned about. For instance, estimating demand, decision function could be linear equation and loss function could be squared or absolute error. The best decision function is the function that yields the lowest expected loss, and the expected loss function is itself called risk of an estimator. 0-1 assigns 0 for correct prediction, 1 for incorrect.
  • 44. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6510/7/2019 Key Details. *** Friedman’s 2001 GB algorithm: Need 1) Loss function (usually determined by nature of Y (binary, continuous…)) (NO MLE). 2) Weak learner, typically tree stump or spline, marginally better classifier than random (but by how much?). 3) Model with T iterations:
    ŷi = Σ (t = 1..T) treet(Xi)
    Objective function: Σ (i = 1..n) L(yi, ŷi) + Σ (t = 1..T) Ω(Treet),
    Ω = {# nodes in each tree; L2 or L1 norm of leaf weights; other}.
  The Ω function is not directly optimized by GB.
  • 45. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 4510/7/2019 L2-error penalizes symmetrically away from 0, Huber penalizes less than OLS away from [-1, 1], Bernoulli and Adaboost are very similar. Note that Y ∈ [-1, 1] in the 0-1 case here.
  • 46. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 4610/7/2019
  • 47. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 4710/7/2019 Gradient Descent. “Gradient” descent method to find minimum of function. Gradient: multivariate generalization of derivative of function in one dimension to many dimensions. I.e., gradient is vector of partial derivatives. In one dimension, gradient is tangent to function. Easier to work with convex and “smooth” functions.
  [Figure: examples of a convex and a non-convex function.]
  • 48. Gradient Descent. Let L(x1, x2) = 0.5 * (x1 – 15)**2 + 0.5 * (x2 – 25)**2, and solve for the x1 and x2 that minimize L by gradient descent. Steps: Take M = 100. Starting point s0 = (0, 0). Step size γ = 0.1. Iterate m = 1 to M: 1. Calculate the gradient of L at s(m-1). 2. Step in the direction of greatest descent (negative gradient) with step size γ, i.e., sm = s(m-1) – γ ∇L(s(m-1)). If γ is small and M large, sm minimizes L. Additional considerations: Instead of M iterations, stop when the next improvement is small. Use line search to choose step sizes (line search chooses the step length along the descent direction). How does it work in gradient boosting? Objective is min L, starting from F0(x). For m = 1, compute the gradient of L w.r.t. F0(x). Then fit a weak learner to the gradient components ➔ for a regression tree, obtain the average gradient in each final node. In each node, step in the direction of the avg. gradient using line search to determine the step magnitude. Outcome is F1, and repeat. In symbols: Initialize the model with a constant: F0(x) = mean, median, etc. For m = 1 to M: compute the pseudo-residual; fit base learner h to the residuals; compute step magnitude gamma m (for trees, a different gamma for each node); update Fm(x) = F(m-1)(x) + γm hm(x).
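A direct numpy implementation of this toy problem (minimal sketch; values taken from the slide):

import numpy as np

def grad_L(s):
    # gradient of L(x1, x2) = 0.5*(x1 - 15)^2 + 0.5*(x2 - 25)^2
    return np.array([s[0] - 15.0, s[1] - 25.0])

s = np.array([0.0, 0.0])       # starting point s0
step = 0.1                     # step size gamma
for m in range(100):           # M = 100 iterations
    s = s - step * grad_L(s)   # step along the negative gradient

print(s)                       # close to the minimizer (15, 25)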
  • 49. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 4910/7/2019 “Gradient” descent Method of gradient descent is a first order optimization algorithm that is based on taking small steps in direction of the negative gradient at one point in the curve in order to find the (hopefully global) minimum value (of loss function). If it is desired to search for the maximum value instead, then the positive gradient is used and the method is then called gradient ascent. Second order not searched, solution could be local minimum. Requires starting point, possibly many to avoid local minima.
  • 50. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 5010/7/2019
  • 51. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-5110/7/2019 Comparing full tree (depth = 6) to boosted tree residuals by iteration. 2 GB versions: 1) with raw 20% events (M1), 2) with 50/50 mixture of events (M2). The non-GB tree (referred to as maxdepth 6 for the M1 data set) is the most biased. Notice that M2 stabilizes earlier than M1. X axis: iteration #. Y axis: average residual. “Tree Depth 6” is obviously unaffected by iteration since it is a single tree run.
  [Figure: "Avg residuals by iteration by model names in gradient boosting" — MEAN_RESID_M1_TRN_TREES ≈ 1.6E-15, MEAN_RESID_M2_TRN_TREES ≈ -2.9E-16, Tree depth 6 ≈ 2.83E-15; vertical line marks where the mean stabilizes.]
  • 52. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 5210/7/2019 Comparing full tree (depth = 6) to boosted tree residuals by iteration. Now Y = variance of residuals. M2 has the highest variance, followed by Depth 6 (single tree) and then M1. M2 stabilizes earlier as well. In conclusion, M2 has lower bias and higher variance than M1 in this example; the difference lies in the mixture of 0-1 in the target variable.
  [Figure: "Variance of residuals by iteration in gradient boosting" — VAR_RESID_M1_TRN_TREES ≈ 0.122, VAR_RESID_M2_TRN_TREES ≈ 0.178, single tree Depth 6 = 0.145774; vertical line marks where the variance stabilizes.]
  • 53. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 5310/7/2019
  • 54. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 5410/7/2019 Important Message. Basic information on the original data sets:
    Data set name ............... train       # TRN obs ....... 3595
    Validation data set ......... validata    # VAL obs ....... 2365
    Test data set ...............             # TST obs ....... 0
    Dep variable ................ fraud
    Pct Event Prior TRN ......... 20.389
    Pct Event Prior VAL ......... 19.281
    Pct Event Prior TEST ........
  TRN and VAL data sets obtained by random sampling without replacement.
  • 55. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 5510/7/2019
    Variable           Label
    FRAUD              Fraudulent Activity yes/no
    total_spend        Total spent on opticals
    doctor_visits      Total visits to a doctor
    no_claims          No of claims made recently
    member_duration    Membership duration
    optom_presc        Number of opticals claimed
    num_members        Number of members covered

  Fraud data set, original 20% fraudsters. Study alternatives of changing number of iterations from 3 to 50 and depth from 1 to 10 with training and validation data sets. Original percentage of fraudsters 20% in both data sets. Notice just 5 predictors, thus max number of iterations of 50 is an exaggeration. In usual large databases, number of iterations could reach 1000 or higher.
  • 56. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 5610/7/2019 E.g., M5_VAL_GRAD_BOOSTING: M5 case with validation data set and using gradient boosting as modeling technique, model # 10 as identifier. Requested Models: Names & Descriptions.

  Overall Models:
    M1  Raw 20pct, depth 1,  iterations 3
    M2  Raw 20pct, depth 1,  iterations 10
    M3  Raw 20pct, depth 5,  iterations 3
    M4  Raw 20pct, depth 5,  iterations 10
    M5  Raw 20pct, depth 10, iterations 50

    Model #   Full Model Name             Model Description
    1         01_M1_TRN_GRAD_BOOSTING     Gradient Boosting
    2         02_M1_VAL_GRAD_BOOSTING     Gradient Boosting
    3         03_M2_TRN_GRAD_BOOSTING     Gradient Boosting
    4         04_M2_VAL_GRAD_BOOSTING     Gradient Boosting
    5         05_M3_TRN_GRAD_BOOSTING     Gradient Boosting
    6         06_M3_VAL_GRAD_BOOSTING     Gradient Boosting
    7         07_M4_TRN_GRAD_BOOSTING     Gradient Boosting
    8         08_M4_VAL_GRAD_BOOSTING     Gradient Boosting
    9         09_M5_TRN_GRAD_BOOSTING     Gradient Boosting
    10        10_M5_VAL_GRAD_BOOSTING     Gradient Boosting
  • 57. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 5710/7/2019
  • 58. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 5810/7/2019 All agree on No_claims as the first split, but not at the same values, and they yield different event probabilities.
  • 59. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 5910/7/2019 Note M2 split
  • 60. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6010/7/2019 Constrained GB parameters may create undesirable models but GB parameters with high values (gamma, iterations) may lead to running times that are too long, especially when models have to be re-touched.
  • 61. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6110/7/2019
  • 62. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6210/7/2019 Variable importance is model dependent and could lead to misleading conclusions.
  • 63. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6310/7/2019
  • 64. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6410/7/2019 Goodness Of Fit.
  • 65. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6510/7/2019 Prediction probability increases with model complexity. Also, better discrimination with M5. Fixed probability bins vs. lift table with equal observation numbers per bin.
  • 66. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6610/7/2019
  • 67. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6710/7/2019 M5 best per AUROC, also when validated.
  • 68. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6810/7/2019
  • 69. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 6910/7/2019
  • 70. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7010/7/2019 Specific GOFs In rank order.
  • 71. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7110/7/2019 GOF ranks (rank by GOF measure), training data:

    Model Name                 AUROC  Avg Square Error  Cum Lift 3rd bin  Cum Resp Rate 3rd  Gini  Rsquare Cramer Tjur  Unw. Mean
    01_M1_TRN_GRAD_BOOSTING    5      5                 5                 5                  5     5                    5.00
    03_M2_TRN_GRAD_BOOSTING    4      4                 4                 4                  4     4                    4.00
    05_M3_TRN_GRAD_BOOSTING    3      3                 3                 3                  3     3                    3.00
    07_M4_TRN_GRAD_BOOSTING    2      2                 2                 2                  2     2                    2.00
    09_M5_TRN_GRAD_BOOSTING    1      1                 1                 1                  1     1                    1.00

  GOF ranks (rank by GOF measure), validation data:

    Model Name                 AUROC  Avg Square Error  Cum Lift 3rd bin  Cum Resp Rate 3rd  Gini  Rsquare Cramer Tjur  Unw. Mean
    02_M1_VAL_GRAD_BOOSTING    5      5                 5                 5                  5     5                    5.00
    04_M2_VAL_GRAD_BOOSTING    4      4                 4                 4                  4     4                    4.00
    06_M3_VAL_GRAD_BOOSTING    3      3                 3                 3                  3     3                    3.00
    08_M4_VAL_GRAD_BOOSTING    2      2                 2                 2                  2     2                    2.00
    10_M5_VAL_GRAD_BOOSTING    1      1                 1                 1                  1     1                    1.00
  • 72. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7210/7/2019 M5 winner.
  • 73. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7310/7/2019 Huge jump in performance Per R-square measure.
  • 74. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7410/7/2019
  • 75. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7510/7/2019 Overall conclusion for GB parameters: While higher values of number of iterations and depth imply longer (and possibly significantly longer) computer runs, constraining these parameters can have significant negative effects on model results. In the context of thousands of predictors, computer resource availability might significantly affect model results.
  • 76. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7610/7/2019
  • 77. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7710/7/2019 Overall Ensembles. Given specific classification study and many different modeling techniques, create logistic regression model with original target variable and the different predictions from the different models, without variable selection (this is not critical). Method also called Platt’s calibration or scaling (Platt, 2000), although used in context of smoothing posterior probabilities, and not ensembling models. Evaluate importance of different models either via p-values or partial dependency plots. Note: It is not Stacking, because Stacking “votes” to decide on final classification. Additional ‘easier’ ways to create ensembles: mean, median, etc. of predictive probabilities from different models.
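A sketch of this "overall ensemble" idea (assumptions: Python with numpy and scikit-learn, synthetic data, an arbitrary trio of base models): fit several models, then regress the original target on their predicted probabilities with a logistic regression, without variable selection.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

models = {
    "gb": GradientBoostingClassifier(random_state=0).fit(X_trn, y_trn),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0).fit(X_trn, y_trn),
    "logit": LogisticRegression(max_iter=1000).fit(X_trn, y_trn),
}

# Columns = predicted event probabilities from each model.
P_trn = np.column_stack([m.predict_proba(X_trn)[:, 1] for m in models.values()])
P_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in models.values()])

ensemble = LogisticRegression().fit(P_trn, y_trn)       # the ensembling model
print("ensemble coefficients per model:", dict(zip(models, ensemble.coef_[0])))
print("validation accuracy:", ensemble.score(P_val, y_val))

# 'Easier' alternatives mentioned above: mean or median of the probabilities.
mean_prob = P_val.mean(axis=1)
print("mean-probability accuracy:", np.mean((mean_prob > 0.5) == y_val))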
  • 78. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7810/7/2019
  • 79. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 7910/7/2019 Partial Dependency Plots (PDP). Due to GB’s (and other methods’) black-box nature, and to put them on a similar footing to linear methods (linear and logistic regressions), these plots show the effect of one predictor on the modeled response, averaging over the other covariates. Graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome. Let F be the model of the target on the X vars and F* the model predictions. Then: the PDP of F(x1 / x2, …, xp) on x1 is avg(F(x1 / x2, …, xp)) over the data for every point of x1. Note that x2, …, xp are not measured at their means, but their effects are summarized by the average. PDPs could also be measured at medians, etc. LR evaluates the coefficient of X1 assuming all other vars frozen, in fact uncorrelated, which is formally wrong. Also possible to obtain the PDP of F(x1, x2 / x3, …, xp), i.e., pairs of variables. Since GB, Boosting, Bagging, etc. are BLACK BOX models, use PDP to obtain model interpretation. Also useful for regression-based models.
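A minimal partial-dependence sketch (assumptions: Python with numpy and scikit-learn, synthetic data, a GB classifier as the black-box model): for each grid value of one predictor, overwrite that column for all rows, score the model, and average the predictions, so the other covariates keep their observed values.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

def partial_dependence(model, X, col, grid_points=20):
    grid = np.linspace(X[:, col].min(), X[:, col].max(), grid_points)
    pdp = []
    for v in grid:
        Xtmp = X.copy()
        Xtmp[:, col] = v                       # force x1 = v for every row
        pdp.append(model.predict_proba(Xtmp)[:, 1].mean())
    return grid, np.array(pdp)

grid, pdp = partial_dependence(model, X, col=0)
for v, p in zip(grid[:5], pdp[:5]):            # first few grid points
    print(f"x1 = {v:6.2f}  ->  avg predicted P(event) = {p:.3f}")

(scikit-learn also ships ready-made partial-dependence utilities in sklearn.inspection, which can replace this hand-rolled loop.)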
  • 80. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 8010/7/2019 References
  Christodoulou E., Ma J., Collins G.S., Steyerberg E.W., Verbakel J.Y., van Calster B. (2019): A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol, 2019 Feb 11, pii: S0895-4356(18)31081-3. doi:10.1016/j.jclinepi.2019.02.004.
  Platt J. (2000): Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3): 61–74.
  • 81. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 8110/7/2019