Ensemble models and
Gradient Boosting.
Leonardo Auslender
Independent Statistical Consultant
Leonardo.Auslender ‘at’
Gmail ‘dot’ com.
Copyright 2018.
Topics to cover:
1) Why more techniques? Bias-variance tradeoff.
2) Ensembles
   1) Bagging – stacking
   2) Random Forests
3) Gradient Boosting (GB)
4) Gradient-descent optimization method.
5) Innards of GB and example.
6) Overall Ensembles.
7) Partial Dependency Plots (PDP)
8) Case Study.
9) Xgboost
10) On the practice of Ensembles.
11) References.
1) Why more techniques? Bias-variance tradeoff.
(Broken clock is right twice a day, variance of estimation = 0, bias extremely high.
Thermometer is accurate overall, but reports higher/lower temperatures at night. Unbiased,
higher variance. Betting on same horse always has zero variance, possibly extremely biased).
Model error can be broken down into three components mathematically. Let f be the true function being estimated and f-hat the empirically derived (estimated) function.
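Written out, the standard decomposition (a sketch, assuming squared-error loss and Y = f(X) + ε with noise variance σ² independent of X):

\[
\mathbb{E}\big[(Y-\hat f(X))^2\big]
 = \underbrace{\big(\mathbb{E}[\hat f(X)]-f(X)\big)^2}_{\text{bias}^2}
 + \underbrace{\mathbb{E}\big[(\hat f(X)-\mathbb{E}[\hat f(X)])^2\big]}_{\text{variance}}
 + \underbrace{\sigma^2}_{\text{irreducible error}}
\]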
(Figure quadrants: bet on the right horse and win; bet on the wrong horse and lose; bet on many horses and win; bet on many horses and lose.)
Credit : Scott Fortmann-Roe (web)
Let X1, X2, X3, ... be i.i.d. random variables with mean μ and variance σ². It is well known that E(X̄) = μ and Var(X̄) = σ²/n: by just averaging estimates we lower the variance while keeping the same bias.
Let us find methods that lower (or at least stabilize) the variance while keeping the bias low, and maybe also lower the bias.
And since this cannot be fully attained, we are still searching for more techniques.
Minimize a general objective function:

   Obj(Θ) = L(Θ) + Ω(Θ),

where L(Θ) is the loss function (minimizing it reduces bias), Ω(Θ) is the regularization term (minimizing it reduces model complexity), and Θ = {w1, ..., wp} is the set of model parameters.
Some terminology for Model combinations.
Ensembles: general name
Prediction/forecast combination: focusing on just
outcomes
Model combination for parameters:
Bayesian parameter averaging
We focus on ensembles as Prediction/forecast
combinations.
Ensembles.
Bagging (bootstrap aggregation, Breiman, 1996): adding randomness improves function estimation. It is a variance reduction technique, reducing
MSE. Let the initial data have n observations.
1) Construct bootstrap sample by randomly drawing n times with replacement
(note, some observations repeated).
2) Compute sample estimator (logistic or regression, tree, ANN … Tree in
practice).
3) Redo B times, B large (50 – 100 or more in practice, but unknown).
4) Bagged estimator. For classification, Breiman recommends a majority vote of the classifications for each observation; Buhlmann (2003) recommends averaging the bootstrapped probabilities. Note that an individual observation need not appear in all B bootstrap samples.
NB: Independent sequence of trees. What if …….?
Reduces prediction error by lowering variance of aggregated predictor
while maintaining bias almost constant (variance/bias trade-off).
Friedman (1998) reconsidered boosting and bagging in terms of
gradient descent algorithms, seen later on.
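A minimal sketch of steps 1)-4) in Python (assumptions: scikit-learn trees as the base learner, numpy arrays X and y, a binary target, and Buhlmann-style averaging of the bootstrapped probabilities):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, random_state=0):
    rng = np.random.RandomState(random_state)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.randint(0, n, size=n)              # draw n times with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagged_proba(models, X):
    # average the bootstrapped event probabilities over the B trees
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)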
From http://dni-institute.in/blogs/bagging-algorithm-concepts-with-example/
Ensembles
Evaluation:
Empirical studies: boosting (seen later) attains smaller misclassification rates than bagging, with a reduction of both bias and variance. Different boosting algorithms exist (Breiman's arc-x4 and arc-gv). In cases with substantial noise, bagging performs better.
Especially used in clinical studies.
Why does Bagging work?
Breiman: bagging successful because reduces instability of
prediction method. Unstable: small perturbations in the data lead to large changes in the predictor. Experimental results show variance
reduction. Studies suggest that bagging performs some
smoothing on the estimates. Grandvalet (2004) argues that
bootstrap sampling equalizes effects of highly influential
observations.
Disadvantage: cannot be visualized easily.
Ensembles
Adaptive Bagging (Breiman, 2001): Mixes bias-reduction boosting
with variance-reduction bagging. Uses out-of-bag obs to halt
optimizer.
Stacking:
Previously, same technique used throughout. Stacking (Wolpert 1992)
combines different algorithms on single data set. Voting is then
used for final classification. Ting and Witten (1999) “stack” the
probability distributions (PD) instead.
Stacking is “meta-classifier”: combines methods.
Pros: takes the best from many methods. Cons: un-interpretable; the mixture of methods becomes a black box of predictions.
Stacking very prevalent in WEKA.
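A hedged sketch of stacking with scikit-learn (its StackingClassifier combines the base methods through a meta-classifier, here a logistic regression, close in spirit to Ting and Witten's stacking of probability distributions; the base learners below are illustrative only):

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("rf", RandomForestClassifier(n_estimators=100)),
                ("logit", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(),   # the meta-classifier combining the methods
    cv=5)                                   # out-of-fold predictions feed the meta level
# stack.fit(X_train, y_train); stack.predict_proba(X_test)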
5.3) Tree World.
5.3.1) L. Breiman: Bagging.
5.3.2) L. Breiman: Random Forests.
Explanation by way of football example for The Saints.
https://gormanalysis.com/random-forest-from-top-to-bottom/
   Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin  SaintsWon
1  Falcons    28    TRUE          TRUE            TRUE            TRUE
2  Cowgirls   16    TRUE          TRUE            TRUE            TRUE
3  Eagles     30    FALSE         FALSE           TRUE            TRUE
4  Bucs        6    TRUE          FALSE           TRUE            FALSE
5  Bucs       14    TRUE          FALSE           FALSE           FALSE
6  Panthers    9    FALSE         TRUE            TRUE            FALSE
7  Panthers   18    FALSE         FALSE           FALSE           FALSE
Goal: predict when the Saints will win. 5 predictors: opponent, opponent rank, home game, expert1 and expert2 predictions. A single tree makes just one split, on Opponent, because the Saints lost to the Bucs and Panthers and that split gives perfect separation; but it is useless for future opponents. Instead, at each step randomly select a subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which, when combined, should form a smart model.
Three example trees (figure), built from splits such as: OppRk <= 15 (left) / > 15 (right); Opponent in {Cowgirls, Eagles, Falcons}; Expert2 prediction (F = left, T = right); OppRk <= 12.5 (left) / > 12.5 (right); Opponent in {Cowgirls, Eagles, Falcons} (left).
Assume the following test data and predictions:

Test data:
   Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin
1  Falcons    1     TRUE          TRUE            TRUE
2  Falcons    32    TRUE          TRUE            FALSE
3  Falcons    32    TRUE          FALSE           TRUE

Predictions:
Obs      Tree1  Tree2  Tree3  MajorityVote
Sample1  FALSE  FALSE  TRUE   FALSE
Sample2  TRUE   FALSE  TRUE   TRUE
Sample3  TRUE   TRUE   TRUE   TRUE
Note that a probability can be ascribed by counting the # of votes for each predicted target class, which yields a good probability ranking for the different classes. But there is a problem: if OppRk (the 2nd best predictor) is in the initial group of 3 together with Opponent, it won't be used as a splitter because Opponent is perfect. There are 10 ways to choose 3 out of 5 predictors, and each predictor appears in 6 of them, so Opponent dominates 60% of the trees, while OppRk appears without Opponent in just 30% of them. We could mitigate this effect by also sampling the training observations used to develop the model, giving OppRk a higher chance to be the root (not shown).
Further, assume that Expert2 gives perfect predictions when the Saints lose (not when they win). Right now, Expert2 as a predictor is lost, but if resampling is done with replacement, there is a higher chance of using Expert2 as a predictor because more losses might appear in a given sample.
Summary:
Data with N rows and p predictors:
1) Determine the # of trees to grow.
2) For each tree:
   - randomly sample n <= N rows with replacement;
   - create a tree with m <= p predictors selected randomly at each non-final node.
3) Combine the different tree predictions by majority voting (classification trees) or averaging (regression trees). Note that voting can be replaced by averaging probabilities, and averaging by medians.
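As a sketch, the same recipe through scikit-learn's implementation (parameter values are illustrative only):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,       # 1) number of trees to grow
    bootstrap=True,         # 2) sample n <= N rows with replacement per tree
    max_features="sqrt",    #    m <= p predictors considered at each node
    n_jobs=-1,
    random_state=0)
# rf.fit(X_train, y_train); rf.predict_proba(X_test)  # averaged votes / probabilities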
Definition of Random Forests.
Decision Tree Forest: ensemble (collection) of decision trees whose
predictions are combined to make overall prediction for the forest.
Similar to TreeBoost (Gradient boosting) model because large number
of trees are grown. However, TreeBoost generates series of trees with
output of one tree going into next tree in series. In contrast, decision
tree forest grows number of independent trees in parallel, and they do
not interact until after all of them have been built.
Disadvantage: complex model, cannot be visualized like single tree.
More “black box” like neural network  advisable to create both single-
tree and tree forest model.
Single-tree model can be studied to get intuitive understanding of how
predictor variables relate, and decision tree forest model can be used
to score data and generate highly accurate predictions.
Random Forests
1. Random sample of N observations with replacement (“bagging”).
On average, about 2/3 of rows selected. Remaining 1/3 called “out
of bag (OOB)” obs. New random selection is performed for each
tree constructed.
2. Using obs selected in step 1, construct decision tree. Build tree to
maximum size, without pruning. As tree is built, allow only subset of
total set of predictor variables to be considered as possible splitters
for each node. Select set of predictors to be considered as random
subset of total set of available predictors.
For example, if there are ten predictors, choose five randomly as
candidate splitters. Perform new random selection for each split. Some
predictors (possibly best one) will not be considered for each split, but
predictor excluded from one split may be used for another split in same
tree.
Random Forests
No Overfitting or Pruning.
"Over-fitting“: problem in large, single-tree models where model fits
noise in data  poor generalization power  pruning. In nearly all
cases, decision tree forests do not have problem with over-fitting, and no
need to prune trees in forest. Generally, more trees in forest, better fit.
Internal Measure of Test Set (Generalization) Error .
About 1/3 of observations excluded from each tree in forest, called “out
of bag (OOB)”: each tree has different set of out-of-bag observations 
each OOB set constitutes independent test sample.
To measure generalization error of decision tree forest, OOB set for each
tree is run through tree and error rate of prediction is computed.
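A short sketch of that OOB estimate (assuming scikit-learn, whose oob_score_ attribute is computed from each tree's out-of-bag observations):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, bootstrap=True, oob_score=True,
                            random_state=0)
# rf.fit(X_train, y_train)
# rf.oob_score_   # internal generalization accuracy estimate, no separate test set needed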
Detour: Found in the Internet: PCA and RF.
https://stats.stackexchange.com/questions/294791/
how-can-preprocessing-with-pca-but-keeping-the-same-dimensionality-improve-rando
?newsletter=1&nlcode=348729%7c8657
Discovery?
“PCA before random forest can be useful not for dimensionality reduction but to give you data
a shape where random forest can perform better.
I am quite sure that in general if you transform your data with PCA keeping the same
dimensionality of the original data you will have a better classification with random forest.”
Answer:
“Random forest struggles when the decision boundary is "diagonal" in the feature space
because RF has to approximate that diagonal with lots of "rectangular" splits. To the extent that
PCA re-orients the data so that splits perpendicular to the rotated & rescaled axes align well
with the decision boundary, PCA will help. But there's no reason to believe that PCA will help in
general, because not all decision boundaries are improved when rotated (e.g. a circle). And
even if you do have a diagonal decision boundary, or a boundary that would be easier to find in
a rotated space, applying PCA will only find that rotation by coincidence, because PCA has no
knowledge at all about the classification component of the task (it is not "y-aware").
Also, the following caveat applies to all projects using PCA for supervised learning: data rotated by
PCA may have little-to-no relevance to the classification objective.”
DO NOT BELIEVE EVERYTHING THAT APPEARS IN THE WEB!!!!! BE CRITICAL!!!
Further Developments.
Paluszynska (2017) focuses on providing better information
on variable importance using RF.
RF is constantly being researched and improved.
Detour: Underlying idea for boosting classification models (NOT yet GB).
(Freund, Schapire, 2012, Boosting: Foundations and Algorithms, MIT)
Start with model M(X) and obtain 80% accuracy, or 60% R2, etc.
Then Y = M(X) + error1. Hypothesize that the error is still correlated with Y, so model error1 in turn: error1 = G(X) + error2, and in general error(t-1) = Z(X) + error(t). Then
Y = M(X) + G(X) + ... + Z(X) + error(t-k). If we find optimal beta weights to combine the models, then
Y = b1 M(X) + b2 G(X) + ... + bt Z(X) + error(t-k).
Boosting is “Forward Stagewise Ensemble method” with single data set,
iteratively reweighting observations according to previous error, especially focusing on
wrongly classified observations.
Philosophy: Focus on most difficult points to classify in previous step by
reweighting observations.
Main idea of GB using trees (GBDT).
Let Y be the target and X the predictors, and let f0(X) be a weak model that predicts Y by just predicting the mean value of Y ("weak" to avoid over-fitting).
Improve on f0(X) by creating f1(X) = f0(X) + h(x). If h were a perfect model, then f1(X) = y, so h(x) = y - f0(X) = residuals = negative gradients of the loss (or cost) function.
(Figure: residual fitting; the gradient of squared-error loss is -(y - f(x)), while for absolute-error loss it is ±1.)
Explanation of GB by way of example.
/blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
Predict age in the following data set by way of trees; continuous target, so a regression tree.
Loss function: SSE.
PersonID  Age  LikesGardening  PlaysVideoGames  LikesHats
1         13   FALSE           TRUE             TRUE
2         14   FALSE           TRUE             FALSE
3         15   FALSE           TRUE             FALSE
4         25   TRUE            TRUE             TRUE
5         35   FALSE           TRUE             TRUE
6         49   TRUE            FALSE            FALSE
7         68   TRUE            TRUE             TRUE
8         71   TRUE            FALSE            FALSE
9         73   TRUE            FALSE            TRUE
With only 9 obs in the data, we allow the tree to have a very small # of obs in its final nodes. We want the Videos variable in because we suspect it is important. But doing so (by allowing few obs in the final nodes) also brought in a split on Hats, which seems irrelevant, just noise leading to over-fitting, because the tree searches smaller and smaller areas of the data as it progresses.
Let's go in steps and look at the results of Tree1 before the second splits, stopping at the first split, where the predictions are 19.25 and 57.2, and obtain the residuals.
Tree 1 (figure): root splits on LikesGardening (F → 19.25, T → 57.2); the full tree would split further on Videos and Hats.
Run another tree using Tree1 residuals as new target.
PersonID  Age  Tree1 Prediction  Tree1 Residual
1         13   19.25             -6.25
2         14   19.25             -5.25
3         15   19.25             -4.25
4         25   57.2              -32.2
5         35   19.25             15.75
6         49   57.2              -8.2
7         68   57.2              10.8
8         71   57.2              13.8
9         73   57.2              15.8
Tree 2 (figure): root splits on PlaysVideoGames (F → 7.133, T → -3.567).
Note: Tree2 did not use LikesHats because, between Hats and VideoGames, VideoGames is preferred when using all obs, rather than the smaller region of the data where Hats appeared deep in the full Tree1. Thus noise is avoided.
Tree 1 SSE = 1994 Tree 2 SSE = 1765
PersonID  Age  Tree1 Prediction  Tree1 Residual  Tree2 Prediction  Combined Prediction  Final Residual
1         13   19.25             -6.25           -3.567            15.68                2.683
2         14   19.25             -5.25           -3.567            15.68                1.683
3         15   19.25             -4.25           -3.567            15.68                0.6833
4         25   57.2              -32.2           -3.567            53.63                28.63
5         35   19.25             15.75           -3.567            15.68                -19.32
6         49   57.2              -8.2             7.133            64.33                15.33
7         68   57.2              10.8            -3.567            53.63                -14.37
8         71   57.2              13.8             7.133            64.33                -6.667
9         73   57.2              15.8             7.133            64.33                -8.667

Combined prediction for PersonID 1: 15.68 = 19.25 - 3.567.
So far:
1) Started with a "weak" model F0(x) = ȳ (the mean of y).
2) Fitted a second model to the residuals: h1(x) = y - F0(x).
3) Combined the two previous models: F1(x) = F0(x) + h1(x).
Notice that h1(x) could be any type of model (stacking), not just trees. And we continue recursing until M.
The initial weak model was the mean because it is well known that the mean minimizes SSE.
Q: how to choose M, the gradient boosting hyper-parameter? Usually cross-validation.
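A from-scratch sketch of this recursion for squared-error loss (assumptions: scikit-learn regression trees as the weak learners, numpy arrays X and y; M, the depth and the shrinkage are the hyper-parameters discussed in the text):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, M=100, depth=1, lr=0.1):
    f0 = y.mean()                              # F0 = mean, the SSE-minimizing constant
    F = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        resid = y - F                          # negative gradient of 0.5*(y - F)^2
        h = DecisionTreeRegressor(max_depth=depth).fit(X, resid)
        F = F + lr * h.predict(X)              # F_m = F_{m-1} + shrinkage * h_m
        trees.append(h)
    return f0, trees

def gb_predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)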
4) Alternative to the mean: minimize absolute error instead of SSE as the loss function. More expensive, because the minimizer is the median, which is computationally costlier. In this case, in Tree 1 above, use median(y) = 35 and obtain the residuals.
PersonID  Age  F0   Residual0
1         13   35   -22
2         14   35   -21
3         15   35   -20
4         25   35   -10
5         35   35   0
6         49   35   14
7         68   35   33
8         71   35   36
9         73   35   38
Focus on observations 1 and 4, with residuals of -22 and -10 respectively, to understand the median case. Under the SSE loss function (standard tree regression), reducing a residual by 1 unit drops the SSE by 43 and 19 respectively (e.g., 22*22 - 21*21 = 43, 100 - 81 = 19), while for absolute loss the reduction is just 1 and 1 (22 - 21, 10 - 9). So SSE reduction will focus more on the first observation (because of the 43), while absolute error focuses on all obs equally because they are all 1.
Instead of training subsequent trees on the residuals of F0, train h0 on the gradient of the loss function L(y, F0(x)) w.r.t. the y-hats produced by F0(x). With absolute-error loss, subsequent h trees will consider only the sign of every residual, as opposed to SSE loss, which considers the magnitude of the residual.
The gradient of SSE, d[0.5*(y - ŷ)²]/dŷ = -(y - ŷ), is minus the residual, so this is a gradient descent algorithm. For absolute error (see below):
Each h tree groups observations into final nodes, and average gradient can be calculated in
each and scaled by factor γ, such that Fm + γm hm minimizes loss function in each node.
Shrinkage: at each gradient step, the magnitude is multiplied by a factor between 0 and 1 called the learning rate, so each gradient step is shrunken, allowing slow convergence toward the observed values; observations close to their target values end up grouped into larger nodes, thus regularizing the method.
Finally before each new tree step, row and column sampling occur to produce more different
tree splits (similar to Random Forests).
Absolute error: AE = |Y - Ŷ|.

Gradient of AE: dAE/dŶ = -1 if Y > Ŷ, +1 if Y < Ŷ, i.e., -sign(Y - Ŷ).
Results for SSE and Absolute Error: SSE case.

h0 (figure): root splits on Gardening (F → -21.08, T → 16.87).
h1 (figure): root splits on Videos (F → 7.133, T → -3.567).

Age  F0     PseudoResidual0  h0      gamma0  F1     PseudoResidual1  h1      gamma1  F2
13   40.33  -27.33           -21.08  1       19.25  -6.25            -3.567  1       15.68
14   40.33  -26.33           -21.08  1       19.25  -5.25            -3.567  1       15.68
15   40.33  -25.33           -21.08  1       19.25  -4.25            -3.567  1       15.68
25   40.33  -15.33            16.87  1       57.2   -32.2            -3.567  1       53.63
35   40.33  -5.333           -21.08  1       19.25  15.75            -3.567  1       15.68
49   40.33   8.667            16.87  1       57.2   -8.2              7.133  1       64.33
68   40.33  27.67             16.87  1       57.2   10.8             -3.567  1       53.63
71   40.33  30.67             16.87  1       57.2   13.8              7.133  1       64.33
73   40.33  32.67             16.87  1       57.2   15.8              7.133  1       64.33

E.g., for the first observation: 40.33 is the mean age; -27.33 = 13 - 40.33; -21.08 is the prediction for Gardening = F; F1 = 19.25 = 40.33 - 21.08; PseudoResidual1 = 13 - 19.25; F2 = 19.25 - 3.567 = 15.68. Gamma0 = avg(pseudoresidual0 / h0) (by the different values of h0); same for gamma1.
Results for SSE and Absolute Error: Absolute Error case.

h0 (figure): root splits on Gardening (F → -1, T → 0.6).
h1 (figure): root splits on Videos (F → 0.333, T → -0.333).

Age  F0  PseudoResidual0  h0    gamma0  F1    PseudoResidual1  h1       gamma1  F2
13   35  -1               -1    20.5    14.5  -1               -0.3333  0.75    14.25
14   35  -1               -1    20.5    14.5  -1               -0.3333  0.75    14.25
15   35  -1               -1    20.5    14.5   1               -0.3333  0.75    14.25
25   35  -1                0.6  55      68    -1               -0.3333  0.75    67.75
35   35  -1               -1    20.5    14.5   1               -0.3333  0.75    14.25
49   35   1                0.6  55      68    -1                0.3333  9       71
68   35   1                0.6  55      68    -1               -0.3333  0.75    67.75
71   35   1                0.6  55      68     1                0.3333  9       71
73   35   1                0.6  55      68     1                0.3333  9       71

E.g., for the 1st observation: 35 is the median age; the pseudo-residual is -1 or 1 according to whether the raw residual is negative or positive. F1 = 14.5 because 35 + 20.5 * (-1). F2 = 14.25 = 14.5 + 0.75 * (-0.3333). Predictions within leaf nodes are computed as the mean of the obs therein.
Gamma0 = median((age - F0) / h0) = avg((14 - 35) / -1, (15 - 35) / -1) = 20.5; 55 = (68 - 35) / 0.6.
Gamma1 = median((age - F1) / h1) by the different values of h1 (and of h0 for gamma0).
Quick description of GB using trees (GBDT).
1) Create a very small tree as the initial model, a "weak" learner (e.g., a tree with two terminal nodes, i.e., depth = 1). "Weak" avoids over-fitting and local minima. It produces a prediction, F1, for each obs. Tree1.
2) Each tree allocates a probability of event or a mean value in each terminal node, according to the nature of the dependent variable or target.
3) Compute "residuals" (prediction errors) for every observation (if 0-1 target, apply the logistic transformation p / (1 - p) to linearize them).
4) Use the residuals as the new target variable and grow a second small tree on them (second stage of the process, same depth). To ensure against over-fitting, use a random sample without replacement ("stochastic gradient boosting"). Tree2.
5) New model, once second stage is complete, we obtain concatenation of two trees, Tree1 and
Tree2 and predictions F1 + F2 * gamma, gamma multiplier or shrinkage factor (called step
size in gradient descent).
6) Iterate procedure of computing residuals from most recent tree, which become the target of
the new model, etc.
7) In the case of a binary target variable, each tree produces at least some nodes in which
‘event’ is majority (‘events’ are typically more difficult to identify since most data sets
contain very low proportion of ‘events’ in usual case).
8) Final score for each observation is obtained by summing (with weights) the different scores
(probabilities) of every tree for each observation.
Why does it work? Why “gradient” and “boosting”?
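For reference, a hedged library version of steps 1)-8) (scikit-learn's GradientBoostingClassifier; subsample < 1 gives the "stochastic gradient boosting" sampling of step 4, max_depth the weak-learner depth; values are illustrative only):

from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier(
    n_estimators=100,    # number of boosting iterations (trees)
    max_depth=1,         # 'weak' learner: stumps
    learning_rate=0.1,   # shrinkage / step size
    subsample=0.6)       # random sample (without replacement) at each iteration
# gbdt.fit(X_train, y_train); gbdt.predict_proba(X_test)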
Comparing GBDT vs Trees in point 4 above (I).
GBDT takes a sample from the training data to create the tree at each iteration; CART does not. Below, notice the differences between a 60% sample proportion for GBDT and no sampling for generic trees on the fraud data set, with total_spend as the target. Predictions are similar.
IF doctor_visits < 8.5 THEN DO; /* GBDT */
_prediction_ + -1208.458663;
END;
ELSE DO;
_prediction_ + 1360.7910083;
END;
IF 8.5 <= doctor_visits THEN DO; /* GENERIC TREES*/
P_pseudo_res0 = 1378.74081896893;
END;
ELSE DO;
P_pseudo_res0 = -1290.94575707227;
END;
Comparing GBDT vs Trees in point 4 above (II).
Again, GBDT takes sample from training data to create tree at each
iteration, CART does not. If we allow for CART to work with same
proportion sample but different seed, splitting variables may be different at
specific depth of tree creation.
/* GBDT */
IF doctor_visits < 8.5 THEN DO;
   _ARB_F_ + -579.8214325;
END;
ELSE DO;
   _ARB_F_ + 701.49142697;
END;

/* ORIGINAL TREES */
IF 183.5 <= member_duration THEN DO;
   P_pseudo_res0 = 1677.87318718526;
END;
ELSE DO;
   P_pseudo_res0 = -1165.32773940565;
END;

EDA of the two samples would indicate subtle differences that induce differences in the selected splitting variables.
More Details
Friedman’s general 2001 GB algorithm:
1) Data (Y, X), Y (N, 1), X (N, p)
2) Choose # iterations M
3) Choose loss function L(Y, F(x), Error), and corresponding gradient, i.e., 0-1 loss
function, and residuals are corresponding gradient. Function called ‘f’. Loss f
implied by Y.
4) Choose base learner h( X, θ), say shallow trees.
Algorithm:
1: initialize f0 with a constant, usually mean of Y.
2: for t = 1 to M do
3: compute negative gradient gt(x), i.e., residual from Y as next target.
4: fit a new base-learner function h(x, θt), i.e., tree.
5: find the best gradient-descent step size γ_t, minimizing the loss:

   γ_t = argmin_γ Σ_{i=1}^{n} L(y_i, f_{t-1}(x_i) + γ h_t(x_i)),   0 < γ

6: update the function estimate:

   f_t(x) = f_{t-1}(x) + γ_t h_t(x, θ_t)

7: end for
(all f functions are function estimates, i.e., "hats").
Specifics of Tree Gradient Boosting, called TreeBoost (Friedman).
Friedman’s 2001 GB algorithm for tree methods:
Same as the previous one, with

   h_t(x) = Σ_{j=1}^{J} p_{jt} I(x ∈ N_{jt}),

where p_{jt} is the prediction of tree t in final node N_{jt}.

In TreeBoost, Friedman proposes finding an optimal γ_{jt} in each final node instead of a unique γ at every iteration. Then

   f_t(x) = f_{t-1}(x) + Σ_{j=1}^{J} γ_{jt} h_t(x) I(x ∈ N_{jt}),

   γ_{jt} = argmin_γ Σ_{x_i ∈ N_{jt}} L(y_i, f_{t-1}(x_i) + γ h_t(x_i)).
Parallels with Stepwise (regression) methods.
Stepwise starts from original Y and X, and in later iterations
turns to residuals, and reduced and orthogonalized X matrix,
where ‘entered’ predictors are no longer used and
orthogonalized away from other predictors.
GBDT uses residuals as targets, but does not orthogonalize or
drop any predictors.
Stepwise stops either by statistical inference, or AIC/BIC
search. GBDT has a fixed number of iterations.
Stepwise has no ‘gamma’ (shrinkage factor).
Setting.
Hypothesize the existence of a function Y = f(X, betas, error). Change of paradigm: no MLE (e.g., logistic, regression, etc.) but a loss function.
Minimize the loss function itself; its expected value is called the risk. Many different loss functions are available: gaussian, 0-1, etc.
A loss function describes the loss (or cost) associated with all possible
decisions. Different decision functions or predictor functions will tend
to lead to different types of mistakes. The loss function tells us which
type of mistakes we should be more concerned about.
For instance, estimating demand, decision function could be linear equation
and loss function could be squared or absolute error.
The best decision function is the function that yields the lowest expected
loss, and the expected loss function is itself called risk of an estimator. 0-1
assigns 0 for correct prediction, 1 for incorrect.
Key Details.
Friedman’s 2001 GB algorithm: Need
1) Loss function (usually determined by nature of Y (binary,
continuous…)) (NO MLE).
2) Weak learner, typically tree stump or spline, marginally better
classifier than random (but by how much?).
3) Model with T iterations:

   ŷ_i = Σ_{t=1}^{T} tree_t(X_i)

   Objective function: Σ_{i=1}^{n} L(y_i, ŷ_i) + Σ_{t=1}^{T} Ω(Tree_t)

   Ω = {# nodes in each tree; L2 or L1 norm of leaf weights; other}. The Ω function is not directly optimized by GB.
L2 error penalizes symmetrically away from 0; Huber penalizes less than OLS outside [-1, 1]; Bernoulli and Adaboost are very similar. Note that Y ∈ {-1, 1} in the 0-1 case here.
Gradient Descent.
“Gradient” descent method to find minimum of function.
Gradient: multivariate generalization of derivative of function in one
dimension to many dimensions. I.e., gradient is vector of partial
derivatives. In one dimension, gradient is tangent to function.
Easier to work with convex and “smooth” functions.
(Figure: a convex vs. a non-convex function.)
Gradient Descent.
Let L (x1, x2) = 0.5 * (x1 – 15) **2 + 0.5 * (x2 – 25) ** 2, and solve for X1 and X2 that min L by gradient
descent.
Steps:
Take M = 100. Starting point s0 = (0, 0) Step size = 0.1
Iterate m = 1 to M:
1. Calculate the gradient of L at s_{m-1}.
2. Step in the direction of greatest descent (the negative gradient) with step size γ, i.e., s_m = s_{m-1} - γ ∇L(s_{m-1}).
If γ is small and M large, s_M minimizes L.
Additional considerations:
Instead of M iterations, stop when next improvement small.
Use line search to choose step sizes (Line search chooses search in descent direction of
minimization).
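The toy problem coded directly (a sketch using the values given above):

import numpy as np

def grad_L(s):
    # partial derivatives of L(x1, x2) = 0.5*(x1 - 15)**2 + 0.5*(x2 - 25)**2
    return np.array([s[0] - 15.0, s[1] - 25.0])

s = np.array([0.0, 0.0])       # starting point s0
for _ in range(100):           # M = 100 iterations
    s = s - 0.1 * grad_L(s)    # step against the gradient, step size 0.1
print(s)                       # converges toward the minimizer (15, 25)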
How does it work in gradient boosting?
The objective is to minimize L, starting from F0(x). For m = 1, compute the gradient of L w.r.t. F0(x). Then fit a weak learner to the gradient components; for a regression tree, obtain the average gradient in each final node. In each node, step in the direction of the avg. gradient, using line search to determine the step magnitude. The outcome is F1; repeat. In symbols:
Initialize model with constant: F0(x) = mean, median, etc.
For m = 1 to M Compute pseudo residual
fit base learner h to residuals
compute step magnitude gamma m (for trees, different gamma for
each node)
Update Fm(x) = Fm-1(x) + γm hm(x)
“Gradient” descent
Method of gradient descent is a first order optimization algorithm that is based on taking
small steps in direction of the negative gradient at one point in the curve in order to find
the (hopefully global) minimum value (of loss function). If it is desired to search for the
maximum value instead, then the positive gradient is used and the method is then called
gradient ascent.
Second order not searched, solution could be local minimum.
Requires starting point, possibly many to avoid local minima.
Comparing a full tree (depth = 6) to boosted-tree residuals by iteration.
2 GB versions: 1) with the raw 20% events (M1), 2) with a 50/50 mixture of events (M2). The non-GB tree (referred to as maxdepth 6 for the M1 data set) is the most biased. Notice that M2 stabilizes earlier than M1. X axis: iteration #. Y axis: average residual. "Tree depth 6" is obviously unaffected by iteration since it is a single tree run.
(Figure: average residuals by iteration by model name in gradient boosting; series MEAN_RESID_M1_TRN_TREES and MEAN_RESID_M2_TRN_TREES over iterations 0-10; Tree depth 6 ≈ 2.83E-15; vertical line marks where the mean stabilizes.)
Comparing a full tree (depth = 6) to boosted-tree residuals by iteration.
Now Y = variance of the residuals. M2 has the highest variance, followed by Depth 6 (single tree) and then M1. M2 stabilizes earlier as well. In conclusion, M2 has lower bias and higher variance than M1 in this example; the difference lies in the mixture of 0-1 in the target variable.
(Figure: variance of residuals by iteration in gradient boosting; series VAR_RESID_M1_TRN_TREES and VAR_RESID_M2_TRN_TREES over iterations 0-10; Depth 6 = 0.145774; vertical line marks where the variance stabilizes.)
Basic information on the original data sets:
  Data set name ............... train
  # TRN obs ................... 3595
  Validation data set ......... validata
  # VAL obs ................... 2365
  Test data set ...............
  # TST obs ................... 0
  Dep variable ................ fraud
  Pct Event Prior TRN ......... 20.389
  Pct Event Prior VAL ......... 19.281
  Pct Event Prior TEST ........
TRN and VAL data sets obtained by random sampling without replacement.
Variable          Label
FRAUD             Fraudulent Activity yes/no
total_spend       Total spent on opticals
doctor_visits     Total visits to a doctor
no_claims         No of claims made recently
member_duration   Membership duration
optom_presc       Number of opticals claimed
num_members       Number of members covered
Fraud data set, original 20% fraudsters.
Study alternatives, changing the number of iterations from 3 to 50 and the depth from 1 to 10, with training and validation data sets.
Original percentage of fraudsters: 20% in both data sets.
Notice there are just 5 predictors, so a maximum of 50 iterations is already an exaggeration. In the usual large databases, the number of iterations could reach 1000 or higher.
E.g., M5_VAL_GRAD_BOOSTING: M5 case with validation data set and using
gradient boosting as modeling technique. Model # 10 as identifier.
Requested Models: Names & Descriptions.

Full Model Name            Model Description                  Model #
Overall Models                                                -1
M1                         Raw 20pct depth 1 iterations 3     -10
M2                         Raw 20pct depth 1 iterations 10    -10
M3                         Raw 20pct depth 5 iterations 3     -10
M4                         Raw 20pct depth 5 iterations 10    -10
M5                         Raw 20pct depth 10 iterations 50   -10
01_M1_TRN_GRAD_BOOSTING    Gradient Boosting                  1
02_M1_VAL_GRAD_BOOSTING    Gradient Boosting                  2
03_M2_TRN_GRAD_BOOSTING    Gradient Boosting                  3
04_M2_VAL_GRAD_BOOSTING    Gradient Boosting                  4
05_M3_TRN_GRAD_BOOSTING    Gradient Boosting                  5
06_M3_VAL_GRAD_BOOSTING    Gradient Boosting                  6
07_M4_TRN_GRAD_BOOSTING    Gradient Boosting                  7
08_M4_VAL_GRAD_BOOSTING    Gradient Boosting                  8
09_M5_TRN_GRAD_BOOSTING    Gradient Boosting                  9
10_M5_VAL_GRAD_BOOSTING    Gradient Boosting                  10
All models agree on no_claims as the first split. Disagreement at depth 2: M2 does not use member_duration.
M5 (# 9) yields different importance levels.
The probability range is largest for M5.
M5 is best per AUROC, also when validated.
Huge jump in performance per the R-square measure.
M5 (#09) obvious winner.
Overall conclusion for GB parameters
While higher values of number of iterations and depth imply
longer (and possibly significant) computer runs,
constraining these parameters can have significant negative
effects on model results.
In context of thousands of predictors, computer resource
availability might significantly affect model results.
Overall Ensembles.
Given specific classification study and many different modeling techniques,
create logistic regression model with original target variable and the different
predictions from the different models, without variable selection (this is not
critical).
Evaluate importance of different models either via p-values or partial
dependency plots.
Note: It is not Stacking, because Stacking “votes” to decide on final
classification.
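A minimal sketch of this overall ensemble (assumptions: scikit-learn; the keys and arrays below are hypothetical names for the validation-sample predicted probabilities of the individual models):

import numpy as np
from sklearn.linear_model import LogisticRegression

def overall_ensemble(pred_dict, y):
    # pred_dict, e.g. {"bagging": p_bagging, "rf": p_rf, "gb": p_gb, "logistic": p_logit}
    names = list(pred_dict)
    X_meta = np.column_stack([pred_dict[k] for k in names])   # no variable selection
    meta = LogisticRegression().fit(X_meta, y)
    return meta, dict(zip(names, meta.coef_.ravel()))          # per-model weights to evaluate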
Partial Dependency plots (PDP).
Due to GB’s (and other methods’) black-box nature, these plots show the
effect of predictor X on modeled response once all other predictors
have been marginalized (integrated away). Marginalized Predictors
usually fixed at constant value, typically mean.
Graphs may not capture nature of variable interactions especially if
interaction significantly affects model outcome.
Formally, PDP of F(x1, x2, xp) on X is E(F) over all vars except X. Thus, for
given Xs, PDP is average of predictions in training with Xs kept constant.
Since GB, Boosting, Bagging, etc are BLACK BOX models, use PDP to
obtain model interpretation. Also useful for logistic models.
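A short sketch (assuming a recent scikit-learn; "model" is any fitted black-box estimator, e.g. the gradient boosting model, and X its training inputs):

from sklearn.inspection import PartialDependenceDisplay

def plot_pdp(model, X, feature_names):
    # sweeps each listed feature over a grid and averages the model's predictions
    # over the training data at each grid value (other predictors marginalized)
    return PartialDependenceDisplay.from_estimator(model, X, features=feature_names)

# e.g. plot_pdp(gbdt, X_train, ["no_claims", "member_duration"])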
Analytical problem to investigate.
Optical health care insurance fraud. Longer care typically involves higher treatment costs, and the insurance company has to set up reserves immediately, as soon as a case is opened. Sometimes doctors are involved in the fraud.
Aim: predict fraudulent charges, a classification problem; we'll use a battery of models and compare them, with and without a 50/50 resampled training sample. Below left, original data (M1 models); right, 50/50 training (M2 models).
Basic information on the original data sets:

Original data (M1 models):
  Data set name .............. train
  Num_observations ........... 3595
  Validation data set ........ validata
  Num_observations ........... 2365
  Test data set ..............
  Num_observations ........... 0
  Dep variable ............... fraud
  Pct Event Prior TRN ........ 20.389
  Pct Event Prior VAL ........ 19.281
  Pct Event Prior TEST .......

50/50 training data (M2 models):
  Data set name .............. sampled50_50
  Num_observations ........... 1133
  Validation data set ........ validata50_50
  Num_observations ........... 4827
  Test data set ..............
  Num_observations ........... 0
  Dep variable ............... fraud
  Pct Event Prior TRN ........ 50.838
  Pct Event Prior VAL ........ 12.699
  Pct Event Prior TEST .......
Requested Models: Names & Descriptions.

Full Model Name                     Model Description      Model #
Overall Models                                             -1
M1                                  Raw 20pct              -10
M2                                  50/50 prior for TRN    -10
01_M1ENSEMBLE_TRN_LOGISTIC_NONE Logistic TRN NONE 1
02_M1ENSEMBLE_VAL_LOGISTIC_NONE Logistic VAL NONE 2
03_M1_TRN_BAGGING Bagging TRN Bagging 3
04_M1_TRN_GRAD_BOOSTING Gradient Boosting 4
05_M1_TRN_LOGISTIC_STEPWISE Logistic TRN STEPWISE 5
06_M1_TRN_RFORESTS Random Forests 6
07_M1_TRN_TREES Trees TRN Trees 7
08_M1_VAL_BAGGING Trees VAL Trees 8
09_M1_VAL_GRAD_BOOSTING Gradient Boosting 9
10_M1_VAL_LOGISTIC_STEPWISE Logistic VAL STEPWISE 10
11_M1_VAL_RFORESTS Random Forests 11
12_M1_VAL_TREES Trees VAL Trees 12
13_M2ENSEMBLE_TRN_LOGISTIC_NONE Logistic TRN NONE 13
14_M2ENSEMBLE_VAL_LOGISTIC_NONE Logistic VAL NONE 14
15_M2_TRN_BAGGING Bagging TRN Bagging 15
16_M2_TRN_GRAD_BOOSTING Gradient Boosting 16
17_M2_TRN_LOGISTIC_STEPWISE Logistic TRN STEPWISE 17
18_M2_TRN_RFORESTS Random Forests 18
19_M2_TRN_TREES Trees TRN Trees 19
20_M2_VAL_BAGGING Trees VAL Trees 20
21_M2_VAL_GRAD_BOOSTING Gradient Boosting 21
22_M2_VAL_LOGISTIC_STEPWISE Logistic VAL STEPWISE 22
23_M2_VAL_RFORESTS Random Forests 23
24_M2_VAL_TREES Trees VAL Trees 24
E.g., 08_M1_VAL_BAGGING: 8th model of M1 data set case, Validation and
using Bagging as the modeling technique.
For the models other than the trees themselves, the modeled posterior probabilities are used as an interval-valued target variable.
For simplicity, just the first 4 levels of the trees are shown.
Notation: M5_GB_TRN_TREES: model M5, tree simulation of the gradient boosting run. BG: Bagging, RF: Random Forests, LG: Logistic.
Intention: obtain a general idea of the tree representation for comparison to a standard tree model.
Next page: small detail for BG (Bagging), GB (Gradient Boosting) and the trees themselves (notice the level difference). Later, a graphical comparison of variables + splits at each tree level.
Requested Tree Models: Names & Descriptions (splits by level, event probability in parentheses).

M1_BG_TRN_TREES:
  no_claims <= 0.5 (0.142)
    member_duration <= 180 (0.200)
      total_spend <= 52.5 (0.519); total_spend > 52.5 (0.184)
    member_duration > 180 (0.062)
      doctor_visits <= 5.5 (0.092); doctor_visits > 5.5 (0.048)
  no_claims > 0.5 (0.446)
    no_claims <= 3.5 (0.394)
      member_duration <= 127 (0.545); member_duration > 127 (0.320)
    no_claims > 3.5 (0.788)
      optom_presc <= 3.5 (0.783); optom_presc > 3.5 (0.817)

M1_GB_TRN_TREES:
  no_claims <= 2.5 (0.184)
    no_claims <= 0.5 (0.158)
      member_duration <= 180 (0.199); member_duration > 180 (0.102)
    no_claims > 0.5 (0.321)
      optom_presc <= 3.5 (0.287); optom_presc > 3.5 (0.615)
  no_claims > 2.5 (0.634)
    no_claims <= 4.5 (0.570)
      optom_presc <= 3.5 (0.537); optom_presc > 3.5 (0.839)
    no_claims > 4.5 (0.764)
      member_duration <= 303 (0.781); member_duration > 303 (0.656)

M1_RF_TRN_TREES:
  total_spend <= 50.5 (0.396)
    no_claims <= 0.5 (0.328)
      member_duration <= 181 (0.375); member_duration > 181 (0.152)
    no_claims > 0.5 (0.552)
      member_duration <= 1.66 (0.648); member_duration > 1.66 (0.397)
  total_spend > 50.5 (0.197)
    no_claims <= 0.5 (0.182)
      optom_presc <= 5.5 (0.180); optom_presc > 5.5 (0.296)
    no_claims > 0.5 (0.257)
      total_spend <= 86.5 (0.458); total_spend > 86.5 (0.227)

M1_TRN_TREES:
  no_claims <= 0.5 (0.142)
    member_duration > 180 (0.062)
      doctor_visits <= 5.5 (0.113)
      doctor_visits > 5.5 (0.038)
        member_duration <= 150 (0.974); member_duration > 150 (0.577)
    member_duration <= 180 (0.201)
      total_spend <= 42.5 (0.718)
      total_spend > 42.5 (0.189)
        member_duration <= 325 (0.099)
Tree representations by level and by node.
Tree representation comparisons, level 1.
Except for RF (03), all methods split at no_claims = 0.5 but attain different event probabilities.
RF (M2 07) splits uniquely on Optom_presc. Notice that the split values for member_duration and
no_claims are not necessarily the same across models.
ETC. Next, how do variables behave in each model (omitting
LG) ?
Tree representations by level and by split variable.
RF is alone in selecting total_spend. Notice prob < 0.4 in both nodes, compared to ~0.78 above for M2_TRN_TREES. In later levels, RF continues relatively apart from the other models.
Requested ENSEMBLE Tree Models: Names & Descriptions (splits by level; probabilities shown as ( . ) where not extracted).

M1_NSMBL_LG_TRN_TREES (Mod # 4):
  p_M1_RFORESTS < 0.32284 ( . )
    p_M1_RFORESTS < 0.21769 ( . )
      p_M1_RFORESTS < 0.12995 ( . ); p_M1_RFORESTS >= 0.12995 ( . )
    p_M1_RFORESTS >= 0.21769 ( . )
      p_M1_LOGISTIC_STEPWISE < 0.2138 ( . ); p_M1_LOGISTIC_STEPWISE >= 0.2138 ( . )
  p_M1_RFORESTS >= 0.32284 ( . )
    p_M1_RFORESTS < 0.47186 ( . )
      p_M1_LOGISTIC_STEPWISE < 0.36438 ( . ); p_M1_LOGISTIC_STEPWISE >= 0.36438 ( . )
    p_M1_RFORESTS >= 0.47186 ( . )
      p_M1_RFORESTS < 0.64668 ( . ); p_M1_RFORESTS >= 0.64668 ( . )

M2_NSMBL_LG_TRN_TREES (Mod # 9):
  p_M2_GRAD_BOOSTING < 0.53437 ( . )
    p_M2_GRAD_BOOSTING < 0.36111 ( . )
      p_M2_RFORESTS < 0.30466 ( . ); p_M2_RFORESTS >= 0.30466 ( . )
    p_M2_GRAD_BOOSTING >= 0.36111 ( . )
      p_M2_GRAD_BOOSTING < 0.47453 ( . ); p_M2_GRAD_BOOSTING >= 0.47453 ( . )
  p_M2_GRAD_BOOSTING >= 0.53437 ( . )
    p_M2_GRAD_BOOSTING < 0.62556 ( . )
      p_M2_BAGGING < 0.34876 ( . ); p_M2_BAGGING >= 0.34876 ( . )
    p_M2_GRAD_BOOSTING >= 0.62556 ( . )
      p_M2_RFORESTS < 0.81591 ( . ); p_M2_RFORESTS >= 0.81591 ( . )
M1 ensembled mostly in RF, M2 in Gradient Boosting.
Conclusion on tree representations
No_claims at 0.5 certainly top splitter but notice that event
probabilities diverge (because RF, GB and BG model a
posterior probability, not a binary event, and thus carry
information from a previous model). Later splits diverge in
predictors and split values.
It is important to view each tree model independently to gauge interpretability, and to remember that the dependent variable in the models other than trees is the probability of event that resulted from BG, RF or GB.
It is also important to view these recent findings in terms of variable importance.
Importance measures for tree-based methods.
RF and GB significant.
BG, GB, STPW significant.
Tree methods find no_claims as most important, logistic finds most predictors important.
Tree based methods do not reach top probability of 1.
Not over-fitted. Some strong over-fit.
The degree of over-fit differs from that in the classification rates (previous slide).
Note that the TRN and VAL ranks do not match. Lower VAL-ranked models tend to overfit more.
M2-Ensemble has best average Validation ranking, Random Forests worst.
The two ensembles and the two gradient boosting models are the best performers.
50/50: scales shifted up.
Very interesting, almost U-shaped relationship, conditioned on the other variables in the model.
Different K-S values.
While all tree models choose no_claims as most important, the 50/50 trees (M2_TREES) selected just no_claims, while M1_TREES selected 3 additional predictors. BG, RF and GB are not similarly affected.
The M2 tree grows smaller trees and lowers misclassification from 0.5 to about 0.27; M1 goes from 0.2 to about 0.15.
Similarly for ASE.
M1_tree achieves a wider range of posterior probabilities.
Conclusion on 50/50 resampling.
In this example, 50/50 resampled models yielded a
smaller Tree with worse performance than its raw
counterpart.
Actual performance (for best models) was not affected
by 50/50 or raw modeling.
XGBoost
Developed by Chen and Guestrin (2016) XGBoost: A Scalable Tree
Boosting System.
Claims: Faster and better than neural networks and Random Forests.
Uses 2nd order gradients of loss functions based on Taylor expansions of loss functions,
plugged into same algorithm for greater generalization. In addition, transforms loss function
into more sophisticated objective function containing regularization terms, that penalizes tree
growth, with penalty proportional to the size of the node weights thus preventing overfitting.
More efficient than GB due to parallel computing on single computer (10 times faster).
Algorithm takes advantage of advanced decomposition of objective function that allows for
outperforming GB.
Not yet SAS available. Available in R, Julia, Python, CLI.
Tool used in many champion models in recent competitions (Kaggle, etc.).
See also Foster’s (2017) XGboostExplainer.
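A hedged sketch using the Python package's scikit-learn wrapper (parameter values illustrative only; reg_lambda is the L2 penalty on leaf weights mentioned above):

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=3,
    learning_rate=0.1,
    subsample=0.8,          # row sampling per tree
    colsample_bytree=0.8,   # column sampling per tree
    reg_lambda=1.0)         # L2 regularization of leaf weights
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
# model.predict_proba(X_val)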
Comments on GB.
1) Not immediately apparent what weak classifier is for GB (e.g., by
varying depth in our case). Likewise, number of iterations is big
issue. In our simple example, M6 GB was best performer. Still, overall
modeling benefited from ensembling all methods as measured by
either AUROC or Cum Lift or ensemble p-values.
2) The posterior probability ranges are vastly different and thus the
tendency to classify observations by the .5 threshold is too simplistic.
3) The PDPs show that different methods find distinct multivariate
structures. Interestingly, the ensemble p-values show a decreasing
tendency by logistic and trees and a strong S shaped tendency
by M6 GB, which could mean that M6 GB alone tends to
overshoot its predictions.
4) GB relatively unaffected by 50/50 mixture.
Comments on GB.
5) While in classification GB problems the predictions are within [0, 1], for continuous-target problems the predictions can be beyond the range of the target variable, which leads to headaches.
This is because GB models the residual at each iteration, not the original target; this can lead to surprises, such as negative predictions when Y takes only non-negative values, contrary to the original tree algorithm.
6) Shrinkage parameter and early stopping (# trees) act as regularizers
but combined effect not known and could be ineffective.
7) If shrinkage too small, and allow large T, model is large, expensive
to compute, implement and understand.
Drawbacks of GB.
1) IT IS NOT MAGIC, it won’t solve ALL modeling needs,
but best off-the-shelf tool. Still need to look for
transformations, odd issues, missing values, etc.
2) As all tree methods, categorical variables with many levels can
make it impossible to obtain model. E.g., zip codes.
3) Memory requirements can be very large, especially with large
iterations, typical problem of ensemble methods.
4) A large number of iterations means slow prediction speed, so on-line scoring may require a trade-off between complexity and available time. Once GB is learned, parallelization certainly helps.
5) No simple algorithm to capture interactions because of base-
learners.
6) No simple rules to determine gamma, # of iterations or depth of
simple learner. Need to try different combinations and possibly
recalibrate in time.
7) Still, one of the most powerful methods available.
Un-reviewed
Catboost
DeepForest
gcForest
Use of tree methods for continuous target variable.
…
2.11) References
Auslender L. (1998): Alacart, poor man's classification trees, NESUG.
Breiman L., Friedman J., Olshen R., Stone J. (1984): Classification and Regression Trees, Wadsworth.
Chen T., Guestrin C. (2016): XGBoost: A Scalable Tree Boosting System.
Chipman H., George E., McCulloch R.: BART: Bayesian Additive Regression Trees, The Annals of Statistics.
Foster D. (2017): New R package that makes XGBoost interpretable, https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
Friedman J. (2001): Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189-1232. doi:10.1214/aos/1013203451
Paluszynska A. (2017): Structural mining and knowledge extraction from random forest with applications to The Cancer Genome Atlas project (https://www.google.com/url?q=https%3A%2F%2Frawgit.com%2FgeneticsMiNIng%2FBlackBoxOpener%2Fmaster%2FrandomForestExplainer_Master_thesis.pdf&sa=D&sntz=1&usg=AFQjCNHTJONZK24LioDeOB0KZnwLkn98fw and https://mi2datalab.github.io/randomForestExplainer/)
Quinlan J. R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
Earlier literature on combining methods:
Winkler, RL. and Makridakis, S. (1983). The combination of
forecasts. J. R. Statis. Soc. A. 146(2), 150-157.
Makridakis, S. and Winkler, R.L. (1983). Averages of
Forecasts: Some Empirical Results,. Management Science,
29(9) 987-996.
Bates, J.M. and Granger, C.W. (1969). The combination of
forecasts. Or, 451-468.
1) Can you explain in nontechnical language the idea of
maximum likelihood estimation? Of SVM?
2) Contrast GB with RF.
3) In what way is over-fitting like a glove? Like an umbrella?
4) Would ensemble models always improve on individual models?
5) Would you select variables by way of tree methods to use later in
linear methods? Yes? No? Why?
6) In Tree regression, final predictions are means. Could better
predictions be obtained by a regression model instead? A logistic for
a binary target? Discuss.
7) There are 9 coins, 8 of which are of equal weight, and there’s one
balance scale. How many weighings until you identify the odd coin?
8) Why are manhole covers round?
9) You obtain 100% accuracy in validation of a classification model.
Are you a genius? Yes, no, why?
10) If 85% of witnesses saw a blue car during the accident, and 15% saw
a red car, what is the probability that the car is blue?
Leonardo Auslender Copyright 2004
Leonardo Auslender – Copyright 2018 133
5/9/2018
Counter-interview questions (you ask the interviewer).
1) How do you measure the height of a building with just a
barometer? Give at least three answers.
2) Two players, A and B, take turns saying a positive integer
from 1 to 9. The numbers are added as they go; whoever
brings the total to 100 or above loses. Is there a strategy
to never lose? (Aborting a game midway is acceptable, but
give your reasoning.)
3) There are two jugs, one that holds 5 gallons and one that holds
3, plus a nearby water fountain. How do you put exactly 4 gallons
(a deviation of less than one ounce is fine) in the 5-gallon
jug?
Leonardo Auslender Copyright 2004
Leonardo Auslender – Copyright 2018 Ch. 5-134
5/9/2018
for now
  • 1. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 1 5/9/2018 Ensemble models and Gradient Boosting. Leonardo Auslender Independent Statistical Consultant Leonardo.Auslender ‘at’ Gmail ‘dot’ com. Copyright 2018.
  • 2. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 2 5/9/2018 Topics to cover: 1) Why more techniques? Bias-variance tradeoff. 2)Ensembles 1) Bagging – stacking 2) Random Forests 3) Gradient Boosting (GB) 4) Gradient-descent optimization method. 5) Innards of GB and example. 6) Overall Ensembles. 7) Partial Dependency Plots (PDP) 8) Case Study. 9) Xgboost 10)On the practice of Ensembles. 11)References.
  • 3. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 3 5/9/2018
  • 4. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 4 5/9/2018 1) Why more techniques? Bias-variance tradeoff. (Broken clock is right twice a day, variance of estimation = 0, bias extremely high. Thermometer is accurate overall, but reports higher/lower temperatures at night. Unbiased, higher variance. Betting on same horse always has zero variance, possibly extremely biased). Model error can be broken down into three components mathematically. Let f be estimating function. f-hat empirically derived function. Bet on right Horse and win. Bet on wrong Horse and lose. Bet on many Horses and win. Bet on many horses and lose.
  • 5. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 5 5/9/2018 Credit : Scott Fortmann-Roe (web)
  • 6. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 6 5/9/2018 Let X1, X2, X3,,, i.i.d random variables Well known that E(X) = , and variance (E(X)) = By just averaging estimates, we lower variance and assure same aspects of bias. Let us find methods to lower or stabilize variance (at least) while keeping low bias. And maybe also, lower the bias. And since cannot be fully attained, still searching for more techniques.  Minimize general objective function: n   Minimize loss function to reduce bias. Regularization, minimize model complexity. Obj(Θ) L(Θ) Ω(Θ), L(Θ) Ω(Θ)     set of model parameters. 1 p where Ω {w ,,,,,,w }, 
  • 7. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 7 5/9/2018
  • 8. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 8 5/9/2018 Some terminology for Model combinations. Ensembles: general name Prediction/forecast combination: focusing on just outcomes Model combination for parameters: Bayesian parameter averaging We focus on ensembles as Prediction/forecast combinations.
  • 9. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 Ch. 5-9 5/9/2018
  • 10. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 10 5/9/2018 Ensembles. Bagging (bootstrap aggregation, Breiman, 1996): Adding randomness  improves function estimation. Variance reduction technique, reducing MSE. Let initial data size n. 1) Construct bootstrap sample by randomly drawing n times with replacement (note, some observations repeated). 2) Compute sample estimator (logistic or regression, tree, ANN … Tree in practice). 3) Redo B times, B large (50 – 100 or more in practice, but unknown). 4) Bagged estimator. For classification, Breiman recommends majority vote of classification for each observation. Buhlmann (2003) recommends averaging bootstrapped probabilities. Note that individual obs may not appear B times each. NB: Independent sequence of trees. What if …….? Reduces prediction error by lowering variance of aggregated predictor while maintaining bias almost constant (variance/bias trade-off). Friedman (1998) reconsidered boosting and bagging in terms of gradient descent algorithms, seen later on.
  • 11. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 11 5/9/2018 From http://dni-institute.in/blogs/bagging-algorithm-concepts-with-example/
  • 12. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 Ch. 5-12 5/9/2018 Ensembles Evaluation: Empirical studies: boosting (seen later) smaller misclassification rates compared to bagging, reduction of both bias and variance. Different boosting algorithms (Breiman’s arc-x4 and arc- gv). In cases with substantial noise, bagging performs better. Especially used in clinical studies. Why does Bagging work? Breiman: bagging successful because reduces instability of prediction method. Unstable: small perturbations in data  large changes in predictor. Experimental results show variance reduction. Studies suggest that bagging performs some smoothing on the estimates. Grandvalet (2004) argues that bootstrap sampling equalizes effects of highly influential observations. Disadvantage: cannot be visualized easily.
  • 13. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 13 5/9/2018 Ensembles Adaptive Bagging (Breiman, 2001): Mixes bias-reduction boosting with variance-reduction bagging. Uses out-of-bag obs to halt optimizer. Stacking: Previously, same technique used throughout. Stacking (Wolpert 1992) combines different algorithms on single data set. Voting is then used for final classification. Ting and Witten (1999) “stack” the probability distributions (PD) instead. Stacking is “meta-classifier”: combines methods. Pros: takes best from many methods. Cons: un-interpretable, mixture of methods become black-box of predictions. Stacking very prevalent in WEKA.
  • 14. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 14 5/9/2018 5.3) Tree World. 5.3.1) L. Breiman: Bagging. 2.2) L. Breiman: Random Forests
  • 15. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 15 Explanation by way of football example for The Saints. https://gormanalysis.com/random-forest-from-top-to-bottom/ Opponent OppRk SaintsAtHo me Expert1Pre dWin Expert2Pre dWin SaintsWon 1 Falcons 28 TRUE TRUE TRUE TRUE 2 Cowgirls 16 TRUE TRUE TRUE TRUE 3 Eagles 30 FALSE FALSE TRUE TRUE 4 Bucs 6 TRUE FALSE TRUE FALSE 5 Bucs 14 TRUE FALSE FALSE FALSE 6 Panthers 9 FALSE TRUE TRUE FALSE 7 Panthers 18 FALSE FALSE FALSE FALSE Goal: predict when Saints will win. 5 Predictors: Opponent, opponent rank, home game, expert1 and expert2 predictions. If run tree, just one split on opponent because Saints lost to Bucs and Panthers and perfect separation then, but useless for future opponents. Instead, at each step randomly select subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which when combined, should be a smart model. 3 Examples: Tree2 Tree3 Tree1 Tree3 OppRank <= 15 (<=) Left (>) Right Opponent in Cowgirls, eagles, falcons Expert2 pred F =Left T= Right OppRank <= 12.5 (<=) Left (>) Right Opponent in Cowgirls, eagles, falcons (left)
  • 16. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 16 5/9/2018
  • 17. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 17 5/9/2018 Assume following test data and predictions: Opponent OppRk SaintsAtH ome Expert1Pr edWin Expert2Pr edWin 1 Falcons 1 TRUE TRUE TRUE 2 Falcons 32 TRUE TRUE FALSE 3 Falcons 32 TRUE FALSE TRUE Obs Tree1 Tree2 Tree3 MajorityVot e Sample1 FALSE FALSE TRUE FALSE Sample2 TRUE FALSE TRUE TRUE Sample3 TRUE TRUE TRUE TRUE Test data Predictions. Note that probability can be ascribed by counting # votes for each predicted target class and yield good ranking of prob for different classes. But problem: if “Opprk” (2nd best predictor) is in initial group of 3 with “opponent”, it won’t be used as splitter because “opponent” is perfect. Note that there are 10 ways to choose 3 out of 5, and each predictor appears 6 times  “Opponent” dominates 60% of trees, while Opprisk appears without “opponent” just 30% of the time. Could mitigate this effect by also sampling training obs to be used to develop model, giving Opprisk a higher chance to be root (not shown).
  • 18. Further, assume that Expert2 gives perfect predictions when Saints lose (not when they win). Right now, Expert2 as predictor is lost, but if resampling is with replacement, higher chance to use Expert2 as predictor because more losses might just appear. Summary: Data with N rows and p predictors: 1) Determine # of trees to grow. 2) For each tree Randomly sample n <= N rows with replacement. Create tree with m <= p predictors selected randomly at each non- final node. Combine different tree predictions by majority voting (classification trees) or averaging (regression trees). Note that voting can be replaced by average of probabilities, and averaging by medians.
  • 19. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 Ch. 5-19 5/9/2018 Definition of Random Forests. Decision Tree Forest: ensemble (collection) of decision trees whose predictions are combined to make overall prediction for the forest. Similar to TreeBoost (Gradient boosting) model because large number of trees are grown. However, TreeBoost generates series of trees with output of one tree going into next tree in series. In contrast, decision tree forest grows number of independent trees in parallel, and they do not interact until after all of them have been built. Disadvantage: complex model, cannot be visualized like single tree. More “black box” like neural network  advisable to create both single- tree and tree forest model. Single-tree model can be studied to get intuitive understanding of how predictor variables relate, and decision tree forest model can be used to score data and generate highly accurate predictions.
  • 20. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 Ch. 5-20 5/9/2018 Random Forests 1. Random sample of N observations with replacement (“bagging”). On average, about 2/3 of rows selected. Remaining 1/3 called “out of bag (OOB)” obs. New random selection is performed for each tree constructed. 2. Using obs selected in step 1, construct decision tree. Build tree to maximum size, without pruning. As tree is built, allow only subset of total set of predictor variables to be considered as possible splitters for each node. Select set of predictors to be considered as random subset of total set of available predictors. For example, if there are ten predictors, choose five randomly as candidate splitters. Perform new random selection for each split. Some predictors (possibly best one) will not be considered for each split, but predictor excluded from one split may be used for another split in same tree.
  • 21. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 Ch. 5-21 5/9/2018 Random Forests No Overfitting or Pruning. "Over-fitting“: problem in large, single-tree models where model fits noise in data  poor generalization power  pruning. In nearly all cases, decision tree forests do not have problem with over-fitting, and no need to prune trees in forest. Generally, more trees in forest, better fit. Internal Measure of Test Set (Generalization) Error . About 1/3 of observations excluded from each tree in forest, called “out of bag (OOB)”: each tree has different set of out-of-bag observations  each OOB set constitutes independent test sample. To measure generalization error of decision tree forest, OOB set for each tree is run through tree and error rate of prediction is computed.
  • 22. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 22 5/9/2018 Detour: Found in the Internet: PCA and RF. https://stats.stackexchange.com/questions/294791/ how-can-preprocessing-with-pca-but-keeping-the-same-dimensionality-improve-rando ?newsletter=1&nlcode=348729%7c8657 Discovery? “PCA before random forest can be useful not for dimensionality reduction but to give you data a shape where random forest can perform better. I am quite sure that in general if you transform your data with PCA keeping the same dimensionality of the original data you will have a better classification with random forest.” Answer: “Random forest struggles when the decision boundary is "diagonal" in the feature space because RF has to approximate that diagonal with lots of "rectangular" splits. To the extent that PCA re-orients the data so that splits perpendicular to the rotated & rescaled axes align well with the decision boundary, PCA will help. But there's no reason to believe that PCA will help in general, because not all decision boundaries are improved when rotated (e.g. a circle). And even if you do have a diagonal decision boundary, or a boundary that would be easier to find in a rotated space, applying PCA will only find that rotation by coincidence, because PCA has no knowledge at all about the classification component of the task (it is not "y-aware"). Also, the following caveat applies to all projects using PCA for supervised learning: data rotated by PCA may have little-to-no relevance to the classification objective.” DO NOT BELIEVE EVERYTHING THAT APPEARS IN THE WEB!!!!! BE CRITICAL!!!
  • 23. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 23 5/9/2018 Further Developments. Paluszynska (2017) focuses on providing better information on variable importance using RF. RF is constantly being researched and improved.
  • 24. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 24 5/9/2018
  • 25. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 25 5/9/2018 Detour: Underlying idea for boosting classification models (NOT yet GB). (Freund, Schapire, 2012, Boosting: Foundations and Algorithms, MIT) Start with model M(X) and obtain 80% accuracy, or 60% R2, etc. Then Y = M(X) + error1. Hypothesize that error is still correlated with Y.  error1 = G(X) + error2, where we model Error1 now, or In general Error (t - 1) = Z(X) + error (t)  Y = M(X) + G(X) + ….. + Z(X) + error (t-k). If find optimal beta weights to combined models, then Y = b1 * M(X) + b2 G(X) + …. + Bt Z(X) + error (t-k) Boosting is “Forward Stagewise Ensemble method” with single data set, iteratively reweighting observations according to previous error, especially focusing on wrongly classified observations. Philosophy: Focus on most difficult points to classify in previous step by reweighting observations.
  • 26. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 26 5/9/2018 Main idea of GB using trees (GBDT). Let Y be target, X predictors such that f 0(X) weak model to predict Y that just predicts mean value of Y. “weak” to avoid over- fitting. Improve on f 0(X) by creating f 1(X) = f 0(X) + h (x). If h perfect model  f 1(X) = y  h (x) = y - f 0(X) = residuals = negative gradients of loss (or cost) function. Residual Fitting -(y – f(x)) -1; 1
  • 27. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 27 5/9/2018 Explanation of GB by way of example.. /blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/ Predict age in following data set by way of trees, conitnuous target  regression tree. Predict age, loss function: SSE. PersonID Age LikesGardenin g PlaysVideoGa mes LikesHats 1 13 FALSE TRUE TRUE 2 14 FALSE TRUE FALSE 3 15 FALSE TRUE FALSE 4 25 TRUE TRUE TRUE 5 35 FALSE TRUE TRUE 6 49 TRUE FALSE FALSE 7 68 TRUE TRUE TRUE 8 71 TRUE FALSE FALSE 9 73 TRUE FALSE TRUE
  • 28. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 28 5/9/2018 Only 9 obs in data, thus we allow Tree to have very Small # obs in final Nodes. We want Videos variable because we suspect it’s important. But doing so (by allowing few obs in final nodes) also brought in split in “hats”, that seems irrelevant and just noise leading to over-fitting, because tree searches in smaller and smaller areas of data as it progresses. Let’s go in steps and look at the results of Tree1 (before second splits), stopping at first split, where predictions are 19.25 and 57.2 and obtain residuals. root Likes gardening F T 19.25 57.2 Hats F T Videos F T Tree 1
  • 29. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 29 5/9/2018 Run another tree using Tree1 residuals as new target. PersonID Age Tree1 Predictio n Tree1 Residual 1 13 19.25 -6.25 2 14 19.25 -5.25 3 15 19.25 -4.25 4 25 57.2 -32.2 5 35 19.25 15.75 6 49 57.2 -8.2 7 68 57.2 10.8 8 71 57.2 13.8 9 73 57.2 15.8
  • 30. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 30 5/9/2018 New root Video Games F T 7.133 -3.567 Tree 2 Note: Tree2 did not use “Likes Hats” because between Hats and VideoGames, videogames is preferred when using all obs instead of in full Tree1 in smaller region of the data where hats appear. And thus noise is avoided. Tree 1 SSE = 1994 Tree 2 SSE = 1765 PersonID Age Tree1 Prediction Tree1 Residual Tree2 Prediction Combined Prediction Final Residual 1 13 19.25 -6.25 -3.567 15.68 2.683 2 14 19.25 -5.25 -3.567 15.68 1.683 3 15 19.25 -4.25 -3.567 15.68 0.6833 4 25 57.2 -32.2 -3.567 53.63 28.63 5 35 19.25 15.75 -3.567 15.68 -19.32 6 49 57.2 -8.2 7.133 64.33 15.33 7 68 57.2 10.8 -3.567 53.63 -14.37 8 71 57.2 13.8 7.133 64.33 -6.667 9 73 57.2 15.8 7.133 64.33 -8.667 Combined pred for PersonID 1: 15.68 = 19.25 – 3.567
  • 31. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 31 5/9/2018 So Far 1) Started with ‘weak’ model F0(x) = y 2) Fitted second model to residuals h1(x) = y – F0(x) 3) Combined two previous models F2(x) = F1(x) + h1(x). Notice that h1(x) could be any type of model (stacking), not just trees. And continue re-cursing until M. Initial weak model was “mean” because well known that mean minimizes SSE. Q: how to choose M, gradient boosting hyper parameter? Usually cross-validation. 4) Alternative to mean: minimize Absolute error instead of SSE as loss function. More expensive because minimizer is median, computationally expensive. In this case, in Tree 1 above, use median (y) = 35, and obtain residuals. PersonID Age F0 Residual0 1 13 35 -22 2 14 35 -21 3 15 35 -20 4 25 35 -10 5 35 35 0 6 49 35 14 7 68 35 33 8 71 35 36 9 73 35 38
  • 32. Focus on observations 1 and 4 with respective residuals of -22 and -10 respectively to understand median case. Under SSE Loss function (standard Tree regression), a reduction in residuals of 1 unit, drops SSE by 43 and 19 resp. ( e.g., 22 * 22 – 21 * 21, 100 - 81) while for absolute loss, reduction is just 1 and 1 (22 – 21, 10 – 9)  SSE reduction will focus more on first observation because of 43, while absolute error focuses on all obs because they are all 1  Instead of training subsequent trees on residuals of F0, train h0 on gradient of loss function (L(y, F0(x)) w.r.t y-hats produced by F0(x). With absolute error loss, subsequent h trees will consider sign of every residual, as opposed to SSE loss that considers magnitude of residual. Gradient of SSE = which is “– residual”  this is a gradient descent algorithm. For Absolute Error: Each h tree groups observations into final nodes, and average gradient can be calculated in each and scaled by factor γ, such that Fm + γm hm minimizes loss function in each node. Shrinkage: For each gradient step, magnitude is multiplied by factor that ranges between 0 and 1 and called learning rate  each gradient step is shrunken allowing for slow convergence toward observed values  observations close to target values end up grouped into larger nodes, thus regularizing the method. Finally before each new tree step, row and column sampling occur to produce more different tree splits (similar to Random Forests). ˆ ˆ , ˆ | | ˆ ˆ , ( ) 1 1 ˆ                (AE) Y Y Y Y Absolute Error Y Y Y Y Y Y dAE Gradient AE or dY
  • 33. Results for SSE and Absolute Error: SSE case Age F0 PseudoR esidual0 h0 gamma0 F1 PseudoR esidual1 h1 gamma1 F2 13 40.33 -27.33 -21.08 1 19.25 -6.25 -3.567 1 15.68 14 40.33 -26.33 -21.08 1 19.25 -5.25 -3.567 1 15.68 15 40.33 -25.33 -21.08 1 19.25 -4.25 -3.567 1 15.68 25 40.33 -15.33 16.87 1 57.2 -32.2 -3.567 1 53.63 35 40.33 -5.333 -21.08 1 19.25 15.75 -3.567 1 15.68 49 40.33 8.667 16.87 1 57.2 -8.2 7.133 1 64.33 68 40.33 27.67 16.87 1 57.2 10.8 -3.567 1 53.63 71 40.33 30.67 16.87 1 57.2 13.8 7.133 1 64.33 73 40.33 32.67 16.87 1 57.2 15.8 7.133 1 64.33 h1 root -21.08 16.87 h0 Gardening F T root Videos F T 7.133 -3.567 E.g., for first observation. 40.33 is mean age, -27.33 = 13 – 40.33, -21.08 prediction due to gardening = F. F1 = 19.25 = 40.33 – 21.08. PseudoRes1 = 13 – 19.25, F2 = 19.25 – 3.567 = 15.68. Gamma0 = avg (pseudoresidual0 / h0) (by diff. values of h0). Same for gamma1.
  • 34. Results for SSE and Absolute Error: Absolute Error case. root -1 0.6 h0 Gardening F T h1 root Videos F T 0.333 -0.333 Age F0 PseudoResi dual0 h0 gamma0 F1 PseudoRes idual1 h1 gamma1 F2 13 35 -1 -1 20.5 14.5 -1 -0.3333 0.75 14.25 14 35 -1 -1 20.5 14.5 -1 -0.3333 0.75 14.25 15 35 -1 -1 20.5 14.5 1 -0.3333 0.75 14.25 25 35 -1 0.6 55 68 -1 -0.3333 0.75 67.75 35 35 -1 -1 20.5 14.5 1 -0.3333 0.75 14.25 49 35 1 0.6 55 68 -1 0.3333 9 71 68 35 1 0.6 55 68 -1 -0.3333 0.75 67.75 71 35 1 0.6 55 68 1 0.3333 9 71 73 35 1 0.6 55 68 1 0.3333 9 71 E.g., for 1st observation. 35 is median age, residual = -1, 1 if residual >0 or < 0 resp. F1 = 14.5 because 35 + 20.5 * (-1). F2 = 14.25 = 14.5 + 0.75 * (-0.3333). Predictions within leaf nodes computed by “mean” of obs therein. Gamma0 = median ((age – F0) / h0) = avg ((14 – 35) / -1; (15 – 35) / -1)) = 20.5 55 = (68 – 35) / .06 . Gamma1 = median ((age – F1) / h1) by different values of h1 (and of h0 for gamma0).
  • 35. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 35 5/9/2018 Quick description of GB using trees (GBDT). 1) Create very small tree as initial model, ‘weak’ learner, (e.g., tree with two terminal nodes. (  depth = 1). ‘WEAK’ avoids over-fitting and local minina, and predicts, F1, for each obs. Tree1. 2) Each tree allocates a probability of event or a mean value in each terminal node, according to the nature of the dependent variable or target. 3) Compute “residuals” (prediction error) for every observation (if 0-1 target, apply logistic transformation to linearize them, p / 1 – p). 4) Use residuals as new ‘target variable and grow second small tree on them (second stage of the process, same depth). To ensure against over-fitting, use random sample without replacement ( “stochastic gradient boosting”.) Tree2. 5) New model, once second stage is complete, we obtain concatenation of two trees, Tree1 and Tree2 and predictions F1 + F2 * gamma, gamma multiplier or shrinkage factor (called step size in gradient descent). 6) Iterate procedure of computing residuals from most recent tree, which become the target of the new model, etc. 7) In the case of a binary target variable, each tree produces at least some nodes in which ‘event’ is majority (‘events’ are typically more difficult to identify since most data sets contain very low proportion of ‘events’ in usual case). 8) Final score for each observation is obtained by summing (with weights) the different scores (probabilities) of every tree for each observation. Why does it work? Why “gradient” and “boosting”?
  • 36. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 36 5/9/2018 Comparing GBDT vs Trees in point 4 above (I). GBDT takes sample from training data to create tree at each iteration, CART does not. Below, notice differences between with sample proportion of 60% for GBDT and no sample for generic trees for the fraud data set, Total_spend is the target. Predictions are similar. IF doctor_visits < 8.5 THEN DO; /* GBDT */ _prediction_ + -1208.458663; END; ELSE DO; _prediction_ + 1360.7910083; END; IF 8.5 <= doctor_visits THEN DO; /* GENERIC TREES*/ P_pseudo_res0 = 1378.74081896893; END; ELSE DO; P_pseudo_res0 = -1290.94575707227; END;
  • 37. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 37 5/9/2018 Comparing GBDT vs Trees in point 4 above (II). Again, GBDT takes sample from training data to create tree at each iteration, CART does not. If we allow for CART to work with same proportion sample but different seed, splitting variables may be different at specific depth of tree creation. /* GBDT */ IF doctor_visits < 8.5 THEN DO; _ARB_F_ + -579.8214325; END; EDA of two samples would ELSE DO; indicate subtle differences _ARB_F_ + 701.49142697; that induce differences in END; selected splitting variables. END; / ORIGINAL TREES */ IF 183.5 <= member_duration THEN DO; P_pseudo_res0 = 1677.87318718526; END; ELSE DO; P_pseudo_res0 = -1165.32773940565; END;
  • 38. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 38 5/9/2018 More Details Friedman’s general 2001 GB algorithm: 1) Data (Y, X), Y (N, 1), X (N, p) 2) Choose # iterations M 3) Choose loss function L(Y, F(x), Error), and corresponding gradient, i.e., 0-1 loss function, and residuals are corresponding gradient. Function called ‘f’. Loss f implied by Y. 4) Choose base learner h( X, θ), say shallow trees. Algorithm: 1: initialize f0 with a constant, usually mean of Y. 2: for t = 1 to M do 3: compute negative gradient gt(x), i.e., residual from Y as next target. 4: fit a new base-learner function h(x, θt), i.e., tree. 5: find best gradient descent step-size, and min Loss f: 6: update function estimate: 8: end for (all f function are function estimates, i.e., ‘hats’). 0 < n t t t i t 1 i i γ i 1 , 1 γ argmin L(y ,f (x ) γh (x )) γ          t t 1 t t t f f (x) γ h (x,θ )
  • 39. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 39 5/9/2018 Specifics of Tree Gradient Boosting, called TreeBoost (Friedman). Friedman’s 2001 GB algorithm for tree methods: Same as previous one, and jt prediction of tree t in final node N for tree 'm'. J t jt jm j 1 jt h (x) p I(x N ) p :     t t-1 In TreeBoost Friedman proposes to find optimal in each final node instead of unique at every iteration. Then f (x)=f (x)+ i jt jm J jt t jt j 1 jt i t 1 i t i γ x N , γ h (x)I(x N ), γ argmin L(y ,f (x ) γh (x )) γ γ,        
  • 40. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 40 5/9/2018 Parallels with Stepwise (regression) methods. Stepwise starts from original Y and X, and in later iterations turns to residuals, and reduced and orthogonalized X matrix, where ‘entered’ predictors are no longer used and orthogonalized away from other predictors. GBDT uses residuals as targets, but does not orthogonalize or drop any predictors. Stepwise stops either by statistical inference, or AIC/BIC search. GBDT has a fixed number of iterations. Stepwise has no ‘gamma’ (shrinkage factor).
  • 41. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 41 5/9/2018 Setting. Hypothesize existence of function Y = f (X, betas, error). Change of paradigm, no MLE (e.g., logistic, regression, etc) but loss function. Minimize Loss function itself, its expected value called risk. Many different loss functions available, gaussian, 0-1, etc. A loss function describes the loss (or cost) associated with all possible decisions. Different decision functions or predictor functions will tend to lead to different types of mistakes. The loss function tells us which type of mistakes we should be more concerned about. For instance, estimating demand, decision function could be linear equation and loss function could be squared or absolute error. The best decision function is the function that yields the lowest expected loss, and the expected loss function is itself called risk of an estimator. 0-1 assigns 0 for correct prediction, 1 for incorrect.
  • 42. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 65 5/9/2018 Key Details. Friedman’s 2001 GB algorithm: Need 1) Loss function (usually determined by nature of Y (binary, continuous…)) (NO MLE). 2) Weak learner, typically tree stump or spline, marginally better classifier than random (but by how much?). 3) Model with T Iterations: # nodes in each tree; L2 or L1 norm of leaf weights; other. Function not directly opti T t i t 1 n T i i k i 1 t 1 ŷ tree (X) ˆ Objective function : L(y , y ) Ω(Tree ) Ω {          mized by GB.}
  • 43. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 43 5/9/2018 L2-error penalizes symmetrically away from 0, Huber penalizes less than OLS away from [-1, 1], Bernoulli and Adaboost are very similar. Note that Y ε [-1, 1] in 0-1 case here.
  • 44. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 44 5/9/2018
  • 45. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 45 5/9/2018 Gradient Descent. “Gradient” descent method to find minimum of function. Gradient: multivariate generalization of derivative of function in one dimension to many dimensions. I.e., gradient is vector of partial derivatives. In one dimension, gradient is tangent to function. Easier to work with convex and “smooth” functions. convex Non-convex
  • 46. Gradient Descent. Let L (x1, x2) = 0.5 * (x1 – 15) **2 + 0.5 * (x2 – 25) ** 2, and solve for X1 and X2 that min L by gradient descent. Steps: Take M = 100. Starting point s0 = (0, 0) Step size = 0.1 Iterate m = 1 to M 1. Calculate gradient of L at sm – 1 2. Step in direction of greatest descent (negative gradient) with step size γ, i.e., If γ mall and M large, sm minimizes L. Additional considerations: Instead of M iterations, stop when next improvement small. Use line search to choose step sizes (Line search chooses search in descent direction of minimization). How does it work in gradient boosting? Objective is Min L, starting from F0(x). For m = 1, compute gradient of L w.r.t F0(x). Then fit weak learner to gradient components  for regression tree, obtain average gradient in each final node. In each node, step in direction of avg. gradient using line search to determine step magnitude. Outcome is F1, and repeat. In symbols: Initialize model with constant: F0(x) = mean, median, etc. For m = 1 to M Compute pseudo residual fit base learner h to residuals compute step magnitude gamma m (for trees, different gamma for each node) Update Fm(x) = Fm-1(x) + γm hm(x)
  • 47. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 47 5/9/2018 “Gradient” descent Method of gradient descent is a first order optimization algorithm that is based on taking small steps in direction of the negative gradient at one point in the curve in order to find the (hopefully global) minimum value (of loss function). If it is desired to search for the maximum value instead, then the positive gradient is used and the method is then called gradient ascent. Second order not searched, solution could be local minimum. Requires starting point, possibly many to avoid local minima.
  • 48. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 48 5/9/2018
  • 49. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 Ch. 5-49 5/9/2018 Comparing full tree (depth = 6) to boosted tree residuals by iteration.. 2 GB versions: 1) with raw 20% events (M1), 2) with 50/50 mixture of events (M2). Non GB Tree (referred as maxdepth 6 for M1 data set) the most biased. Notice that M2 stabilizes earlier than M1. X axis: Iteration #. Y axis: average Residual. “Tree Depth 6” obviously unaffected by iteration since it’s single tree run. 1.5969917399003E-15 -2.9088316687833E-16 Tree depth 6 2.83E-15 0 2 4 6 8 10 Iteration -5E-15 -2.5E-15 0 2.5E-15 5E-15 MEAN_RESID_M1_TRN_TREES MEAN_RESID_M2_TRN_TREES MEAN_RESID_M1_TRN_TREES Avg residuals by iteration by model names in gradient boosting Vertical line - Mean stabilizes
  • 50. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 50 5/9/2018 Comparing full tree (depth = 6) to boosted tree residuals by iteration.. Now Y = Var. of resids. M2 has highest variance followed by Depth 6 (single tree) and then M1. M2 stabilizes earlier as well. In conclusion, M2 has lower bias and higher variance than M1 in this example, difference lies on mixture of 0-1 in target variable. 0.1218753847 8 0.1781230782 5 Depth 6 = 0.145774 0.1219 0.1404 0.159 0.1775 0.196 0.2146 Var of Resids 0 2 4 6 8 10 Iteration VAR_RESID_M2_TRN_TREES VAR_RESID_M1_TRN_TREES Variance of residuals by iteration in gradient boosting Vertical line - Variance stabilizes
  • 51. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 51 5/9/2018
  • 52. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 52 5/9/2018 Important Message N 1 Basic information on the original data set.s: 1 .. 1 Data set name ........................ train 1 . # TRN obs ............... 3595 1 Validation data set ................. validata 1 . # VAL obs .............. 2365 1 Test data set ................ 1 . # TST obs .............. 0 1 ... 1 Dep variable ....................... fraud 1 ..... 1 Pct Event Prior TRN............. 20.389 1 Pct Event Prior VAL............. 19.281 1 Pct Event Prior TEST ............ 1 TRN and VAL data sets obtained by random sampling Without replacement. .
  • 53. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 53 5/9/2018 Variable Label 1 FRAUD Fraudulent Activity yes/no total_spend Total spent on opticals 1 doctor_visits Total visits to a doctor 1 no_claims No of claims made recently 1 member_duration Membership duration 1 optom_presc Number of opticals claimed 1 num_members Number of members covered 5 3 Fraud data set, original 20% fraudsters. Study alternatives of changing number of iterations from 3 to 50 and depth from 1 to 10 with training and validation data sets. Original Percentage of fraudsters 20% in both data sets.. Notice just 5 predictors, thus max number of iterations is 50 as exaggeration. In usual large databases, number of iterations could reach 1000 or higher.
  • 54. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 54 5/9/2018 E.g., M5_VAL_GRAD_BOOSTING: M5 case with validation data set and using gradient boosting as modeling technique. Model # 10 as identifier. Requested Models: Names & Descriptions. Model # Full Model Name Model Description *** Overall Models -1 M1 Raw 20pct depth 1 iterations 3 -10 M2 Raw 20pct depth 1 iterations 10 -10 M3 Raw 20pct depth 5 iterations 3 -10 M4 Raw 20pct depth 5 iterations 10 -10 M5 Raw 20pct depth 10 iterations 50 -10 01_M1_TRN_GRAD_BOOSTING Gradient Boosting 1 02_M1_VAL_GRAD_BOOSTING Gradient Boosting 2 03_M2_TRN_GRAD_BOOSTING Gradient Boosting 3 04_M2_VAL_GRAD_BOOSTING Gradient Boosting 4 05_M3_TRN_GRAD_BOOSTING Gradient Boosting 5 06_M3_VAL_GRAD_BOOSTING Gradient Boosting 6 07_M4_TRN_GRAD_BOOSTING Gradient Boosting 7 08_M4_VAL_GRAD_BOOSTING Gradient Boosting 8 09_M5_TRN_GRAD_BOOSTING Gradient Boosting 9 10_M5_VAL_GRAD_BOOSTING Gradient Boosting 10
  • 55. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 55 5/9/2018
  • 56. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 56 5/9/2018 All agree on No_claims as First split. Disagreement at depth 2. M2 does not use member duration.
  • 57. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 57 5/9/2018 M5 (# 9) yields different importance levels.
  • 58. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 58 5/9/2018 Probability range largest for M5,
  • 59. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 59 5/9/2018 M5 best per AUROC Also when validated.
  • 60. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 60 5/9/2018
  • 61. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 61 5/9/2018 Huge jump in performance Per R-square measure.
  • 62. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 62 5/9/2018 M5 (#09) obvious winner.
  • 63. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 63 5/9/2018
  • 64. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 64 5/9/2018 Overall conclusion for GB parameters While higher values of number of iterations and depth imply longer (and possibly significant) computer runs, constraining these parameters can have significant negative effects on model results. In context of thousands of predictors, computer resource availability might significantly affect model results.
  • 65. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 65 5/9/2018
  • 66. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 66 5/9/2018 Overall Ensembles. Given specific classification study and many different modeling techniques, create logistic regression model with original target variable and the different predictions from the different models, without variable selection (this is not critical). Evaluate importance of different models either via p-values or partial dependency plots. Note: It is not Stacking, because Stacking “votes” to decide on final classification.
  • 67. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 67 5/9/2018
  • 68. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 68 5/9/2018 Partial Dependency plots (PDP). Due to GB’s (and other methods’) black-box nature, these plots show the effect of predictor X on modeled response once all other predictors have been marginalized (integrated away). Marginalized Predictors usually fixed at constant value, typically mean. Graphs may not capture nature of variable interactions especially if interaction significantly affects model outcome. Formally, PDP of F(x1, x2, xp) on X is E(F) over all vars except X. Thus, for given Xs, PDP is average of predictions in training with Xs kept constant. Since GB, Boosting, Bagging, etc are BLACK BOX models, use PDP to obtain model interpretation. Also useful for logistic models.
  • 69. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 69 5/9/2018
  • 70. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 70 5/9/2018 Analytical problem to investigate. Optical Health Care fraud insurance patients. Longer care typically involves higher treatment costs and insurance company has to set up reserves immediately as soon as a case is opened. Sometimes doctors involve in fraud. Aim: predict fraudulent charges  classification problem; we’ll use a battery of models and compare them, with and without 50/50 original training sample. Below left, original data (M1 models), right 50/50 training (M2 models). Notice 1 From ..... **************************************************************** 1 ................. Basic information on the original data set.s: 1 ................. .. 1 ................. Data set name ........................ train 1 ................. Num_observations ................ 3595 1 ................. Validation data set ................. validata 1 ................. Num_observations .............. 2365 1 ................. Test data set ................ 1 ................. Num_observations .......... 0 1 ................. ... 1 ................. Dep variable ....................... fraud 1 ................. ..... 1 ................. Pct Event Prior TRN............. 20.389 1 ................. Pct Event Prior VAL............. 19.281 1 ................. Pct Event Prior TEST ............ 1 ************************************************************* **** 1 Notice 1 From ..... **************************************************************** 1 ................. Basic information on the original data set.s: 1 ................. .. 1 ................. Data set name ........................ sampled50_50 1 ................. Num_observations ................ 1133 1 ................. Validation data set ................. validata50_50 1 ................. Num_observations .............. 4827 1 ................. Test data set ................ 1 ................. Num_observations .......... 0 1 ................. ... 1 ................. Dep variable ....................... fraud 1 ................. ..... 1 ................. Pct Event Prior TRN............. 50.838 1 ................. Pct Event Prior VAL............. 12.699 1 ................. Pct Event Prior TEST ............ 1 ***************************************************************** 1
  • 71. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 71 5/9/2018 Requested Models: Names & Descriptions. Model # Full Model Name Model Description *** Overall Models -1 M1 Raw 20pct -10 M2 50/50 prior for TRN -10 01_M1ENSEMBLE_TRN_LOGISTIC_NONE Logistic TRN NONE 1 02_M1ENSEMBLE_VAL_LOGISTIC_NONE Logistic VAL NONE 2 03_M1_TRN_BAGGING Bagging TRN Bagging 3 04_M1_TRN_GRAD_BOOSTING Gradient Boosting 4 05_M1_TRN_LOGISTIC_STEPWISE Logistic TRN STEPWISE 5 06_M1_TRN_RFORESTS Random Forests 6 07_M1_TRN_TREES Trees TRN Trees 7 08_M1_VAL_BAGGING Trees VAL Trees 8 09_M1_VAL_GRAD_BOOSTING Gradient Boosting 9 10_M1_VAL_LOGISTIC_STEPWISE Logistic VAL STEPWISE 10 11_M1_VAL_RFORESTS Random Forests 11 12_M1_VAL_TREES Trees VAL Trees 12 13_M2ENSEMBLE_TRN_LOGISTIC_NONE Logistic TRN NONE 13 14_M2ENSEMBLE_VAL_LOGISTIC_NONE Logistic VAL NONE 14 15_M2_TRN_BAGGING Bagging TRN Bagging 15 16_M2_TRN_GRAD_BOOSTING Gradient Boosting 16 17_M2_TRN_LOGISTIC_STEPWISE Logistic TRN STEPWISE 17 18_M2_TRN_RFORESTS Random Forests 18 19_M2_TRN_TREES Trees TRN Trees 19 20_M2_VAL_BAGGING Trees VAL Trees 20 21_M2_VAL_GRAD_BOOSTING Gradient Boosting 21 22_M2_VAL_LOGISTIC_STEPWISE Logistic VAL STEPWISE 22 23_M2_VAL_RFORESTS Random Forests 23 24_M2_VAL_TREES Trees VAL Trees 24 E.g., 08_M1_VAL_BAGGING: 8th model of M1 data set case, Validation and using Bagging as the modeling technique.
  • 72. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 72 5/9/2018 For models other than Tree themselves, modeled posterior probabilities via interval valued target variable. For simplicity just first 4 levels of trees are shown. Notation: M5_GB_TRN_TREES: Model M5, Tree simulation of Gradient boosting run. BG: Bagging, RF: Random Forests, LG logistic. Intention: obtain general idea of tree representation for comparison to standard tree model. . Next page: small detail for BG (Bagging), GB Gradient Boosting and Trees themselves (notice leve difference). Later, graphical comparison of vars + splits at each tree level.
  • 73. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 73 5/9/2018 Requested Tree Models: Names & Descriptions. Model Name Level 1 + prob Level 2 + prob Level 3 + prob Level 4 + prob M1_BG_TRN_TREES no_claims <= 0.5 (0.142) member_duration <= 180 (0.200) total_spend <= 52.5 (0.519) total_spend > 52.5 (0.184) member_duration > 180 (0.062) doctor_visits <= 5.5 (0.092) doctor_visits > 5.5 (0.048) no_claims > 0.5 (0.446) no_claims <= 3.5 (0.394) member_duration <= 127 (0.545) member_duration > 127 (0.320) no_claims > 3.5 (0.788) optom_presc <= 3.5 (0.783) optom_presc > 3.5 (0.817) M1_GB_TRN_TREES no_claims <= 2.5 (0.184) no_claims <= 0.5 (0.158) member_duration <= 180 (0.199) member_duration > 180 (0.102) no_claims > 0.5 (0.321) optom_presc <= 3.5 (0.287) optom_presc > 3.5 (0.615) no_claims > 2.5 (0.634) no_claims <= 4.5 (0.570) optom_presc <= 3.5 (0.537) optom_presc > 3.5 (0.839) no_claims > 4.5 (0.764) member_duration <= 303 (0.781) M1_GB_TRN_TREES no_claims > 2.5 (0.634) no_claims > 4.5 (0.764) member_duration > 303 (0.656) M1_RF_TRN_TREES total_spend <= 50.5 (0.396) no_claims <= 0.5 (0.328) member_duration <= 181 (0.375) member_duration > 181 (0.152) no_claims > 0.5 (0.552) member_duration <= 1.66 (0.648) member_duration > 1.66 (0.397) total_spend > 50.5 (0.197) no_claims <= 0.5 (0.182) optom_presc <= 5.5 (0.180) optom_presc > 5.5 (0.296) no_claims > 0.5 (0.257) total_spend <= 86.5 (0.458) total_spend > 86.5 (0.227) M1_TRN_TREES no_claims <= 0.5 (0.142) member_duration > 180 (0.062) doctor_visits <= 5.5 (0.113) doctor_visits > 5.5 (0.038) member_duration <= 150 (0.974) member_duration > 150 (0.577) member_duration <= 180 (0.201) total_spend <= 42.5 (0.718) total_spend > 42.5 (0.189) member_duration <= 325 (0.099)
  • 74. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 74 5/9/2018 Tree representations: by level and by node.
  • 75. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 75 5/9/2018 Tree representation comparisons, level 1. Except for RF (03), all methods split on no_claims at 0.5 but attain different event probabilities.
  • 76. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 76 5/9/2018 RF (M2 07) splits uniquely on optom_presc. Notice that the split values for member_duration and no_claims are not necessarily the same across models.
  • 77. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 77 5/9/2018
  • 78. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 78 5/9/2018
  • 79. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 79 5/9/2018
  • 80. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 80 5/9/2018
  • 81. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 81 5/9/2018 Etc. Next: how do variables behave in each model (omitting LG)?
  • 82. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 82 5/9/2018 Tree representations: by level and by split variable.
  • 83. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 83 5/9/2018 RF is alone in selecting total_spend. Notice prob < 0.4 in both nodes, compared to ~0.78 above for M2_TRN_TREES. In later levels, RF remains relatively apart from the other models.
  • 84. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 84 5/9/2018
  • 85. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 85 5/9/2018
  • 86. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 86 5/9/2018
  • 87. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 87 5/9/2018
  • 88. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 88 5/9/2018
Requested ENSEMBLE Tree Models: Names & Descriptions (levels 1-4; probabilities not reported, shown as ( . )).
M1_NSMBL_LG_TRN_TREES: p_M1_RFORESTS < 0.32284 ( . ); p_M1_RFORESTS < 0.21769 ( . ); p_M1_RFORESTS < 0.12995 ( . ); p_M1_RFORESTS >= 0.12995 ( . ); p_M1_RFORESTS >= 0.21769 ( . ); p_M1_LOGISTIC_STEPWISE < 0.2138 ( . ); p_M1_LOGISTIC_STEPWISE >= 0.2138 ( . ); p_M1_RFORESTS >= 0.32284 ( . ); p_M1_RFORESTS < 0.47186 ( . ); p_M1_LOGISTIC_STEPWISE < 0.36438 ( . ); p_M1_LOGISTIC_STEPWISE >= 0.36438 ( . ); p_M1_RFORESTS >= 0.47186 ( . ); p_M1_RFORESTS < 0.64668 ( . ); p_M1_RFORESTS >= 0.64668 ( . )
M2_NSMBL_LG_TRN_TREES (continued on next slide): p_M2_GRAD_BOOSTING < 0.53437 ( . ); p_M2_GRAD_BOOSTING < 0.36111 ( . ); p_M2_RFORESTS < 0.30466 ( . ); p_M2_RFORESTS >= 0.30466 ( . )
  • 89. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 89 5/9/2018
Requested ENSEMBLE Tree Models: Names & Descriptions (continued).
M2_NSMBL_LG_TRN_TREES: p_M2_GRAD_BOOSTING < 0.53437 ( . ); p_M2_GRAD_BOOSTING >= 0.36111 ( . ); p_M2_GRAD_BOOSTING < 0.47453 ( . ); p_M2_GRAD_BOOSTING >= 0.47453 ( . ); p_M2_GRAD_BOOSTING >= 0.53437 ( . ); p_M2_GRAD_BOOSTING < 0.62556 ( . ); p_M2_BAGGING < 0.34876 ( . ); p_M2_BAGGING >= 0.34876 ( . ); p_M2_GRAD_BOOSTING >= 0.62556 ( . ); p_M2_RFORESTS < 0.81591 ( . ); p_M2_RFORESTS >= 0.81591 ( . )
The M1 ensemble splits mostly on the RF score; the M2 ensemble mostly on the Gradient Boosting score.
  • 90. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 90 5/9/2018 Conclusion on tree representations. no_claims at 0.5 is certainly the top splitter, but notice that the event probabilities diverge (because the RF, GB and BG representations model a posterior probability, not a binary event, and thus carry information from a previous model). Later splits diverge in predictors and split values. It is important to view each tree model independently to gauge interpretability, and to keep in mind that the dependent variable in the models other than trees is the event probability produced by BG, RF or GB. It is also important to view these findings in terms of variable importance.
  • 91. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 91 5/9/2018 Importance Measures For Tree based Methods.
  • 92. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 92 5/9/2018
  • 93. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 93 5/9/2018
  • 94. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 94 5/9/2018
  • 95. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 95 5/9/2018 RF and GB significant; BG, GB, STPW significant. Tree methods find no_claims most important; logistic finds most predictors important.
  • 96. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 96 5/9/2018
  • 97. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 97 5/9/2018 Tree-based methods do not reach the top probability of 1.
  • 98. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 98 5/9/2018 Not over-fitted. Some strong over-fit.
  • 99. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 99 5/9/2018 The degree of over-fit differs from that seen in the classification rates (previous slide).
  • 100. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 100 5/9/2018 Note that the TRN and VAL rankings do not match. Models ranked lower on VAL tend to overfit more.
  • 101. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 101 5/9/2018
  • 102. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 102 5/9/2018
  • 103. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 103 5/9/2018
  • 104. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 104 5/9/2018
  • 105. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 105 5/9/2018 M2-Ensemble has best average Validation ranking, Random Forests worst.
  • 106. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 106 5/9/2018 The two ensembles and the two gradient boosting models are the best performers.
  • 107. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 107 5/9/2018
  • 108. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 108 5/9/2018 50/50: scales shifted up.
  • 109. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 109 5/9/2018
  • 110. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 110 5/9/2018
  • 111. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 111 5/9/2018 Very interesting, almost U-shaped relationship, conditioned on the other variables in the model.
  • 112. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 112 5/9/2018
  • 113. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 113 5/9/2018
  • 114. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 114 5/9/2018
  • 115. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 115 5/9/2018 Different K-S values.
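The K-S values compared here can be computed directly from each model's scores. The sketch below is my own illustration with assumed names (`y`, `p_hat`), not the slides' code: the statistic is the maximum distance between the score distributions of events and non-events.

```python
# Hedged sketch: K-S separation between event and non-event score distributions.
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(y, p_hat):
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    return ks_2samp(p_hat[y == 1], p_hat[y == 0]).statistic

# Toy demonstration with synthetic scores; repeat per model to compare K-S values.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
p_hat = np.clip(0.3 * y + rng.uniform(0.0, 0.7, size=1000), 0, 1)
print(round(ks_statistic(y, p_hat), 3))
```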
  • 116. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 116 5/9/2018
  • 117. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 117 5/9/2018 While all tree models choose no_claims as most important, the 50/50 trees (M2_TREES) selected just no_claims, whereas M1_TREES selected 3 additional predictors. BG, RF and GB are not similarly affected.
  • 118. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 118 5/9/2018 The M2 tree grows smaller trees and lowers misclassification from 0.5 to about 0.27; M1 from 0.2 to about 0.15.
  • 119. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 119 5/9/2018 Similarly for ASE.
  • 120. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 120 5/9/2018 M1 trees achieve a wider range of posterior probabilities.
  • 121. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 121 5/9/2018 Conclusion on 50/50 resampling. In this example, 50/50 resampled models yielded a smaller tree with worse performance than its raw counterpart. Performance of the best models was not affected by the choice of 50/50 versus raw priors.
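The mechanics behind the 50/50 comparison can be sketched as follows; this is an assumed illustration (function names and the prior-correction formula are mine, not the slides'): down-sample non-events to match events for training, and, if raw-scale probabilities are needed, map the 50/50 posteriors back to the raw prior.

```python
# Hedged sketch: build a 50/50 training sample and convert its posteriors
# back to the raw prior pi1 (standard Bayes correction for oversampling).
import numpy as np

def resample_50_50(X, y, random_state=0):
    """Keep all events and an equal-sized random sample of non-events."""
    rng = np.random.default_rng(random_state)
    idx1 = np.flatnonzero(y == 1)
    idx0 = rng.choice(np.flatnonzero(y == 0), size=idx1.size, replace=False)
    keep = np.concatenate([idx1, idx0])
    return X[keep], y[keep]

def adjust_to_raw_prior(p, pi1, rho1=0.5):
    """Posteriors estimated under training prior rho1, re-expressed under prior pi1."""
    num = p * pi1 / rho1
    return num / (num + (1 - p) * (1 - pi1) / (1 - rho1))
```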
  • 122. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 122 5/9/2018
  • 123. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 123 5/9/2018 XGBoost. Developed by Chen and Guestrin (2016), "XGBoost: A Scalable Tree Boosting System." Claims to be faster and more accurate than neural networks and Random Forests. Uses second-order gradients of the loss function, obtained from a Taylor expansion, plugged into the same boosting algorithm for greater generality. In addition, it turns the loss function into a more elaborate objective containing regularization terms that penalize tree growth, with the penalty proportional to the size of the leaf (node) weights, thus limiting overfitting. More efficient than GB thanks to parallel computing on a single machine (roughly 10 times faster). The algorithm exploits a decomposition of the objective function that allows it to outperform standard GB. Not yet available in SAS; available in R, Julia, Python and via a CLI. Used in many champion models in recent competitions (Kaggle, etc.). See also Foster's (2017) xgboostExplainer.
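A minimal sketch of how XGBoost is typically invoked from Python (one of the interfaces listed above). The data are synthetic placeholders and the parameter values are illustrative, not tuned; the `lambda` entry corresponds to the leaf-weight regularization term mentioned in the slide.

```python
# Hedged XGBoost sketch on a synthetic binary target.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 0).astype(int)
X_trn, X_val, y_trn, y_val = X[:4000], X[4000:], y[:4000], y[4000:]

dtrn = xgb.DMatrix(X_trn, label=y_trn)
dval = xgb.DMatrix(X_val, label=y_val)
params = {
    "objective": "binary:logistic",
    "eta": 0.1,            # shrinkage (learning rate)
    "max_depth": 3,        # depth of each weak learner
    "lambda": 1.0,         # L2 penalty on leaf weights (regularization term)
    "eval_metric": "auc",
}
booster = xgb.train(params, dtrn, num_boost_round=500,
                    evals=[(dval, "val")], early_stopping_rounds=25)
p_val = booster.predict(dval)   # posterior probabilities on validation
```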
  • 124. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 124 5/9/2018
  • 125. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 125 5/9/2018 Comments on GB.
1) It is not immediately apparent what the weak classifier should be for GB (e.g., by varying depth in our case). Likewise, the number of iterations is a big issue. In our simple example, M6 GB was the best performer. Still, overall modeling benefited from ensembling all methods, as measured by AUROC, cumulative lift, or the ensemble p-values.
2) The posterior probability ranges are vastly different, and thus classifying observations at the 0.5 threshold is too simplistic (see the sketch after this list).
3) The PDPs show that different methods find distinct multivariate structures. Interestingly, the ensemble p-values show a decreasing tendency for logistic and trees and a strong S-shaped tendency for M6 GB, which could mean that M6 GB alone tends to overshoot its predictions.
4) GB is relatively unaffected by the 50/50 mixture.
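As a small illustration of point 2 (assumed names, not the slides' code): summarize each model's posterior range and classify at a data-driven cut-off, such as the observed event rate, instead of a fixed 0.5.

```python
# Hedged sketch: compare posterior ranges across models and use the observed
# event rate as an alternative classification cut-off to 0.5.
import numpy as np

def summarize_and_classify(scores, y):
    """scores: dict of model name -> posterior probabilities; y: 0/1 target."""
    y = np.asarray(y)
    cutoff = y.mean()                      # prior-based alternative to 0.5
    for name, p in scores.items():
        p = np.asarray(p)
        flagged = (p >= cutoff).mean()
        print(f"{name}: range [{p.min():.3f}, {p.max():.3f}], "
              f"share flagged at cutoff {cutoff:.3f}: {flagged:.3f}")
```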
  • 126. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 126 5/9/2018 Comments on GB (continued).
5) While for GB classification problems predictions lie within [0, 1], for continuous-target problems predictions can fall outside the range of the target variable, which causes headaches. This is because GB models the residual at each iteration, not the original target; it can lead to surprises such as negative predictions when Y takes only non-negative values, unlike the original tree algorithm, whose leaf predictions are means of observed target values (see the sketch after this list).
6) The shrinkage parameter and early stopping (number of trees) act as regularizers, but their combined effect is not well understood and could be ineffective.
7) If shrinkage is too small and T is allowed to be large, the model is large and expensive to compute, implement and understand.
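A hedged sketch of point 5, using scikit-learn's gradient boosting on synthetic data: the predictions for a strictly positive target can drop below the smallest observed value (whether they actually do depends on the data and settings).

```python
# Hedged sketch: GB on a strictly positive continuous target; because the
# model accumulates residual corrections, predictions may fall below the
# observed minimum of y (or even below zero).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 3))
y = np.exp(2 * X[:, 0]) + rng.gamma(1.0, 0.2, size=2000)   # strictly positive

gb = GradientBoostingRegressor(learning_rate=0.1, n_estimators=500,
                               max_depth=3, random_state=0).fit(X, y)
preds = gb.predict(rng.uniform(0, 1, size=(2000, 3)))
print("min observed y   :", round(y.min(), 3))
print("min GB prediction:", round(preds.min(), 3))   # can be smaller than y.min()
```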
  • 127. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 127 5/9/2018 Drawbacks of GB.
1) IT IS NOT MAGIC: it will not solve ALL modeling needs, but it is the best off-the-shelf tool. You still need to look for transformations, odd issues, missing values, etc.
2) As with all tree methods, categorical variables with many levels (e.g., zip codes) can make it impossible to obtain a model.
3) Memory requirements can be very large, especially with many iterations, a typical problem of ensemble methods.
4) A large number of iterations slows prediction, so on-line scoring may require a trade-off between complexity and the time available. Once GB is learned, parallelization certainly helps.
5) No simple algorithm to capture interactions, because of the base learners.
6) No simple rules to determine gamma, the number of iterations or the depth of the simple learner. Need to try different combinations and possibly recalibrate over time (see the tuning sketch after this list).
7) Still, one of the most powerful methods available.
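For point 6, a hedged tuning sketch (grid values are illustrative assumptions): the shrinkage (step size), number of iterations and weak-learner depth are usually chosen jointly by cross-validated search rather than by rule.

```python
# Hedged sketch: joint cross-validated search over shrinkage, number of
# iterations and weak-learner depth for gradient boosting.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
grid = {
    "learning_rate": [0.01, 0.05, 0.1],   # shrinkage
    "n_estimators": [100, 300, 500],      # boosting iterations
    "max_depth": [1, 2, 3],               # depth of each weak learner
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid,
                      scoring="roc_auc", cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```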
  • 128. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 128 5/9/2018 Un-reviewed topics:
- CatBoost
- DeepForest / gcForest
- Use of tree methods for continuous target variables
- …
  • 129. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 129 5/9/2018 2.11) References
Auslender, L. (1998): Alacart, poor man's classification trees. NESUG.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984): Classification and Regression Trees. Wadsworth.
Chen, T., Guestrin, C. (2016): XGBoost: A Scalable Tree Boosting System.
Chipman, H., George, E., McCulloch, R.: BART: Bayesian Additive Regression Trees. The Annals of Applied Statistics.
Foster, D. (2017): New R package that makes XGBoost interpretable. https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
Friedman, J. (2001): Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. doi:10.1214/aos/1013203451
Paluszynska, A. (2017): Structural mining and knowledge extraction from random forest with applications to The Cancer Genome Atlas project. https://www.google.com/url?q=https%3A%2F%2Frawgit.com%2FgeneticsMiNIng%2FBlackBoxOpener%2Fmaster%2FrandomForestExplainer_Master_thesis.pdf&sa=D&sntz=1&usg=AFQjCNHTJONZK24LioDeOB0KZnwLkn98fw and https://mi2datalab.github.io/randomForestExplainer/
Quinlan, J.R. (1993): C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
  • 130. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 130 5/9/2018 Earlier literature on combining methods:
Winkler, R.L., Makridakis, S. (1983): The combination of forecasts. Journal of the Royal Statistical Society, Series A, 146(2), 150–157.
Makridakis, S., Winkler, R.L. (1983): Averages of forecasts: some empirical results. Management Science, 29(9), 987–996.
Bates, J.M., Granger, C.W.J. (1969): The combination of forecasts. Operational Research Quarterly, 451–468.
  • 131. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 131 5/9/2018
  • 132. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 132 5/9/2018 Interview questions:
1) Can you explain in nontechnical language the idea of maximum likelihood estimation? Of SVM?
2) Contrast GB with RF.
3) In what way is over-fitting like a glove? Like an umbrella?
4) Would ensemble models always improve on individual models?
5) Would you select variables by way of tree methods to use in linear methods later on? Yes? No? Why?
6) In tree regression, final predictions are means. Could better predictions be obtained with a regression model instead? A logistic for a binary target? Discuss.
7) There are 9 coins, 8 of which are of equal weight, and there is one scale. How many steps until you identify the odd coin?
8) Why are manhole covers round?
9) You obtain 100% accuracy in validation of a classification model. Are you a genius? Yes, no, why?
10) If 85% of witnesses saw a blue car during the accident, and 15% saw a red car, what is the probability that the car is blue?
  • 133. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 133 5/9/2018 Counter-interview questions (you ask the interviewer):
1) How do you measure the height of a building with just a barometer? Give at least three answers.
2) Two players A and B take turns saying a positive integer from 1 to 9. The numbers are added, and whoever brings the total to 100 or above loses. Is there a strategy to never lose? (Aborting a game midway is acceptable, but give your reasoning.)
3) There are two jugs, one that holds 5 gallons, the other 3, and a nearby water fountain. How do you put exactly 4 gallons (a deviation of less than one ounce is fine) in the 5-gallon jug?
  • 134. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 Ch. 5-134 5/9/2018 for now