Ensemble models and
Gradient Boosting, part 1.
Leonardo Auslender
Independent Statistical Consultant
Leonardo ‘dot’ Auslender ‘at’
Gmail ‘dot’ com.
Copyright 2018.
Topics to cover:
1) Why more techniques? Bias-variance tradeoff.
2) Ensembles
1) Bagging – stacking
2) Random Forests
3) Gradient Boosting (GB)
4) Gradient-descent optimization method.
5) Innards of GB and example.
6) Overall Ensembles.
7) Partial Dependency Plots (PDP)
8) Case Studies: a. GB different parameters, b. raw data vs 50/50.
9) Xgboost
10) On the practice of Ensembles.
11)References.
1) Why more techniques? Bias-variance tradeoff.
(A broken clock is right twice a day: variance of estimation = 0, bias extremely high. A thermometer is accurate overall but reports higher/lower temperatures at night: unbiased, higher variance. Betting on the same horse always has zero variance, but is possibly extremely biased.)
Model error can be broken down into three components mathematically. Let f be the function to estimate and f-hat the empirically derived function.
(Quadrant illustration of the bias/variance analogy: bet on the right horse and win; bet on the wrong horse and lose; bet on many horses and win; bet on many horses and lose.)
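The three components referred to above are the standard squared-error decomposition (a sketch, assuming Y = f(X) + ε with noise variance σ²_ε):

```latex
\mathrm{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathrm{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathrm{E}\big[(\hat f(x) - \mathrm{E}[\hat f(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2_{\varepsilon}}_{\text{irreducible error}}
```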
Credit : Scott Fortmann-Roe (web)
Let X1, X2, X3, … be i.i.d. random variables with mean μ and variance σ². It is well known that E(X̄) = μ and Var(X̄) = σ²/n.
By just averaging estimates, we lower the variance while keeping the same bias.
Let us find methods to lower or stabilize the variance (at least) while keeping the bias low. And maybe also lower the bias.
And since this cannot be fully attained, we keep searching for more techniques.
⇒ Minimize a general objective function:

Obj(Θ) = L(Θ) + Ω(Θ),   where Θ = {w1, …, wp} is the set of model parameters;
L(Θ): loss function, minimized to reduce bias;
Ω(Θ): regularization, minimizes model complexity.
Some terminology for Model combinations.
Ensembles: general name
Prediction/forecast combination: focusing on just
outcomes
Model combination for parameters:
Bayesian parameter averaging
We focus on ensembles as Prediction/forecast
combinations.
Ensembles.
Bagging (bootstrap aggregation, Breiman, 1996): adding randomness improves function estimation. A variance-reduction technique, reducing MSE. Let the initial data size be n.
1) Construct a bootstrap sample by randomly drawing n times with replacement (note: some observations are repeated).
2) Compute the sample estimator (logistic or linear regression, tree, ANN, …; trees in practice).
3) Redo B times, B large (50 to 100 or more in practice, but the right B is unknown).
4) Bagged estimator: for classification, Breiman recommends a majority vote of the classifications for each observation; Buhlmann (2003) recommends averaging the bootstrapped probabilities. Note that an individual obs may not appear in every bootstrap sample.
NB: Independent sequence of trees. What if …….?
Reduces prediction error by lowering variance of aggregated predictor
while maintaining bias almost constant (variance/bias trade-off).
Friedman (1998) reconsidered boosting and bagging in terms of
gradient descent algorithms, seen later on.
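A minimal sketch of steps 1) to 4) above (scikit-learn trees assumed; function names such as bagged_trees are illustrative, and a binary 0/1 target is assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, random_state=0):
    """Steps 1-3: draw B bootstrap samples of size n and fit one tree per sample."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # n draws with replacement; some rows repeat
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Step 4: average the bootstrapped probabilities (Buhlmann) and classify at 0.5."""
    p = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)   # binary 0/1 target assumed
    return (p >= 0.5).astype(int), p
```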
From http://dni-institute.in/blogs/bagging-algorithm-concepts-with-example/
Ensembles
Evaluation:
Empirical studies: boosting (seen later) yields smaller misclassification rates than bagging, reducing both bias and variance. There are different boosting algorithms (Breiman's arc-x4 and arc-gv). In cases with substantial noise, bagging performs better. Especially used in clinical studies.
Why does Bagging work?
Breiman: bagging successful because reduces instability of
prediction method. Unstable: small perturbations in data  large
changes in predictor. Experimental results show variance
reduction. Studies suggest that bagging performs some
smoothing on the estimates. Grandvalet (2004) argues that
bootstrap sampling equalizes effects of highly influential
observations.
Disadvantage: cannot be visualized easily.
Ensembles
Adaptive Bagging (Breiman, 2001): Mixes bias-reduction boosting
with variance-reduction bagging. Uses out-of-bag obs to halt
optimizer.
Stacking:
Previously, same technique used throughout. Stacking (Wolpert 1992)
combines different algorithms on single data set. Voting is then
used for final classification. Ting and Witten (1999) “stack” the
probability distributions (PD) instead.
Stacking is “meta-classifier”: combines methods.
Pros: takes the best from many methods. Cons: uninterpretable; the mixture of methods becomes a black box of predictions.
Stacking very prevalent in WEKA.
5.3) Tree World.
5.3.1) L. Breiman: Bagging.
5.3.2) L. Breiman: Random Forests.
Explanation by way of football example for The Saints.
https://gormanalysis.com/random-forest-from-top-to-bottom/
   Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin  SaintsWon
1  Falcons      28          TRUE            TRUE            TRUE       TRUE
2  Cowgirls     16          TRUE            TRUE            TRUE       TRUE
3  Eagles       30         FALSE           FALSE            TRUE       TRUE
4  Bucs          6          TRUE           FALSE            TRUE      FALSE
5  Bucs         14          TRUE           FALSE           FALSE      FALSE
6  Panthers      9         FALSE            TRUE            TRUE      FALSE
7  Panthers     18         FALSE           FALSE           FALSE      FALSE
Goal: predict when the Saints will win. 5 predictors: Opponent, opponent rank (OppRk), home game, and expert1 and expert2 predictions. If we run a single tree, we get just one split, on Opponent, because the Saints lost to the Bucs and Panthers (perfect separation, but useless for future opponents). Instead, at each step randomly select a subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which, when combined, should be a smart model.
3 examples (tree diagrams), each grown from a different random feature subset; splits shown include OppRk <= 15 ((<=) left, (>) right), Opponent in {Cowgirls, Eagles, Falcons}, Expert2 pred (F = left, T = right), and OppRk <= 12.5 ((<=) left, (>) right).
Assume following test data and predictions:
Test data:
   Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin
1  Falcons       1          TRUE            TRUE            TRUE
2  Falcons      32          TRUE            TRUE           FALSE
3  Falcons      32          TRUE           FALSE            TRUE

Predictions:
          Tree1  Tree2  Tree3  MajorityVote
Sample1   FALSE  FALSE   TRUE         FALSE
Sample2    TRUE  FALSE   TRUE          TRUE
Sample3    TRUE   TRUE   TRUE          TRUE
Note that a probability can be ascribed by counting the # of votes for each predicted target class, which yields a good ranking of probabilities for the different classes. But a problem: if "OppRk" (the 2nd best predictor) is in the initial group of 3 together with "Opponent", it won't be used as a splitter because "Opponent" is perfect. Note that there are 10 ways to choose 3 out of 5, and each predictor appears in 6 of them →
"Opponent" dominates 60% of the trees, while OppRk appears without "Opponent" in just 30% of the subsets. This effect could be mitigated by also sampling the training obs used to develop each tree, giving OppRk a higher chance to be the root (not shown).
Further, assume that Expert2 gives perfect predictions when the Saints lose (not when they win). Right now Expert2 as a predictor is lost, but if rows are also resampled with replacement, there is a higher chance of using Expert2 as a predictor because more losses might appear in a given sample.
Summary. Data with N rows and p predictors:
1) Determine the # of trees to grow.
2) For each tree: randomly sample n <= N rows with replacement; create the tree with m <= p predictors selected randomly at each non-final node.
3) Combine the different tree predictions by majority voting (classification trees) or averaging (regression trees). Note that voting can be replaced by an average of probabilities, and averaging by medians.
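A minimal from-scratch sketch of this summary (scikit-learn trees assumed; max_features restricts the candidate splitters at each node, and labels are assumed to be 0/1 for the vote):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, max_features="sqrt", random_state=0):
    """For each tree: bootstrap n rows; let the tree pick from a random subset of
    m <= p predictors at every node (max_features does exactly that)."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                     # rows sampled with replacement
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=int(rng.integers(1_000_000)))
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def random_forest_predict(forest, X):
    """Majority vote over trees (labels assumed to be 0/1)."""
    votes = np.mean([t.predict(X) for t in forest], axis=0)
    return (votes >= 0.5).astype(int)
```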
Definition of Random Forests.
Decision Tree Forest: ensemble (collection) of decision trees whose
predictions are combined to make overall prediction for the forest.
Similar to TreeBoost (Gradient boosting) model because large number
of trees are grown. However, TreeBoost generates series of trees with
output of one tree going into next tree in series. In contrast, decision
tree forest grows number of independent trees in parallel, and they do
not interact until after all of them have been built.
Disadvantage: complex model, cannot be visualized like single tree.
More “black box” like neural network  advisable to create both single-
tree and tree forest model.
Single-tree model can be studied to get intuitive understanding of how
predictor variables relate, and decision tree forest model can be used
to score data and generate highly accurate predictions.
Random Forests
1. Random sample of N observations with replacement (“bagging”).
On average, about 2/3 of rows selected. Remaining 1/3 called “out
of bag (OOB)” obs. New random selection is performed for each
tree constructed.
2. Using obs selected in step 1, construct decision tree. Build tree to
maximum size, without pruning. As tree is built, allow only subset of
total set of predictor variables to be considered as possible splitters
for each node. Select set of predictors to be considered as random
subset of total set of available predictors.
For example, if there are ten predictors, choose five randomly as
candidate splitters. Perform new random selection for each split. Some
predictors (possibly best one) will not be considered for each split, but
predictor excluded from one split may be used for another split in same
tree.
Random Forests
No Overfitting or Pruning.
"Over-fitting“: problem in large, single-tree models where model fits
noise in data  poor generalization power  pruning. In nearly all
cases, decision tree forests do not have problem with over-fitting, and no
need to prune trees in forest. Generally, more trees in forest, better fit.
Internal Measure of Test Set (Generalization) Error.
About 1/3 of observations excluded from each tree in forest, called “out
of bag (OOB)”: each tree has different set of out-of-bag observations 
each OOB set constitutes independent test sample.
To measure generalization error of decision tree forest, OOB set for each
tree is run through tree and error rate of prediction is computed.
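A short sketch of this internal error estimate, assuming scikit-learn's RandomForestClassifier (its oob_score option aggregates, for each observation, the predictions of the trees that did not see it):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)   # stand-in data
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)   # generalization estimate, no held-out set needed
```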
Detour: Found in the Internet: PCA and RF.
https://stats.stackexchange.com/questions/294791/how-can-preprocessing-with-pca-but-keeping-the-same-dimensionality-improve-rando?newsletter=1&nlcode=348729%7c8657
Discovery?
“PCA before random forest can be useful not for dimensionality reduction but to give you data
a shape where random forest can perform better.
I am quite sure that in general if you transform your data with PCA keeping the same
dimensionality of the original data you will have a better classification with random forest.”
Answer:
“Random forest struggles when the decision boundary is "diagonal" in the feature space
because RF has to approximate that diagonal with lots of "rectangular" splits. To the extent that
PCA re-orients the data so that splits perpendicular to the rotated & rescaled axes align well
with the decision boundary, PCA will help. But there's no reason to believe that PCA will help in
general, because not all decision boundaries are improved when rotated (e.g. a circle). And
even if you do have a diagonal decision boundary, or a boundary that would be easier to find in
a rotated space, applying PCA will only find that rotation by coincidence, because PCA has no
knowledge at all about the classification component of the task (it is not "y-aware").
Also, the following caveat applies to all projects using PCA for supervised learning: data rotated by
PCA may have little-to-no relevance to the classification objective.”
DO NOT BELIEVE EVERYTHING THAT APPEARS IN THE WEB!!!!! BE CRITICAL!!!
Further Developments.
Paluszynska (2017) focuses on providing better information
on variable importance using RF.
RF is constantly being researched and improved.
Detour: Underlying idea for boosting classification models (NOT yet GB).
(Freund, Schapire, 2012, Boosting: Foundations and Algorithms, MIT)
Start with model M(X) and obtain 80% accuracy, or 60% R2, etc.
Then Y = M(X) + error1. Hypothesize that the error is still correlated with Y (it still contains signal).
→ error1 = G(X) + error2, where we now model error1, or in general error(t−1) = Z(X) + error(t) →
Y = M(X) + G(X) + … + Z(X) + error(t−k). If we find optimal beta weights to combine the models, then
Y = b1·M(X) + b2·G(X) + … + bt·Z(X) + error(t−k).
Boosting is “Forward Stagewise Ensemble method” with single data set,
iteratively reweighting observations according to previous error, especially focusing on
wrongly classified observations.
Philosophy: Focus on most difficult points to classify in previous step by
reweighting observations.
Main idea of GB using trees (GBDT).
Let Y be the target and X the predictors, and let f0(X) be a weak model to predict Y that just predicts the mean value of Y ('weak' to avoid over-fitting).
Improve on f0(X) by creating f1(X) = f0(X) + h(x). If h were a perfect model → f1(X) = y → h(x) = y − f0(X) = residuals = negative gradients of the loss (or cost) function.
(Figure: residual fitting; the gradient of the loss is −(y − f(x)) for squared error, and −1 or 1 for absolute error.)
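In symbols, for squared-error loss (a standard identity, stated here for clarity):

```latex
L(y, f(x)) = \tfrac{1}{2}\,\big(y - f(x)\big)^2, \qquad
\frac{\partial L}{\partial f(x)} = -\big(y - f(x)\big)
\;\;\Rightarrow\;\;
-\frac{\partial L}{\partial f(x)} = y - f(x) = \text{residual}.
```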
Explanation of GB by way of an example.
/blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
Predict age in the following data set by way of trees; continuous target → regression tree. Loss function: SSE.
PersonID  Age  LikesGardening  PlaysVideoGames  LikesHats
       1   13           FALSE             TRUE       TRUE
       2   14           FALSE             TRUE      FALSE
       3   15           FALSE             TRUE      FALSE
       4   25            TRUE             TRUE       TRUE
       5   35           FALSE             TRUE       TRUE
       6   49            TRUE            FALSE      FALSE
       7   68            TRUE             TRUE       TRUE
       8   71            TRUE            FALSE      FALSE
       9   73            TRUE            FALSE       TRUE
Only 9 obs in the data, thus we allow the tree to have a very small # of obs in its final nodes. We want the Videos variable in because we suspect it is important. But doing so (by allowing few obs in final nodes) also brought in a split on "hats", which seems irrelevant and just noise leading to over-fitting, because the tree searches in smaller and smaller areas of the data as it progresses.
Let's go in steps and look at the results of Tree1 stopping at the first split (before the second splits), where the predictions are 19.25 and 57.2, and obtain the residuals.
(Tree 1 diagram: root split on LikesGardening; F: 19.25, T: 57.2; further splits on Hats and Videos.)
Run another tree using Tree1 residuals as new target.
PersonID  Age  Tree1 Prediction  Tree1 Residual
       1   13             19.25           -6.25
       2   14             19.25           -5.25
       3   15             19.25           -4.25
       4   25             57.2           -32.2
       5   35             19.25           15.75
       6   49             57.2            -8.2
       7   68             57.2            10.8
       8   71             57.2            13.8
       9   73             57.2            15.8
(Tree 2 diagram: root split on PlaysVideoGames; F: 7.133, T: −3.567.)
Note: Tree2 did not use "LikesHats" because, between Hats and VideoGames, VideoGames is preferred when using all obs, whereas the full Tree1 compared them only in a smaller region of the data where Hats happened to split. Thus noise is avoided.
Tree 1 SSE = 1994 Tree 2 SSE = 1765
PersonID  Age  Tree1 Pred  Tree1 Residual  Tree2 Pred  Combined Pred  Final Residual
       1   13       19.25           -6.25      -3.567          15.68           2.683
       2   14       19.25           -5.25      -3.567          15.68           1.683
       3   15       19.25           -4.25      -3.567          15.68          0.6833
       4   25       57.2           -32.2       -3.567          53.63           28.63
       5   35       19.25           15.75      -3.567          15.68          -19.32
       6   49       57.2            -8.2        7.133          64.33           15.33
       7   68       57.2            10.8       -3.567          53.63          -14.37
       8   71       57.2            13.8        7.133          64.33          -6.667
       9   73       57.2            15.8        7.133          64.33          -8.667

Combined prediction for PersonID 1: 15.68 = 19.25 − 3.567.
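A small sketch reproducing the two stages above with depth-1 regression trees (scikit-learn assumed; the printed values should match the Combined Pred column up to rounding):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

age = np.array([13, 14, 15, 25, 35, 49, 68, 71, 73], dtype=float)
X = np.array([  # columns: LikesGardening, PlaysVideoGames, LikesHats (TRUE = 1)
    [0, 1, 1], [0, 1, 0], [0, 1, 0], [1, 1, 1], [0, 1, 1],
    [1, 0, 0], [1, 1, 1], [1, 0, 0], [1, 0, 1]], dtype=float)

tree1 = DecisionTreeRegressor(max_depth=1).fit(X, age)      # stump: splits on LikesGardening
resid1 = age - tree1.predict(X)                              # Tree1 residuals = new target
tree2 = DecisionTreeRegressor(max_depth=1).fit(X, resid1)   # stump: splits on PlaysVideoGames
combined = tree1.predict(X) + tree2.predict(X)               # e.g. 19.25 + (-3.567) = 15.68
print(np.round(combined, 2))
```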
So far:
1) Started with a 'weak' model F0(x) = mean(y).
2) Fitted a second model h to the residuals y − F0(x).
3) Combined the two previous models: F1(x) = F0(x) + h(x).
Notice that h(x) could be any type of model (stacking), not just trees. And continue recursing until M.
The initial weak model was the "mean" because it is well known that the mean minimizes SSE.
Q: how to choose M, the gradient-boosting hyper-parameter? Usually cross-validation.
4) Alternative to the mean: minimize absolute error instead of SSE as the loss function. More expensive because the minimizer is the median, which is computationally costlier. In this case, in Tree 1 above, use median(y) = 35 and obtain the residuals.
PersonID Age F0 Residual0
1 13 35 -22
2 14 35 -21
3 15 35 -20
4 25 35 -10
5 35 35 0
6 49 35 14
7 68 35 33
8 71 35 36
9 73 35 38
Focus on observations 1 and 4, with residuals of −22 and −10 respectively, to understand the median case. Under the SSE loss function (standard regression tree), a reduction in the residual of 1 unit drops SSE by 43 and 19 resp. (e.g., 22·22 − 21·21 = 43, 100 − 81 = 19), while for absolute loss the reduction is just 1 and 1 (22 − 21, 10 − 9) →
SSE reduction will focus more on the first observation (because of the 43), while absolute error focuses on all obs equally (they are all 1) →
Instead of training subsequent trees on the residuals of F0, train h0 on the gradient of the loss function L(y, F0(x)) w.r.t. the y-hats produced by F0(x). With absolute-error loss, subsequent h trees will consider the sign of every residual, as opposed to SSE loss, which considers the magnitude of the residual.
Gradient of SSE = −(y − ŷ), which is "− residual" → this is a gradient-descent algorithm. For absolute error, the gradient is given below.
Each h tree groups observations into final nodes, and average gradient can be calculated in
each and scaled by factor γ, such that Fm + γm hm minimizes loss function in each node.
Shrinkage: For each gradient step, magnitude is multiplied by factor that ranges between 0 and 1 and
called learning rate  each gradient step is shrunken allowing for slow convergence toward observed
values  observations close to target values end up grouped into larger nodes, thus regularizing the
method.
Finally before each new tree step, row and column sampling occur to produce more different
tree splits (similar to Random Forests).
Absolute Error (AE) = |Y − Ŷ|.
Gradient of AE = dAE/dŶ = +1 if Ŷ > Y, −1 if Ŷ < Y (i.e., −sign(Y − Ŷ)).
Results for SSE and Absolute Error: SSE case
Age    F0     PseudoResidual0  h0      gamma0  F1     PseudoResidual1  h1      gamma1  F2
13     40.33           -27.33  -21.08       1  19.25            -6.25  -3.567       1  15.68
14     40.33           -26.33  -21.08       1  19.25            -5.25  -3.567       1  15.68
15     40.33           -25.33  -21.08       1  19.25            -4.25  -3.567       1  15.68
25     40.33           -15.33   16.87       1  57.2            -32.2   -3.567       1  53.63
35     40.33           -5.333  -21.08       1  19.25            15.75  -3.567       1  15.68
49     40.33            8.667   16.87       1  57.2             -8.2    7.133       1  64.33
68     40.33            27.67   16.87       1  57.2             10.8   -3.567       1  53.63
71     40.33            30.67   16.87       1  57.2             13.8    7.133       1  64.33
73     40.33            32.67   16.87       1  57.2             15.8    7.133       1  64.33

(h0 diagram: root split on Gardening; F: −21.08, T: 16.87. h1 diagram: root split on Videos; F: 7.133, T: −3.567.)
E.g., for the first observation: 40.33 is the mean age; −27.33 = 13 − 40.33; −21.08 is the prediction for Gardening = F. F1 = 19.25 = 40.33 − 21.08. PseudoResidual1 = 13 − 19.25. F2 = 19.25 − 3.567 = 15.68.
Gamma0 = avg(pseudoresidual0 / h0) (by the different values of h0); same for gamma1.
Results for SSE and Absolute Error: Absolute Error case.
(h0 diagram: root split on Gardening; F: −1, T: 0.6. h1 diagram: root split on Videos; F: 0.333, T: −0.333.)

Age    F0   PseudoResidual0  h0    gamma0  F1    PseudoResidual1  h1       gamma1  F2
13     35                -1  -1      20.5  14.5               -1  -0.3333    0.75  14.25
14     35                -1  -1      20.5  14.5               -1  -0.3333    0.75  14.25
15     35                -1  -1      20.5  14.5                1  -0.3333    0.75  14.25
25     35                -1   0.6      55  68                 -1  -0.3333    0.75  67.75
35     35                -1  -1      20.5  14.5                1  -0.3333    0.75  14.25
49     35                 1   0.6      55  68                 -1   0.3333       9  71
68     35                 1   0.6      55  68                 -1  -0.3333    0.75  67.75
71     35                 1   0.6      55  68                  1   0.3333       9  71
73     35                 1   0.6      55  68                  1   0.3333       9  71
E.g., for the 1st observation: 35 is the median age; pseudo-residual = −1 (it is +1 if the residual > 0, −1 if < 0).
F1 = 14.5 because 35 + 20.5 · (−1).
F2 = 14.25 = 14.5 + 0.75 · (−0.3333).
Predictions within leaf nodes are computed as the "mean" of the obs therein.
Gamma0 = median((age − F0) / h0) by node: for the h0 = −1 node, avg((14 − 35)/−1, (15 − 35)/−1) = 20.5; for the h0 = 0.6 node, 55 = (68 − 35)/0.6.
Gamma1 = median((age − F1) / h1) by the different values of h1 (and of h0 for gamma0).
Quick description of GB using trees (GBDT).
1) Create a very small tree as the initial model, a 'weak' learner (e.g., a tree with two terminal nodes → depth = 1). 'Weak' avoids over-fitting and local minima; it produces a prediction, F1, for each obs. Tree1.
2) Each tree allocates a probability of event or a mean value in each terminal node, according to the nature of the dependent variable or target.
3) Compute "residuals" (prediction error) for every observation (if 0-1 target, apply the logit transformation, log(p / (1 − p)), to linearize them).
4) Use the residuals as the new target variable and grow a second small tree on them (second stage of the process, same depth). To ensure against over-fitting, use a random sample without replacement ("stochastic gradient boosting"). Tree2.
5) New model: once the second stage is complete, we obtain the concatenation of the two trees, Tree1 and Tree2, with predictions F1 + F2 · gamma, where gamma is a multiplier or shrinkage factor (called step size in gradient descent).
6) Iterate procedure of computing residuals from most recent tree, which become the target of
the new model, etc.
7) In the case of a binary target variable, each tree produces at least some nodes in which
‘event’ is majority (‘events’ are typically more difficult to identify since most data sets
contain very low proportion of ‘events’ in usual case).
8) Final score for each observation is obtained by summing (with weights) the different scores
(probabilities) of every tree for each observation.
Why does it work? Why “gradient” and “boosting”?
Comparing GBDT vs Trees in point 4 above (I).
GBDT takes a sample from the training data to create a tree at each iteration; CART does not. Below, notice the differences between a 60% sample proportion for GBDT and no sampling for the generic tree on the fraud data set; Total_spend is the target. Predictions are similar.
IF doctor_visits < 8.5 THEN DO; /* GBDT */
_prediction_ + -1208.458663;
END;
ELSE DO;
_prediction_ + 1360.7910083;
END;
IF 8.5 <= doctor_visits THEN DO; /* GENERIC TREES*/
P_pseudo_res0 = 1378.74081896893;
END;
ELSE DO;
P_pseudo_res0 = -1290.94575707227;
END;
Comparing GBDT vs Trees in point 4 above (II).
Again, GBDT takes sample from training data to create tree at each
iteration, CART does not. If we allow for CART to work with same
proportion sample but different seed, splitting variables may be different at
specific depth of tree creation.
/* GBDT */
IF doctor_visits < 8.5 THEN DO;
   _ARB_F_ + -579.8214325;
END;
ELSE DO;
   _ARB_F_ + 701.49142697;
END;

/* ORIGINAL TREES */
IF 183.5 <= member_duration THEN DO;
   P_pseudo_res0 = 1677.87318718526;
END;
ELSE DO;
   P_pseudo_res0 = -1165.32773940565;
END;

Note: EDA of the two samples would indicate subtle differences that induce differences in the selected splitting variables.
More Details
Friedman’s general 2001 GB algorithm:
1) Data (Y, X), Y (N, 1), X (N, p)
2) Choose # iterations M
3) Choose a loss function L(Y, F(x)) and its corresponding gradient (e.g., a 0-1 loss function; the residuals are the corresponding gradient). The estimated function is called 'f'. The loss function is implied by the nature of Y.
4) Choose base learner h( X, θ), say shallow trees.
Algorithm:
1: initialize f0 with a constant, usually mean of Y.
2: for t = 1 to M do
3: compute negative gradient gt(x), i.e., residual from Y as next target.
4: fit a new base-learner function h(x, θt), i.e., tree.
5: find the best gradient-descent step-size γ_t, minimizing the loss:
   γ_t = argmin_γ Σ_{i=1..n} L(y_i, f_{t−1}(x_i) + γ h_t(x_i)),   0 < γ ≤ 1
6: update the function estimate:
   f_t = f_{t−1}(x) + γ_t h_t(x, θ_t)
end for
(all f functions are function estimates, i.e., 'hats').
Specifics of Tree Gradient Boosting, called TreeBoost (Friedman).
Friedman’s 2001 GB algorithm for tree methods:
Same as the previous one, with base learner

h_t(x) = Σ_{j=1}^{J} p_jt I(x ∈ N_jt),   p_jt = prediction of tree t in final node N_jt.

In TreeBoost, Friedman proposes to find an optimal γ_jt in each final node instead of a unique γ at every iteration. Then

f_t(x) = f_{t−1}(x) + Σ_{j=1}^{J} γ_jt h_t(x) I(x ∈ N_jt),
γ_jt = argmin_γ Σ_{x_i ∈ N_jt} L(y_i, f_{t−1}(x_i) + γ h_t(x_i)).
Parallels with Stepwise (regression) methods.
Stepwise starts from original Y and X, and in later iterations
turns to residuals, and reduced and orthogonalized X matrix,
where ‘entered’ predictors are no longer used and
orthogonalized away from other predictors.
GBDT uses residuals as targets, but does not orthogonalize or
drop any predictors.
Stepwise stops either by statistical inference, or AIC/BIC
search. GBDT has a fixed number of iterations.
Stepwise has no ‘gamma’ (shrinkage factor).
Setting.
Hypothesize the existence of a function Y = f(X, betas, error). Change of paradigm: no MLE (e.g., logistic, regression, etc.) but a loss function. Minimize the loss function itself; its expected value is called the risk. Many different loss functions are available: Gaussian, 0-1, etc.
A loss function describes the loss (or cost) associated with all possible
decisions. Different decision functions or predictor functions will tend
to lead to different types of mistakes. The loss function tells us which
type of mistakes we should be more concerned about.
For instance, estimating demand, decision function could be linear equation
and loss function could be squared or absolute error.
The best decision function is the function that yields the lowest expected
loss, and the expected loss function is itself called risk of an estimator. 0-1
assigns 0 for correct prediction, 1 for incorrect.
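In symbols (standard definitions, consistent with the text above):

```latex
R(f) = \mathrm{E}_{X,Y}\big[L(Y, f(X))\big],
\qquad
L_{\mathrm{sq}}(y,\hat y) = (y-\hat y)^2,
\qquad
L_{0\text{-}1}(y,\hat y) = \mathbf{1}\{y \neq \hat y\}.
```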
Key Details.
Friedman’s 2001 GB algorithm: Need
1) Loss function (usually determined by nature of Y (binary,
continuous…)) (NO MLE).
2) Weak learner, typically tree stump or spline, marginally better
classifier than random (but by how much?).
3) Model with T iterations:

   ŷ_i = Σ_{t=1}^{T} tree_t(X_i)

   Objective function:  Σ_{i=1}^{n} L(y_i, ŷ_i) + Σ_{k=1}^{T} Ω(Tree_k)

   Ω = { # nodes in each tree; L2 or L1 norm of leaf weights; other }. This objective function is not directly optimized by GB.
L2 error penalizes symmetrically away from 0; Huber penalizes less than OLS outside [−1, 1]; Bernoulli and Adaboost are very similar. Note that Y ∈ {−1, 1} in the 0-1 case here.
Gradient Descent.
“Gradient” descent method to find minimum of function.
Gradient: multivariate generalization of derivative of function in one
dimension to many dimensions. I.e., gradient is vector of partial
derivatives. In one dimension, gradient is tangent to function.
Easier to work with convex and “smooth” functions.
(Figures: a convex function and a non-convex function.)
Gradient Descent.
Let L (x1, x2) = 0.5 * (x1 – 15) **2 + 0.5 * (x2 – 25) ** 2, and solve for X1 and X2 that min L by gradient
descent.
Steps:
Take M = 100. Starting point s0 = (0, 0) Step size = 0.1
Iterate m = 1 to M:
1. Calculate the gradient of L at s_{m−1}.
2. Step in the direction of greatest descent (the negative gradient) with step size γ, i.e., s_m = s_{m−1} − γ ∇L(s_{m−1}).
If γ is small and M is large, s_M minimizes L.
Additional considerations:
Instead of M iterations, stop when next improvement small.
Use line search to choose step sizes (Line search chooses search in descent direction of
minimization).
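A minimal sketch of these steps for the L above (γ = 0.1, M = 100; the iterate approaches the minimizer (15, 25)):

```python
import numpy as np

def L(s):                            # L(x1, x2) = 0.5*(x1 - 15)**2 + 0.5*(x2 - 25)**2
    return 0.5 * (s[0] - 15) ** 2 + 0.5 * (s[1] - 25) ** 2

def grad_L(s):                       # gradient: (x1 - 15, x2 - 25)
    return np.array([s[0] - 15.0, s[1] - 25.0])

s = np.array([0.0, 0.0])             # starting point s0
gamma, M = 0.1, 100                  # step size and number of iterations
for _ in range(M):
    s = s - gamma * grad_L(s)        # step along the negative gradient
print(np.round(s, 3), round(float(L(s)), 6))   # close to (15, 25), loss close to 0
```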
How does it work in gradient boosting?
Objective is Min L, starting from F0(x). For m = 1, compute gradient of L w.r.t F0(x). Then fit weak learner
to gradient components  for regression tree, obtain average gradient in each final node. In each node,
step in direction of avg. gradient using line search to determine step magnitude. Outcome is F1, and
repeat. In symbols:
Initialize the model with a constant: F0(x) = mean, median, etc.
For m = 1 to M:
  compute the pseudo-residuals;
  fit a base learner h_m to the pseudo-residuals;
  compute the step magnitude gamma_m (for trees, a different gamma for each node);
  update F_m(x) = F_{m−1}(x) + γ_m h_m(x).
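A compact from-scratch sketch of this loop for squared-error loss (so the pseudo-residuals are plain residuals; scikit-learn stumps assumed, and a learning rate plays the role of the shrunken γ):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, M=100, learning_rate=0.1, max_depth=1):
    """F0 = mean; then M small trees, each fit to the current residuals (squared-error loss)."""
    F0 = float(np.mean(y))
    pred = np.full(len(y), F0)
    trees = []
    for _ in range(M):
        residuals = y - pred                          # pseudo-residuals = negative gradient
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + learning_rate * h.predict(X)    # shrunken gradient step (gamma)
        trees.append(h)
    return F0, trees

def gb_predict(F0, trees, X, learning_rate=0.1):
    return F0 + learning_rate * np.sum([h.predict(X) for h in trees], axis=0)
```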
“Gradient” descent
Method of gradient descent is a first order optimization algorithm that is based on taking
small steps in direction of the negative gradient at one point in the curve in order to find
the (hopefully global) minimum value (of loss function). If it is desired to search for the
maximum value instead, then the positive gradient is used and the method is then called
gradient ascent.
Second order not searched, solution could be local minimum.
Requires starting point, possibly many to avoid local minima.
Comparing the full tree (depth = 6) to boosted-tree residuals by iteration.
2 GB versions: 1) with raw 20% events (M1); 2) with a 50/50 mixture of events (M2). The non-GB tree (referred to as maxdepth 6, on the M1 data set) is the most biased. Notice that M2 stabilizes earlier than M1. X axis: iteration #. Y axis: average residual. "Tree Depth 6" is obviously unaffected by iteration since it is a single-tree run.
(Figure: average residuals by iteration by model name in gradient boosting; series MEAN_RESID_M1_TRN_TREES and MEAN_RESID_M2_TRN_TREES; all mean residuals are on the order of 1E-15, e.g., Tree depth 6 = 2.83E-15; vertical line marks where the mean stabilizes.)
Comparing the full tree (depth = 6) to boosted-tree residuals by iteration, continued.
Now Y = variance of residuals. M2 has the highest variance, followed by Depth 6 (single tree) and then M1. M2 stabilizes earlier as well. In conclusion, M2 has lower bias and higher variance than M1 in this example; the difference lies in the 0-1 mixture of the target variable.
(Figure: variance of residuals by iteration in gradient boosting; series VAR_RESID_M1_TRN_TREES ≈ 0.1219, VAR_RESID_M2_TRN_TREES ≈ 0.1781, Depth 6 = 0.1458; vertical line marks where the variance stabilizes.)
Important Message: basic information on the original data sets:

Data set name .................. train
# TRN obs ...................... 3595
Validation data set ............ validata
# VAL obs ...................... 2365
Test data set ..................
# TST obs ...................... 0
Dep variable ................... fraud
Pct Event Prior TRN ............ 20.389
Pct Event Prior VAL ............ 19.281
Pct Event Prior TEST ...........
TRN and VAL data sets obtained by random sampling without replacement.
Variable          Label
FRAUD             Fraudulent Activity yes/no
total_spend       Total spent on opticals
doctor_visits     Total visits to a doctor
no_claims         No of claims made recently
member_duration   Membership duration
optom_presc       Number of opticals claimed
num_members       Number of members covered
Fraud data set, original 20% fraudsters.
Study alternatives of changing number of iterations from 3
to 50 and depth from 1 to 10 with training and validation
data sets.
Original percentage of fraudsters: 20% in both data sets.
Notice just 5 predictors, thus a maximum of 50 iterations is an exaggeration. In usual large databases, the number of iterations could reach 1000 or higher.
E.g., M5_VAL_GRAD_BOOSTING: M5 case with validation data set and using
gradient boosting as modeling technique. Model # 10 as identifier.
Requested Models: Names & Descriptions.

Overall model settings:
M1: Raw 20pct, depth 1, iterations 3
M2: Raw 20pct, depth 1, iterations 10
M3: Raw 20pct, depth 5, iterations 3
M4: Raw 20pct, depth 5, iterations 10
M5: Raw 20pct, depth 10, iterations 50

Model #  Full Model Name           Model Description
1        01_M1_TRN_GRAD_BOOSTING   Gradient Boosting
2        02_M1_VAL_GRAD_BOOSTING   Gradient Boosting
3        03_M2_TRN_GRAD_BOOSTING   Gradient Boosting
4        04_M2_VAL_GRAD_BOOSTING   Gradient Boosting
5        05_M3_TRN_GRAD_BOOSTING   Gradient Boosting
6        06_M3_VAL_GRAD_BOOSTING   Gradient Boosting
7        07_M4_TRN_GRAD_BOOSTING   Gradient Boosting
8        08_M4_VAL_GRAD_BOOSTING   Gradient Boosting
9        09_M5_TRN_GRAD_BOOSTING   Gradient Boosting
10       10_M5_VAL_GRAD_BOOSTING   Gradient Boosting
All agree on No_claims as First
split but at different values and
yield different event probs.
Note M2 split
Constrained GB parameters may create undesirable models, but parameters with high values may lead to running times that are too long, especially when models have to be re-touched.
Variable importance is model-dependent and could lead to misleading conclusions.
Goodness of Fit.
Probability range is largest for M5.
M5 best per AUROC, also when validated.
Specific GOFs, in rank order.
GOF ranks. GOF measures: AUROC, Avg Square Error, Cum Lift 3rd bin, Cum Resp Rate 3rd bin, Gini, R-square Cramer-Tjur; last column = unweighted mean of the ranks.

Training:
01_M1_TRN_GRAD_BOOSTING   5 5 5 5 5 5   5.00
03_M2_TRN_GRAD_BOOSTING   4 4 4 4 4 4   4.00
05_M3_TRN_GRAD_BOOSTING   3 3 3 3 3 3   3.00
07_M4_TRN_GRAD_BOOSTING   2 2 2 2 2 2   2.00
09_M5_TRN_GRAD_BOOSTING   1 1 1 1 1 1   1.00

Validation:
02_M1_VAL_GRAD_BOOSTING   5 5 5 5 5 5   5.00
04_M2_VAL_GRAD_BOOSTING   4 4 4 4 4 4   4.00
06_M3_VAL_GRAD_BOOSTING   3 3 3 3 3 3   3.00
08_M4_VAL_GRAD_BOOSTING   2 2 2 2 2 2   2.00
10_M5_VAL_GRAD_BOOSTING   1 1 1 1 1 1   1.00
M5 winner.
Huge jump in performance per the R-square measure.
Overall conclusion for GB parameters
While higher values of number of iterations and depth imply
longer (and possibly significant) computer runs,
constraining these parameters can have significant negative
effects on model results.
In context of thousands of predictors, computer resource
availability might significantly affect model results.
Overall Ensembles.
Given specific classification study and many different modeling techniques,
create logistic regression model with original target variable and the different
predictions from the different models, without variable selection (this is not
critical).
Evaluate importance of different models either via p-values or partial
dependency plots.
Note: It is not Stacking, because Stacking “votes” to decide on final
classification.
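A minimal sketch of this idea, assuming statsmodels and hypothetical prediction columns (one per model, e.g., pred_tree, pred_rf, pred_gb) stacked into a matrix:

```python
import statsmodels.api as sm

def ensemble_logit(pred_matrix, y):
    """Logistic regression of the original 0/1 target on the models' predicted
    probabilities (one column per model); p-values gauge each model's contribution."""
    X = sm.add_constant(pred_matrix)
    fit = sm.Logit(y, X).fit(disp=0)
    return fit          # inspect fit.params and fit.pvalues
```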
Partial Dependency Plots (PDP).
Due to GB's (and other methods') black-box nature, these plots show the effect of a predictor X on the modeled response once all other predictors have been marginalized (integrated away). Marginalized predictors are sometimes instead fixed at a constant value, typically the mean.
The graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome.
Formally, the PDP of F(x1, x2, …, xp) on X is E(F) over all variables except X. Thus, for given values of X, the PDP is the average of the training predictions with X kept constant.
Since GB, boosting, bagging, etc. are black-box models, use PDPs to obtain model interpretation. Also useful for logistic models.
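A short sketch of the "average of predictions with X held constant" computation (model, X_train, and feature_idx are hypothetical placeholders; scikit-learn's PartialDependenceDisplay provides a ready-made equivalent):

```python
import numpy as np

def partial_dependence(model, X_train, feature_idx, grid):
    """PDP at each grid value v: set the chosen column to v for every training row,
    predict, and average the predictions (all other predictors are marginalized)."""
    pdp = []
    for v in grid:
        X_mod = X_train.copy()
        X_mod[:, feature_idx] = v
        pdp.append(model.predict(X_mod).mean())
    return np.array(pdp)
```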

More Related Content

Similar to 4 2 ensemble models and grad boost part 1

Download It
Download ItDownload It
Download It
butest
 
week9_Machine_Learning.ppt
week9_Machine_Learning.pptweek9_Machine_Learning.ppt
week9_Machine_Learning.ppt
butest
 

Similar to 4 2 ensemble models and grad boost part 1 (20)

4_1_Tree World.pdf
4_1_Tree World.pdf4_1_Tree World.pdf
4_1_Tree World.pdf
 
4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdf4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdf
 
4 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-074 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-07
 
EPFL workshop on sparsity
EPFL workshop on sparsityEPFL workshop on sparsity
EPFL workshop on sparsity
 
4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf
 
4 MEDA.pdf
4 MEDA.pdf4 MEDA.pdf
4 MEDA.pdf
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
Data Science An Engineering Implementation Perspective
Data Science An Engineering Implementation PerspectiveData Science An Engineering Implementation Perspective
Data Science An Engineering Implementation Perspective
 
4 meda
4 meda4 meda
4 meda
 
4 2 ensemble models and grad boost part 2 2019-10-07
4 2 ensemble models and grad boost part 2 2019-10-074 2 ensemble models and grad boost part 2 2019-10-07
4 2 ensemble models and grad boost part 2 2019-10-07
 
M3R.FINAL
M3R.FINALM3R.FINAL
M3R.FINAL
 
Download It
Download ItDownload It
Download It
 
Automatic Differentiation and SciML in Reality: What can go wrong, and what t...
Automatic Differentiation and SciML in Reality: What can go wrong, and what t...Automatic Differentiation and SciML in Reality: What can go wrong, and what t...
Automatic Differentiation and SciML in Reality: What can go wrong, and what t...
 
GDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentGDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game Development
 
week9_Machine_Learning.ppt
week9_Machine_Learning.pptweek9_Machine_Learning.ppt
week9_Machine_Learning.ppt
 
Algoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nyaAlgoritma Random Forest beserta aplikasi nya
Algoritma Random Forest beserta aplikasi nya
 
Paper review: Measuring the Intrinsic Dimension of Objective Landscapes.
Paper review: Measuring the Intrinsic Dimension of Objective Landscapes.Paper review: Measuring the Intrinsic Dimension of Objective Landscapes.
Paper review: Measuring the Intrinsic Dimension of Objective Landscapes.
 
An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...
An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...
An Experimental Study about Simple Decision Trees for Bagging Ensemble on Dat...
 
Adam Ashenfelter - Finding the Oddballs
Adam Ashenfelter - Finding the OddballsAdam Ashenfelter - Finding the Oddballs
Adam Ashenfelter - Finding the Oddballs
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning Algorithm
 

More from Leonardo Auslender

4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf
Leonardo Auslender
 

More from Leonardo Auslender (18)

1 UMI.pdf
1 UMI.pdf1 UMI.pdf
1 UMI.pdf
 
Suppression Enhancement.pdf
Suppression Enhancement.pdfSuppression Enhancement.pdf
Suppression Enhancement.pdf
 
4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf
 
4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf
 
4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf
 
Classification methods and assessment.pdf
Classification methods and assessment.pdfClassification methods and assessment.pdf
Classification methods and assessment.pdf
 
Linear Regression.pdf
Linear Regression.pdfLinear Regression.pdf
Linear Regression.pdf
 
2 UEDA.pdf
2 UEDA.pdf2 UEDA.pdf
2 UEDA.pdf
 
3 BEDA.pdf
3 BEDA.pdf3 BEDA.pdf
3 BEDA.pdf
 
1 EDA.pdf
1 EDA.pdf1 EDA.pdf
1 EDA.pdf
 
0 Statistics Intro.pdf
0 Statistics Intro.pdf0 Statistics Intro.pdf
0 Statistics Intro.pdf
 
0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf
 
3 beda
3 beda3 beda
3 beda
 
2 ueda
2 ueda2 ueda
2 ueda
 
1 eda
1 eda1 eda
1 eda
 
0 statistics intro
0 statistics intro0 statistics intro
0 statistics intro
 
Classification methods and assessment
Classification methods and assessmentClassification methods and assessment
Classification methods and assessment
 
Linear regression
Linear regressionLinear regression
Linear regression
 

Recently uploaded

obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
siskavia95
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Valters Lauzums
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
pwgnohujw
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
pwgnohujw
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
zifhagzkk
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
jk0tkvfv
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
yulianti213969
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
aqpto5bt
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 

Recently uploaded (20)

obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic information
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 

4 2 ensemble models and grad boost part 1

  • 1. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 15/18/2018 Ensemble models and Gradient Boosting, part 1. Leonardo Auslender Independent Statistical Consultant Leonardo ‘dot’ Auslender ‘at’ Gmail ‘dot’ com. Copyright 2018.
  • 2. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 25/18/2018 Topics to cover: 1) Why more techniques? Bias-variance tradeoff. 2)Ensembles 1) Bagging – stacking 2) Random Forests 3) Gradient Boosting (GB) 4) Gradient-descent optimization method. 5) Innards of GB and example. 6) Overall Ensembles. 7) Partial Dependency Plots (PDP) 8) Case Studies: a. GB different parameters, b. raw data vs 50/50. 9) Xgboost 10)On the practice of Ensembles. 11)References.
  • 3. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 35/18/2018
  • 4. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 45/18/2018 1) Why more techniques? Bias-variance tradeoff. (Broken clock is right twice a day, variance of estimation = 0, bias extremely high. Thermometer is accurate overall, but reports higher/lower temperatures at night. Unbiased, higher variance. Betting on same horse always has zero variance, possibly extremely biased). Model error can be broken down into three components mathematically. Let f be estimating function. f-hat empirically derived function. Bet on right Horse and win. Bet on wrong Horse and lose. Bet on many Horses and win. Bet on many horses and lose.
  • 5. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 55/18/2018 Credit : Scott Fortmann-Roe (web)
  • 6. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 65/18/2018 Let X1, X2, X3,,, i.i.d random variables Well known that E(X) = , and variance (E(X)) = By just averaging estimates, we lower variance and assure same aspects of bias. Let us find methods to lower or stabilize variance (at least) while keeping low bias. And maybe also, lower the bias. And since cannot be fully attained, still searching for more techniques.  Minimize general objective function: n  Minimize loss function to reduce bias. Regularization, minimize model complexity. Obj(Θ) L(Θ) Ω(Θ), L(Θ) Ω(Θ)     set of model parameters.1 pwhere Ω {w ,,,,,,w },
  • 7. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 75/18/2018
  • 8. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 85/18/2018 Some terminology for Model combinations. Ensembles: general name Prediction/forecast combination: focusing on just outcomes Model combination for parameters: Bayesian parameter averaging We focus on ensembles as Prediction/forecast combinations.
  • 9. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-95/18/2018
  • 10. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 105/18/2018 Ensembles. Bagging (bootstrap aggregation, Breiman, 1996): Adding randomness  improves function estimation. Variance reduction technique, reducing MSE. Let initial data size n. 1) Construct bootstrap sample by randomly drawing n times with replacement (note, some observations repeated). 2) Compute sample estimator (logistic or regression, tree, ANN … Tree in practice). 3) Redo B times, B large (50 – 100 or more in practice, but unknown). 4) Bagged estimator. For classification, Breiman recommends majority vote of classification for each observation. Buhlmann (2003) recommends averaging bootstrapped probabilities. Note that individual obs may not appear B times each. NB: Independent sequence of trees. What if …….? Reduces prediction error by lowering variance of aggregated predictor while maintaining bias almost constant (variance/bias trade-off). Friedman (1998) reconsidered boosting and bagging in terms of gradient descent algorithms, seen later on.
  • 11. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 115/18/2018 From http://dni-institute.in/blogs/bagging-algorithm-concepts-with-example/
  • 12. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-125/18/2018 Ensembles Evaluation: Empirical studies: boosting (seen later) smaller misclassification rates compared to bagging, reduction of both bias and variance. Different boosting algorithms (Breiman’s arc-x4 and arc- gv). In cases with substantial noise, bagging performs better. Especially used in clinical studies. Why does Bagging work? Breiman: bagging successful because reduces instability of prediction method. Unstable: small perturbations in data  large changes in predictor. Experimental results show variance reduction. Studies suggest that bagging performs some smoothing on the estimates. Grandvalet (2004) argues that bootstrap sampling equalizes effects of highly influential observations. Disadvantage: cannot be visualized easily.
  • 13. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 135/18/2018 Ensembles Adaptive Bagging (Breiman, 2001): Mixes bias-reduction boosting with variance-reduction bagging. Uses out-of-bag obs to halt optimizer. Stacking: Previously, same technique used throughout. Stacking (Wolpert 1992) combines different algorithms on single data set. Voting is then used for final classification. Ting and Witten (1999) “stack” the probability distributions (PD) instead. Stacking is “meta-classifier”: combines methods. Pros: takes best from many methods. Cons: un-interpretable, mixture of methods become black-box of predictions. Stacking very prevalent in WEKA.
  • 14. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 145/18/2018 5.3) Tree World. 5.3.1) L. Breiman: Bagging. 2.2) L. Breiman: Random Forests
  • 15. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 15 Explanation by way of football example for The Saints. https://gormanalysis.com/random-forest-from-top-to-bottom/
       Opponent   OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin  SaintsWon
    1  Falcons     28    TRUE          TRUE            TRUE            TRUE
    2  Cowgirls    16    TRUE          TRUE            TRUE            TRUE
    3  Eagles      30    FALSE         FALSE           TRUE            TRUE
    4  Bucs         6    TRUE          FALSE           TRUE            FALSE
    5  Bucs        14    TRUE          FALSE           FALSE           FALSE
    6  Panthers     9    FALSE         TRUE            TRUE            FALSE
    7  Panthers    18    FALSE         FALSE           FALSE           FALSE
Goal: predict when Saints will win. 5 predictors: Opponent, opponent rank, home game, expert1 and expert2 predictions. If we run a tree, there is just one split, on Opponent, because the Saints lost to the Bucs and Panthers and there is perfect separation then, but it is useless for future opponents. Instead, at each step randomly select a subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which, when combined, should be a smart model. [Figure: 3 example trees; splits shown include OppRank <= 15 (left) / > 15 (right), Opponent in {Cowgirls, Eagles, Falcons}, Expert2 prediction (F = left, T = right), and OppRank <= 12.5.]
  • 16. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 165/18/2018
  • 17. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 175/18/2018 Assume the following test data and predictions:
    Test data:
       Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin
    1  Falcons     1    TRUE          TRUE            TRUE
    2  Falcons    32    TRUE          TRUE            FALSE
    3  Falcons    32    TRUE          FALSE           TRUE
    Predictions:
    Obs      Tree1  Tree2  Tree3  MajorityVote
    Sample1  FALSE  FALSE  TRUE   FALSE
    Sample2  TRUE   FALSE  TRUE   TRUE
    Sample3  TRUE   TRUE   TRUE   TRUE
Note that a probability can be ascribed by counting the # of votes for each predicted target class, which yields a good ranking of probabilities for the different classes. But problem: if “OppRk” (2nd best predictor) is in the initial group of 3 with “Opponent”, it won’t be used as splitter because “Opponent” is perfect. Note that there are 10 ways to choose 3 out of 5, and each predictor appears in 6 of them, so “Opponent” dominates 60% of trees, while OppRk appears without “Opponent” just 30% of the time. Could mitigate this effect by also sampling the training obs used to develop the model, giving OppRk a higher chance to be the root (not shown).
  • 18. Further, assume that Expert2 gives perfect predictions when the Saints lose (not when they win). Right now, Expert2 as a predictor is lost, but if resampling is with replacement, there is a higher chance to use Expert2 as predictor because more losses might just appear. Summary (data with N rows and p predictors; see the sketch after this list): 1) Determine # of trees to grow. 2) For each tree, randomly sample n <= N rows with replacement and create a tree with m <= p predictors selected randomly at each non-final node. 3) Combine the different tree predictions by majority voting (classification trees) or averaging (regression trees). Note that voting can be replaced by an average of probabilities, and averaging by medians.
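A minimal scikit-learn sketch of that summary; the specific values of n_estimators and max_features are illustrative assumptions:

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=500,   # 1) number of trees to grow
        bootstrap=True,     # 2) each tree sees n rows sampled with replacement
        max_features=3,     #    m <= p predictors tried at each node
        random_state=0)
    # rf.fit(X_train, y_train)            # 3) combine by vote / averaged probabilities
    # rf.predict_proba(X_test)[:, 1]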
  • 19. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-195/18/2018 Definition of Random Forests. Decision Tree Forest: ensemble (collection) of decision trees whose predictions are combined to make overall prediction for the forest. Similar to TreeBoost (Gradient boosting) model because large number of trees are grown. However, TreeBoost generates series of trees with output of one tree going into next tree in series. In contrast, decision tree forest grows number of independent trees in parallel, and they do not interact until after all of them have been built. Disadvantage: complex model, cannot be visualized like single tree. More “black box” like neural network  advisable to create both single- tree and tree forest model. Single-tree model can be studied to get intuitive understanding of how predictor variables relate, and decision tree forest model can be used to score data and generate highly accurate predictions.
  • 20. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-205/18/2018 Random Forests 1. Random sample of N observations with replacement (“bagging”). On average, about 2/3 of rows selected. Remaining 1/3 called “out of bag (OOB)” obs. New random selection is performed for each tree constructed. 2. Using obs selected in step 1, construct decision tree. Build tree to maximum size, without pruning. As tree is built, allow only subset of total set of predictor variables to be considered as possible splitters for each node. Select set of predictors to be considered as random subset of total set of available predictors. For example, if there are ten predictors, choose five randomly as candidate splitters. Perform new random selection for each split. Some predictors (possibly best one) will not be considered for each split, but predictor excluded from one split may be used for another split in same tree.
  • 21. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-215/18/2018 Random Forests No Overfitting or Pruning. "Over-fitting“: problem in large, single-tree models where model fits noise in data  poor generalization power  pruning. In nearly all cases, decision tree forests do not have problem with over-fitting, and no need to prune trees in forest. Generally, more trees in forest, better fit. Internal Measure of Test Set (Generalization) Error . About 1/3 of observations excluded from each tree in forest, called “out of bag (OOB)”: each tree has different set of out-of-bag observations  each OOB set constitutes independent test sample. To measure generalization error of decision tree forest, OOB set for each tree is run through tree and error rate of prediction is computed.
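A minimal sketch of that internal OOB estimate with scikit-learn, which reports it as oob_score_ (accuracy by default) when oob_score=True; the hyperparameter values are illustrative:

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=500, bootstrap=True,
                                oob_score=True, random_state=0)
    # rf.fit(X_train, y_train)
    # oob_error = 1.0 - rf.oob_score_    # OOB misclassification rate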
  • 22. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 225/18/2018 Detour: Found in the Internet: PCA and RF. https://stats.stackexchange.com/questions/294791/ how-can-preprocessing-with-pca-but-keeping-the-same-dimensionality-improve-rando ?newsletter=1&nlcode=348729%7c8657 Discovery? “PCA before random forest can be useful not for dimensionality reduction but to give you data a shape where random forest can perform better. I am quite sure that in general if you transform your data with PCA keeping the same dimensionality of the original data you will have a better classification with random forest.” Answer: “Random forest struggles when the decision boundary is "diagonal" in the feature space because RF has to approximate that diagonal with lots of "rectangular" splits. To the extent that PCA re-orients the data so that splits perpendicular to the rotated & rescaled axes align well with the decision boundary, PCA will help. But there's no reason to believe that PCA will help in general, because not all decision boundaries are improved when rotated (e.g. a circle). And even if you do have a diagonal decision boundary, or a boundary that would be easier to find in a rotated space, applying PCA will only find that rotation by coincidence, because PCA has no knowledge at all about the classification component of the task (it is not "y-aware"). Also, the following caveat applies to all projects using PCA for supervised learning: data rotated by PCA may have little-to-no relevance to the classification objective.” DO NOT BELIEVE EVERYTHING THAT APPEARS IN THE WEB!!!!! BE CRITICAL!!!
  • 23. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 235/18/2018 Further Developments. Paluszynska (2017) focuses on providing better information on variable importance using RF. RF is constantly being researched and improved.
  • 24. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 245/18/2018
  • 25. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 255/18/2018 Detour: Underlying idea for boosting classification models (NOT yet GB). (Freund, Schapire, 2012, Boosting: Foundations and Algorithms, MIT) Start with model M(X) and obtain 80% accuracy, or 60% R2, etc. Then Y = M(X) + error1. Hypothesize that the error is still correlated with Y, so error1 = G(X) + error2, where we now model error1; in general error(t-1) = Z(X) + error(t), so that Y = M(X) + G(X) + … + Z(X) + error(t-k). If we find optimal beta weights for the combined models, then Y = b1 * M(X) + b2 * G(X) + … + bt * Z(X) + error(t-k). Boosting is a “Forward Stagewise Ensemble method” on a single data set, iteratively reweighting observations according to the previous error, especially focusing on wrongly classified observations. Philosophy: focus on the points most difficult to classify in the previous step by reweighting observations.
  • 26. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 265/18/2018 Main idea of GB using trees (GBDT). Let Y be the target and X the predictors, with f0(X) a weak model to predict Y that just predicts the mean value of Y; “weak” to avoid over-fitting. Improve on f0(X) by creating f1(X) = f0(X) + h(x). If h were a perfect model, then f1(X) = y, so h(x) = y - f0(X) = residuals = negative gradients of the loss (or cost) function (exactly so for squared-error loss). [Figure: residual fitting; the negative gradient is -(y – f(x)) for squared error and takes values -1 or 1 for absolute-type loss.]
  • 27. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 275/18/2018 Explanation of GB by way of example. /blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/ Predict age in the following data set by way of trees; continuous target, hence a regression tree. Predict age, loss function: SSE.
    PersonID  Age  LikesGardening  PlaysVideoGames  LikesHats
    1         13   FALSE           TRUE             TRUE
    2         14   FALSE           TRUE             FALSE
    3         15   FALSE           TRUE             FALSE
    4         25   TRUE            TRUE             TRUE
    5         35   FALSE           TRUE             TRUE
    6         49   TRUE            FALSE            FALSE
    7         68   TRUE            TRUE             TRUE
    8         71   TRUE            FALSE            FALSE
    9         73   TRUE            FALSE            TRUE
  • 28. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 285/18/2018 Only 9 obs in the data, thus we allow the tree to have a very small # of obs in the final nodes. We want the Videos variable because we suspect it is important. But doing so (by allowing few obs in final nodes) also brought in a split on “hats”, which seems irrelevant and just noise leading to over-fitting, because the tree searches in smaller and smaller areas of the data as it progresses. Let’s go in steps and look at the results of Tree1 (before the second splits), stopping at the first split, where the predictions are 19.25 and 57.2, and obtain residuals. [Figure: Tree 1: root split on LikesGardening (F: 19.25, T: 57.2), with deeper splits on Hats and Videos.]
  • 29. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 295/18/2018 Run another tree using Tree1 residuals as the new target.
    PersonID  Age  Tree1 Prediction  Tree1 Residual
    1         13   19.25             -6.25
    2         14   19.25             -5.25
    3         15   19.25             -4.25
    4         25   57.2              -32.2
    5         35   19.25             15.75
    6         49   57.2              -8.2
    7         68   57.2              10.8
    8         71   57.2              13.8
    9         73   57.2              15.8
  • 30. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 305/18/2018 [Figure: Tree 2: root split on PlaysVideoGames (F: 7.133, T: -3.567).] Note: Tree2 did not use “LikesHats” because, between Hats and VideoGames, VideoGames is preferred when using all obs, instead of (as in the full Tree1) in a smaller region of the data where Hats appears; and thus noise is avoided. Tree 1 SSE = 1994, Tree 2 SSE = 1765.
    PersonID  Age  Tree1 Prediction  Tree1 Residual  Tree2 Prediction  Combined Prediction  Final Residual
    1         13   19.25             -6.25           -3.567            15.68                2.683
    2         14   19.25             -5.25           -3.567            15.68                1.683
    3         15   19.25             -4.25           -3.567            15.68                0.6833
    4         25   57.2              -32.2           -3.567            53.63                28.63
    5         35   19.25             15.75           -3.567            15.68                -19.32
    6         49   57.2              -8.2            7.133             64.33                15.33
    7         68   57.2              10.8            -3.567            53.63                -14.37
    8         71   57.2              13.8            7.133             64.33                -6.667
    9         73   57.2              15.8            7.133             64.33                -8.667
Combined pred for PersonID 1: 15.68 = 19.25 – 3.567
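A minimal Python sketch that reproduces the two-stage fit above with depth-1 regression trees; the toy data are the nine observations from the slides, and the 1 = TRUE encoding is an assumption of the sketch:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    age = np.array([13, 14, 15, 25, 35, 49, 68, 71, 73], dtype=float)
    X = np.array([            # columns: LikesGardening, PlaysVideoGames, LikesHats
        [0, 1, 1], [0, 1, 0], [0, 1, 0], [1, 1, 1], [0, 1, 1],
        [1, 0, 0], [1, 1, 1], [1, 0, 0], [1, 0, 1]], dtype=float)

    tree1 = DecisionTreeRegressor(max_depth=1).fit(X, age)      # splits on gardening
    resid1 = age - tree1.predict(X)                             # Tree1 residuals
    tree2 = DecisionTreeRegressor(max_depth=1).fit(X, resid1)   # splits on video games
    combined = tree1.predict(X) + tree2.predict(X)              # 19.25 - 3.567 = 15.68 for person 1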
  • 31. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 315/18/2018 So Far 1) Started with ‘weak’ model F0(x) = y 2) Fitted second model to residuals h1(x) = y – F0(x) 3) Combined two previous models F2(x) = F1(x) + h1(x). Notice that h1(x) could be any type of model (stacking), not just trees. And continue re-cursing until M. Initial weak model was “mean” because well known that mean minimizes SSE. Q: how to choose M, gradient boosting hyper parameter? Usually cross-validation. 4) Alternative to mean: minimize Absolute error instead of SSE as loss function. More expensive because minimizer is median, computationally expensive. In this case, in Tree 1 above, use median (y) = 35, and obtain residuals. PersonID Age F0 Residual0 1 13 35 -22 2 14 35 -21 3 15 35 -20 4 25 35 -10 5 35 35 0 6 49 35 14 7 68 35 33 8 71 35 36 9 73 35 38
  • 32. Focus on observations 1 and 4, with respective residuals of -22 and -10, to understand the median case. Under the SSE loss function (standard tree regression), a reduction in residuals of 1 unit drops SSE by 43 and 19 resp. (e.g., 22*22 – 21*21, 100 – 81), while for absolute loss the reduction is just 1 and 1 (22 – 21, 10 – 9). So SSE reduction will focus more on the first observation (because of the 43), while absolute error focuses on all obs because they all count 1. Instead of training subsequent trees on the residuals of F0, train h0 on the gradient of the loss function L(y, F0(x)) w.r.t. the predictions (y-hats) produced by F0(x). With absolute error loss, subsequent h trees will consider the sign of every residual, as opposed to SSE loss, which considers the magnitude of the residual. The gradient of SSE is “– residual”, so this is a gradient descent algorithm. For absolute error: each h tree groups observations into final nodes, and the average gradient can be calculated in each and scaled by a factor γ such that Fm + γm hm minimizes the loss function in each node. Shrinkage: for each gradient step, the magnitude is multiplied by a factor between 0 and 1 called the learning rate; each gradient step is shrunken, allowing slow convergence toward the observed values, and observations close to their target values end up grouped into larger nodes, thus regularizing the method. Finally, before each new tree step, row and column sampling occur to produce more different tree splits (similar to Random Forests). Absolute Error (AE) = |Y – Ŷ|; Gradient of AE = dAE/dŶ = -1 if Y > Ŷ, +1 if Y < Ŷ.
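A LaTeX restatement of the two gradients just discussed, taking SSE as one-half the squared error so that its gradient is exactly the negative residual (the 1/2 is a convention assumed here):

    \[
      L_{SSE} = \tfrac{1}{2}(y - \hat y)^2, \qquad
      \frac{\partial L_{SSE}}{\partial \hat y} = -(y - \hat y) = -\,\text{residual}
    \]
    \[
      L_{AE} = |y - \hat y|, \qquad
      \frac{\partial L_{AE}}{\partial \hat y} = -\operatorname{sign}(y - \hat y)
      = \begin{cases} -1 & y > \hat y \\ +1 & y < \hat y \end{cases}
    \]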
  • 33. Results for SSE and Absolute Error: SSE case.
[Figures: h0: root split on Gardening (F: -21.08, T: 16.87); h1: root split on Videos (F: 7.133, T: -3.567).]
    Age  F0     PseudoResidual0  h0      gamma0  F1     PseudoResidual1  h1      gamma1  F2
    13   40.33  -27.33           -21.08  1       19.25  -6.25            -3.567  1       15.68
    14   40.33  -26.33           -21.08  1       19.25  -5.25            -3.567  1       15.68
    15   40.33  -25.33           -21.08  1       19.25  -4.25            -3.567  1       15.68
    25   40.33  -15.33           16.87   1       57.2   -32.2            -3.567  1       53.63
    35   40.33  -5.333           -21.08  1       19.25  15.75            -3.567  1       15.68
    49   40.33  8.667            16.87   1       57.2   -8.2             7.133   1       64.33
    68   40.33  27.67            16.87   1       57.2   10.8             -3.567  1       53.63
    71   40.33  30.67            16.87   1       57.2   13.8             7.133   1       64.33
    73   40.33  32.67            16.87   1       57.2   15.8             7.133   1       64.33
E.g., for the first observation: 40.33 is the mean age, -27.33 = 13 – 40.33, -21.08 is the prediction due to Gardening = F. F1 = 19.25 = 40.33 – 21.08. PseudoResidual1 = 13 – 19.25, F2 = 19.25 – 3.567 = 15.68. Gamma0 = avg(PseudoResidual0 / h0) (by the different values of h0). Same for gamma1.
  • 34. Results for SSE and Absolute Error: Absolute Error case.
[Figures: h0: root split on Gardening (F: -1, T: 0.6); h1: root split on Videos (F: 0.333, T: -0.333).]
    Age  F0  PseudoResidual0  h0   gamma0  F1    PseudoResidual1  h1       gamma1  F2
    13   35  -1               -1   20.5    14.5  -1               -0.3333  0.75    14.25
    14   35  -1               -1   20.5    14.5  -1               -0.3333  0.75    14.25
    15   35  -1               -1   20.5    14.5  1                -0.3333  0.75    14.25
    25   35  -1               0.6  55      68    -1               -0.3333  0.75    67.75
    35   35  -1               -1   20.5    14.5  1                -0.3333  0.75    14.25
    49   35  1                0.6  55      68    -1               0.3333   9       71
    68   35  1                0.6  55      68    -1               -0.3333  0.75    67.75
    71   35  1                0.6  55      68    1                0.3333   9       71
    73   35  1                0.6  55      68    1                0.3333   9       71
E.g., for the 1st observation: 35 is the median age; the pseudo-residual is -1 or 1 if the residual is < 0 or > 0, resp. F1 = 14.5 because 35 + 20.5 * (-1). F2 = 14.25 = 14.5 + 0.75 * (-0.3333). Predictions within leaf nodes are computed by the “mean” of the obs therein. Gamma0 = median((age – F0) / h0) = avg((14 – 35) / -1, (15 – 35) / -1) = 20.5; 55 = (68 – 35) / 0.6. Gamma1 = median((age – F1) / h1) by the different values of h1 (and of h0 for gamma0).
  • 35. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 355/18/2018 Quick description of GB using trees (GBDT). 1) Create a very small tree as the initial model, a ‘weak’ learner (e.g., a tree with two terminal nodes, i.e., depth = 1). ‘WEAK’ avoids over-fitting and local minima, and predicts F1 for each obs. Tree1. 2) Each tree allocates a probability of event or a mean value in each terminal node, according to the nature of the dependent variable or target. 3) Compute “residuals” (prediction error) for every observation (if 0-1 target, apply the logistic transformation to linearize them, log(p / (1 – p))). 4) Use the residuals as the new ‘target’ variable and grow a second small tree on them (second stage of the process, same depth). To ensure against over-fitting, use a random sample without replacement (“stochastic gradient boosting”). Tree2. 5) New model: once the second stage is complete, we obtain the concatenation of the two trees, Tree1 and Tree2, and predictions F1 + F2 * gamma, where gamma is the multiplier or shrinkage factor (called step size in gradient descent). 6) Iterate the procedure of computing residuals from the most recent tree, which become the target of the new model, etc. 7) In the case of a binary target variable, each tree produces at least some nodes in which the ‘event’ is the majority (‘events’ are typically more difficult to identify since most data sets contain a very low proportion of ‘events’ in the usual case). 8) The final score for each observation is obtained by summing (with weights) the different scores (probabilities) of every tree for each observation. (A code sketch follows below.) Why does it work? Why “gradient” and “boosting”?
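A minimal Python sketch of the loop above for a continuous target with squared-error loss, where the residual is the negative gradient; M, the depth, the learning rate, and the 60% subsample are illustrative assumptions, and X, y are NumPy arrays:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gbdt_fit(X, y, M=100, depth=1, learning_rate=0.1, subsample=0.6, seed=0):
        rng = np.random.default_rng(seed)
        f0 = y.mean()                               # step 1: weak initial prediction
        F = np.full(len(y), f0)
        trees = []
        for _ in range(M):                          # step 6: iterate
            resid = y - F                           # step 3: residuals = new target
            idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
            t = DecisionTreeRegressor(max_depth=depth).fit(X[idx], resid[idx])  # step 4
            F = F + learning_rate * t.predict(X)    # step 5: shrunken update
            trees.append(t)
        return f0, trees

    def gbdt_predict(f0, trees, X, learning_rate=0.1):
        return f0 + learning_rate * sum(t.predict(X) for t in trees)   # step 8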
  • 36. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 365/18/2018 Comparing GBDT vs Trees in point 4 above (I). GBDT takes a sample from the training data to create a tree at each iteration; CART does not. Below, notice the differences between a 60% sample proportion for GBDT and no sample for generic trees on the fraud data set; Total_spend is the target. Predictions are similar.
    /* GBDT */
    IF doctor_visits < 8.5 THEN DO;
       _prediction_ + -1208.458663;
    END;
    ELSE DO;
       _prediction_ + 1360.7910083;
    END;
    /* GENERIC TREES */
    IF 8.5 <= doctor_visits THEN DO;
       P_pseudo_res0 = 1378.74081896893;
    END;
    ELSE DO;
       P_pseudo_res0 = -1290.94575707227;
    END;
  • 37. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 375/18/2018 Comparing GBDT vs Trees in point 4 above (II). Again, GBDT takes a sample from the training data to create a tree at each iteration; CART does not. If we allow CART to work with a same-proportion sample but a different seed, the splitting variables may be different at a specific depth of tree creation. EDA of the two samples would indicate subtle differences that induce the differences in selected splitting variables.
    /* GBDT */
    IF doctor_visits < 8.5 THEN DO;
       _ARB_F_ + -579.8214325;
    END;
    ELSE DO;
       _ARB_F_ + 701.49142697;
    END;
    /* ORIGINAL TREES */
    IF 183.5 <= member_duration THEN DO;
       P_pseudo_res0 = 1677.87318718526;
    END;
    ELSE DO;
       P_pseudo_res0 = -1165.32773940565;
    END;
  • 38. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 385/18/2018 More Details. Friedman’s general 2001 GB algorithm: 1) Data (Y, X), Y (N x 1), X (N x p). 2) Choose # of iterations M. 3) Choose loss function L(Y, F(x)) and corresponding gradient (e.g., 0-1 loss function, with residuals as the corresponding gradient). The function is called ‘f’; the loss function is implied by Y. 4) Choose base learner h(X, θ), say shallow trees. Algorithm: 1: initialize f0 with a constant, usually the mean of Y. 2: for t = 1 to M do 3: compute the negative gradient gt(x), i.e., the residual from Y, as the next target. 4: fit a new base-learner function h(x, θt), i.e., a tree. 5: find the best gradient-descent step size that minimizes the loss: γ_t = argmin_{γ > 0} Σ_{i=1}^{n} L(y_i, f_{t-1}(x_i) + γ h_t(x_i)). 6: update the function estimate: f_t = f_{t-1}(x) + γ_t h_t(x, θ_t). 7: end for. (All f functions are function estimates, i.e., ‘hats’.)
  • 39. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 395/18/2018 Specifics of Tree Gradient Boosting, called TreeBoost (Friedman). Friedman’s 2001 GB algorithm for tree methods: same as the previous one, with h_t(x) = Σ_{j=1}^{J} p_jt I(x ∈ N_jt), where p_jt is the prediction of tree t in final node N_jt. In TreeBoost, Friedman proposes to find an optimal γ_jt in each final node instead of a unique γ_t at every iteration. Then f_t(x) = f_{t-1}(x) + Σ_{j=1}^{J} γ_jt h_t(x) I(x ∈ N_jt), with γ_jt = argmin_γ Σ_{x_i ∈ N_jt} L(y_i, f_{t-1}(x_i) + γ h_t(x_i)).
  • 40. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 405/18/2018 Parallels with Stepwise (regression) methods. Stepwise starts from original Y and X, and in later iterations turns to residuals, and reduced and orthogonalized X matrix, where ‘entered’ predictors are no longer used and orthogonalized away from other predictors. GBDT uses residuals as targets, but does not orthogonalize or drop any predictors. Stepwise stops either by statistical inference, or AIC/BIC search. GBDT has a fixed number of iterations. Stepwise has no ‘gamma’ (shrinkage factor).
  • 41. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 415/18/2018 Setting. Hypothesize existence of function Y = f (X, betas, error). Change of paradigm, no MLE (e.g., logistic, regression, etc) but loss function. Minimize Loss function itself, its expected value called risk. Many different loss functions available, gaussian, 0-1, etc. A loss function describes the loss (or cost) associated with all possible decisions. Different decision functions or predictor functions will tend to lead to different types of mistakes. The loss function tells us which type of mistakes we should be more concerned about. For instance, estimating demand, decision function could be linear equation and loss function could be squared or absolute error. The best decision function is the function that yields the lowest expected loss, and the expected loss function is itself called risk of an estimator. 0-1 assigns 0 for correct prediction, 1 for incorrect.
  • 42. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 655/18/2018 Key Details. Friedman’s 2001 GB algorithm needs: 1) Loss function (usually determined by the nature of Y (binary, continuous, …)) (NO MLE). 2) Weak learner, typically a tree stump or spline, a marginally better classifier than random (but by how much?). 3) Model with T iterations: ŷ_i = Σ_{t=1}^{T} tree_t(X). Objective function: Σ_{i=1}^{n} L(y_i, ŷ_i) + Σ_{t=1}^{T} Ω(Tree_t), where Ω = {# of nodes in each tree; L2 or L1 norm of leaf weights; other}. Ω is not directly optimized by GB.
  • 43. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 435/18/2018 L2-error penalizes symmetrically away from 0, Huber penalizes less than OLS outside [-1, 1], Bernoulli and Adaboost are very similar. Note that Y ∈ {-1, 1} in the 0-1 case here.
  • 44. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 445/18/2018
  • 45. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 455/18/2018 Gradient Descent. “Gradient” descent is a method to find the minimum of a function. Gradient: multivariate generalization of the derivative of a function in one dimension to many dimensions, i.e., the gradient is the vector of partial derivatives. In one dimension, the gradient is the tangent to the function. Easier to work with convex and “smooth” functions. [Figure: convex vs. non-convex functions.]
  • 46. Gradient Descent. Let L(x1, x2) = 0.5 * (x1 – 15)**2 + 0.5 * (x2 – 25)**2, and solve for the x1 and x2 that minimize L by gradient descent. Steps: take M = 100, starting point s0 = (0, 0), step size = 0.1. Iterate m = 1 to M: 1. Calculate the gradient of L at s_{m-1}. 2. Step in the direction of greatest descent (the negative gradient) with step size γ, i.e., s_m = s_{m-1} – γ ∇L(s_{m-1}). If γ is small and M large, s_M minimizes L. Additional considerations: instead of M iterations, stop when the next improvement is small; use line search to choose step sizes (line search chooses the step length along the descent direction). (A code sketch follows below.) How does it work in gradient boosting? The objective is min L, starting from F0(x). For m = 1, compute the gradient of L w.r.t. F0(x). Then fit a weak learner to the gradient components; for a regression tree, obtain the average gradient in each final node. In each node, step in the direction of the avg. gradient, using line search to determine the step magnitude. The outcome is F1; repeat. In symbols: initialize the model with a constant, F0(x) = mean, median, etc. For m = 1 to M: compute the pseudo-residual, fit base learner h_m to the residuals, compute step magnitude γ_m (for trees, a different γ for each node), and update Fm(x) = Fm-1(x) + γm hm(x).
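A minimal Python sketch of the numerical example above (fixed step size, no line search):

    import numpy as np

    def grad_L(s):
        # gradient of L(x1, x2) = 0.5*(x1 - 15)**2 + 0.5*(x2 - 25)**2
        return np.array([s[0] - 15.0, s[1] - 25.0])

    s = np.array([0.0, 0.0])       # starting point s0
    step = 0.1                     # fixed step size (a line search could replace this)
    for m in range(100):           # M = 100
        s = s - step * grad_L(s)   # step along the negative gradient
    # s is now very close to the minimizer (15, 25)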
  • 47. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 475/18/2018 “Gradient” descent Method of gradient descent is a first order optimization algorithm that is based on taking small steps in direction of the negative gradient at one point in the curve in order to find the (hopefully global) minimum value (of loss function). If it is desired to search for the maximum value instead, then the positive gradient is used and the method is then called gradient ascent. Second order not searched, solution could be local minimum. Requires starting point, possibly many to avoid local minima.
  • 48. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 485/18/2018
  • 49. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 Ch. 5-495/18/2018 Comparing full tree (depth = 6) to boosted tree residuals by iteration. 2 GB versions: 1) with raw 20% events (M1), 2) with a 50/50 mixture of events (M2). The non-GB tree (referred to as maxdepth 6 for the M1 data set) is the most biased. Notice that M2 stabilizes earlier than M1. X axis: iteration #. Y axis: average residual. “Tree Depth 6” is obviously unaffected by iteration since it is a single-tree run. [Figure: Avg residuals by iteration by model name in gradient boosting (MEAN_RESID_M1_TRN_TREES, MEAN_RESID_M2_TRN_TREES, reference value for Tree depth 6); vertical line marks where the mean stabilizes.]
  • 50. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 505/18/2018 Comparing full tree (depth = 6) to boosted tree residuals by iteration. Now Y = variance of residuals. M2 has the highest variance, followed by Depth 6 (single tree) and then M1. M2 stabilizes earlier as well. In conclusion, M2 has lower bias and higher variance than M1 in this example; the difference lies in the mixture of 0-1 in the target variable. [Figure: Variance of residuals by iteration in gradient boosting (VAR_RESID_M1_TRN_TREES, VAR_RESID_M2_TRN_TREES, reference value Depth 6 = 0.145774); vertical line marks where the variance stabilizes.]
  • 51. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 515/18/2018
  • 52. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 525/18/2018 Important Message. Basic information on the original data sets:
    Data set name ............... train
    # TRN obs ................... 3595
    Validation data set ......... validata
    # VAL obs ................... 2365
    Test data set ...............
    # TST obs ................... 0
    Dep variable ................ fraud
    Pct Event Prior TRN ......... 20.389
    Pct Event Prior VAL ......... 19.281
    Pct Event Prior TEST ........
TRN and VAL data sets obtained by random sampling without replacement.
  • 53. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 535/18/2018
    Variable          Label
    FRAUD             Fraudulent Activity yes/no
    total_spend       Total spent on opticals
    doctor_visits     Total visits to a doctor
    no_claims         No of claims made recently
    member_duration   Membership duration
    optom_presc       Number of opticals claimed
    num_members       Number of members covered
Fraud data set, original 20% fraudsters. Study alternatives of changing the number of iterations from 3 to 50 and the depth from 1 to 10 with training and validation data sets. Original percentage of fraudsters: 20% in both data sets. Notice just 5 predictors, thus the max number of iterations of 50 is an exaggeration. In usual large databases, the number of iterations could reach 1000 or higher.
  • 54. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 545/18/2018 E.g., M5_VAL_GRAD_BOOSTING: M5 case with the validation data set and using gradient boosting as the modeling technique; model # 10 as identifier. Requested Models: Names & Descriptions.
    *** Overall Models
    M1   Raw 20pct, depth 1, iterations 3
    M2   Raw 20pct, depth 1, iterations 10
    M3   Raw 20pct, depth 5, iterations 3
    M4   Raw 20pct, depth 5, iterations 10
    M5   Raw 20pct, depth 10, iterations 50
    Model #   Full Model Name             Model Description
    1         01_M1_TRN_GRAD_BOOSTING     Gradient Boosting
    2         02_M1_VAL_GRAD_BOOSTING     Gradient Boosting
    3         03_M2_TRN_GRAD_BOOSTING     Gradient Boosting
    4         04_M2_VAL_GRAD_BOOSTING     Gradient Boosting
    5         05_M3_TRN_GRAD_BOOSTING     Gradient Boosting
    6         06_M3_VAL_GRAD_BOOSTING     Gradient Boosting
    7         07_M4_TRN_GRAD_BOOSTING     Gradient Boosting
    8         08_M4_VAL_GRAD_BOOSTING     Gradient Boosting
    9         09_M5_TRN_GRAD_BOOSTING     Gradient Boosting
    10        10_M5_VAL_GRAD_BOOSTING     Gradient Boosting
  • 55. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 555/18/2018
  • 56. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 565/18/2018 All models agree on No_claims as the first split, but at different values, and they yield different event probabilities.
  • 57. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 575/18/2018 Note M2 split
  • 58. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 585/18/2018 Constrained GB parameters may create undesirable models, but parameters with high values may lead to running times that are too long, especially when models have to be re-touched.
  • 59. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 595/18/2018
  • 60. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 605/18/2018 Variable importance is model dependent and could lead to misleading conclusions.
  • 61. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 615/18/2018
  • 62. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 625/18/2018 Goodness Of Fit.
  • 63. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 635/18/2018 Probability range largest for M5.
  • 64. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 645/18/2018
  • 65. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 655/18/2018 M5 best per AUROC, also when validated.
  • 66. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 665/18/2018
  • 67. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 675/18/2018
  • 68. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 685/18/2018 Specific GOFs in rank order.
  • 69. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 695/18/2018 GOF ranks (training data): rank per GOF measure (AUROC, Avg Square Error, Cum Lift 3rd bin, Cum Resp Rate 3rd, Gini, R-square Cramer Tjur) and unweighted mean.
    Model Name                 AUROC  ASE  CumLift3rd  CumResp3rd  Gini  R2 Cramer Tjur  Unw. Mean
    01_M1_TRN_GRAD_BOOSTING    5      5    5           5           5     5               5.00
    03_M2_TRN_GRAD_BOOSTING    4      4    4           4           4     4               4.00
    05_M3_TRN_GRAD_BOOSTING    3      3    3           3           3     3               3.00
    07_M4_TRN_GRAD_BOOSTING    2      2    2           2           2     2               2.00
    09_M5_TRN_GRAD_BOOSTING    1      1    1           1           1     1               1.00
GOF ranks (validation data): same measures.
    Model Name                 AUROC  ASE  CumLift3rd  CumResp3rd  Gini  R2 Cramer Tjur  Unw. Mean
    02_M1_VAL_GRAD_BOOSTING    5      5    5           5           5     5               5.00
    04_M2_VAL_GRAD_BOOSTING    4      4    4           4           4     4               4.00
    06_M3_VAL_GRAD_BOOSTING    3      3    3           3           3     3               3.00
    08_M4_VAL_GRAD_BOOSTING    2      2    2           2           2     2               2.00
    10_M5_VAL_GRAD_BOOSTING    1      1    1           1           1     1               1.00
  • 70. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 705/18/2018 M5 winner.
  • 71. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 715/18/2018 Huge jump in performance per R-square measure.
  • 72. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 725/18/2018
  • 73. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 735/18/2018 Overall conclusion for GB parameters: while higher values of the number of iterations and of depth imply longer (and possibly significantly longer) computer runs, constraining these parameters can have significant negative effects on model results. In a context of thousands of predictors, computer resource availability might significantly affect model results.
  • 74. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 745/18/2018
  • 75. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 755/18/2018 Overall Ensembles. Given specific classification study and many different modeling techniques, create logistic regression model with original target variable and the different predictions from the different models, without variable selection (this is not critical). Evaluate importance of different models either via p-values or partial dependency plots. Note: It is not Stacking, because Stacking “votes” to decide on final classification.
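A minimal Python sketch of that overall ensemble, assuming y is the original 0/1 target and p_m1, p_m2, p_m3 are predicted probabilities from three previously fitted models (the names are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def overall_ensemble(y, p_m1, p_m2, p_m3):
        Z = np.column_stack([p_m1, p_m2, p_m3])   # model predictions as the only inputs,
        ens = LogisticRegression().fit(Z, y)      # no variable selection
        return ens                                # inspect ens.coef_ (p-values would need, e.g., statsmodels)

The fitted coefficients indicate how much each candidate model contributes to the ensemble, which is the evaluation step described above.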
  • 76. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 765/18/2018
  • 77. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 775/18/2018 Partial Dependency Plots (PDP). Due to GB’s (and other methods’) black-box nature, these plots show the effect of predictor X on the modeled response once all other predictors have been marginalized (integrated away). The marginalized predictors are usually fixed at a constant value, typically the mean. Graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome. Formally, the PDP of F(x1, x2, …, xp) on X is E(F) over all vars except X. Thus, for given values of X, the PDP is the average of the training predictions with X kept constant. Since GB, Boosting, Bagging, etc. are BLACK BOX models, use PDPs to obtain model interpretation. Also useful for logistic models. (A code sketch follows below.)
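A minimal Python sketch of the averaging definition above for one predictor of a fitted classifier; the grid, the use of predict_proba, and the variable names are assumptions of the sketch:

    import numpy as np

    def partial_dependence(model, X, j, grid):
        # for each grid value v: fix column j at v, score all training rows, average
        pdp = []
        for v in grid:
            Xv = X.copy()
            Xv[:, j] = v
            pdp.append(model.predict_proba(Xv)[:, 1].mean())
        return np.array(pdp)

    # grid = np.linspace(X[:, j].min(), X[:, j].max(), 20)
    # then plot grid vs. partial_dependence(gb_model, X, j, grid)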
  • 78. Leonardo Auslender Copyright 2004Leonardo Auslender – Copyright 2018 785/18/2018