Leonardo Auslender – Copyright 2018
5/9/2018
1) Why more techniques? The bias-variance tradeoff.
(A broken clock is right twice a day: variance of estimation = 0, bias extremely high. A thermometer that is accurate on average but reads too high or too low at night: unbiased, higher variance. Betting on the same horse always has zero variance, but is possibly extremely biased.)
Model error can be broken down into three components mathematically. Let f be the estimating function and f-hat the empirically derived estimate; then the expected squared error decomposes into squared bias, variance, and irreducible error:
E[(y - f̂(x))²] = [E f̂(x) - f(x)]² + E[f̂(x) - E f̂(x)]² + σ².
(Diagram, betting outcomes: bet on the right horse and win; bet on the wrong horse and lose; bet on many horses and win; bet on many horses and lose.)
Let X1, X2, X3, ... be i.i.d. random variables with mean μ and variance σ². It is well known that E(X̄) = μ and Var(X̄) = σ²/n.
By just averaging estimates we lower the variance while keeping the bias unchanged.
Let us find methods that lower or at least stabilize the variance while keeping bias low, and maybe also lower the bias. And since this cannot be fully attained, the search for more techniques continues.
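The variance-reduction arithmetic above can be checked with a short simulation; the particular numbers (σ = 2, B = 50, true mean 10) are illustrative only:

```python
import numpy as np

# Averaging B i.i.d. estimates divides the variance by B
# while leaving the expectation (hence the bias) unchanged.
rng = np.random.default_rng(0)
sigma, B = 2.0, 50
draws = rng.normal(10.0, sigma, size=(100_000, B))

single = draws[:, 0]              # one estimate per replication
averaged = draws.mean(axis=1)     # average of B i.i.d. estimates

print(round(float(single.var()), 2))    # ~ sigma**2 = 4
print(round(float(averaged.var()), 3))  # ~ sigma**2 / B = 0.08
print(round(float(averaged.mean()), 2)) # expectation unchanged: ~10
```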
Minimize a general objective function:

Obj(Θ) = L(Θ) + Ω(Θ), where Θ = {w1, ..., wp} is the set of model parameters;
L(Θ): loss function, minimized to reduce bias;
Ω(Θ): regularization term, which penalizes model complexity.
Some terminology for model combinations:
- Ensembles: general name.
- Prediction/forecast combination: focusing on just the outcomes.
- Model combination for parameters: Bayesian parameter averaging.
We focus on ensembles as prediction/forecast combinations.
Ensembles.
Bagging (bootstrap aggregation; Breiman, 1996): adding randomness improves function estimation. A variance-reduction technique, reducing MSE. Let the initial data set have n observations.
1) Construct a bootstrap sample by randomly drawing n times with replacement (note: some observations are repeated).
2) Compute the sample estimator (logistic or linear regression, tree, ANN, ...; a tree in practice).
3) Redo B times, with B large (50-100 or more in practice, but there is no known optimum).
4) Bagged estimator: for classification, Breiman recommends a majority vote of the classifications for each observation; Buhlmann (2003) recommends averaging the bootstrapped probabilities. Note that an individual observation may not appear in all B samples.
NB: this yields an independent sequence of trees. What if ...?
Bagging reduces prediction error by lowering the variance of the aggregated predictor while keeping the bias almost constant (the variance/bias trade-off).
Friedman (1998) reconsidered boosting and bagging in terms of
gradient descent algorithms, seen later on.
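A minimal sketch of steps 1)-4), assuming a tiny one-dimensional regression problem and a brute-force regression stump as the base estimator (both invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(x, y):
    """Brute-force regression stump: (threshold, left mean, right mean)."""
    best = None
    for t in np.unique(x)[:-1]:          # largest value would leave the right side empty
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)

x = np.arange(20.0)
y = (x > 9.5).astype(float) + rng.normal(0, 0.3, 20)   # noisy step function

B = 25                                     # number of bootstrap replicates
stumps = []
for _ in range(B):
    idx = rng.integers(0, len(x), len(x))  # step 1: draw n times with replacement
    stumps.append(fit_stump(x[idx], y[idx]))   # step 2: sample estimator

# step 4: bagged estimator = average of the B bootstrap predictions
bagged = np.mean([predict_stump(s, x) for s in stumps], axis=0)
print(bagged.round(2))
```

Averaging the B noisy stumps smooths the single-stump predictions, which is exactly the variance-reduction effect described above.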
Ensembles
Evaluation:
Empirical studies: boosting (seen later) attains smaller misclassification rates than bagging and reduces both bias and variance, across different boosting algorithms (Breiman's arc-x4 and arc-gv). In cases with substantial noise, bagging performs better. Bagging is especially used in clinical studies.
Why does bagging work?
Breiman: bagging is successful because it reduces the instability of the prediction method. A method is unstable when small perturbations in the data produce large changes in the predictor. Experimental results show variance reduction. Studies suggest that bagging performs some smoothing on the estimates. Grandvalet (2004) argues that bootstrap sampling equalizes the effects of highly influential observations.
Disadvantage: bagged models cannot be visualized easily.
Ensembles
Adaptive bagging (Breiman, 2001): mixes bias-reducing boosting with variance-reducing bagging. Uses out-of-bag observations to halt the optimizer.
Stacking:
Previously, the same technique was used throughout. Stacking (Wolpert, 1992) combines different algorithms on a single data set; voting is then used for the final classification. Ting and Witten (1999) "stack" the probability distributions (PD) instead.
Stacking is a "meta-classifier": it combines methods.
Pros: takes the best from many methods. Cons: uninterpretable; the mixture of methods becomes a black box of predictions.
Stacking is very prevalent in WEKA.
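A toy sketch of the stacking-by-voting idea: three deliberately different "algorithms" (all invented stand-ins, not from the text) are fit to one tiny data set, and a vote gives the final class:

```python
# three different "algorithms" fit to one tiny 1-D data set
train = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]

def rule_threshold(x):                    # learner 1: fixed threshold rule
    return 1 if x > 5 else 0

def make_1nn(data):                       # learner 2: 1-nearest neighbour
    def clf(x):
        return min(data, key=lambda p: abs(p[0] - x))[1]
    return clf

def make_majority(data):                  # learner 3: majority-class baseline
    maj = 1 if 2 * sum(label for _, label in data) >= len(data) else 0
    return lambda x: maj

learners = [rule_threshold, make_1nn(train), make_majority(train)]

def stacked_vote(x):                      # final classification by voting
    return 1 if sum(clf(x) for clf in learners) >= 2 else 0

print([stacked_vote(v) for v in (2, 8)])  # -> [0, 1]
```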
5.3) Tree World.
5.3.1) L. Breiman: Bagging.
5.3.2) L. Breiman: Random Forests.
Explanation by way of football example for The Saints.
https://gormanalysis.com/random-forest-from-top-to-bottom/
Obs | Opponent | OppRk | SaintsAtHome | Expert1PredWin | Expert2PredWin | SaintsWon
1   | Falcons  | 28    | TRUE         | TRUE           | TRUE           | TRUE
2   | Cowgirls | 16    | TRUE         | TRUE           | TRUE           | TRUE
3   | Eagles   | 30    | FALSE        | FALSE          | TRUE           | TRUE
4   | Bucs     | 6     | TRUE         | FALSE          | TRUE           | FALSE
5   | Bucs     | 14    | TRUE         | FALSE          | FALSE          | FALSE
6   | Panthers | 9     | FALSE        | TRUE           | TRUE           | FALSE
7   | Panthers | 18    | FALSE        | FALSE          | FALSE          | FALSE
Goal: predict when the Saints will win. Five predictors: opponent, opponent rank, home game, and the expert1 and expert2 predictions. A single tree would make just one split, on Opponent, because the Saints lost only to the Bucs and Panthers, which gives perfect separation; but that is useless for future opponents. Instead, at each step randomly select a subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which, when combined, should form a smart model.
Three example trees (splits recovered from the slide diagram):
- Tree 1: split on Opponent in {Cowgirls, Eagles, Falcons} (left) vs. others (right).
- Tree 2: split on OppRank <= 15 (left) vs. > 15 (right); then on Expert2PredWin (F = left, T = right).
- Tree 3: split on OppRank <= 12.5 (left) vs. > 12.5 (right); then on Opponent in {Cowgirls, Eagles, Falcons} (left).
Assume the following test data and predictions:

Test data:
Obs | Opponent | OppRk | SaintsAtHome | Expert1PredWin | Expert2PredWin
1   | Falcons  | 1     | TRUE         | TRUE           | TRUE
2   | Falcons  | 32    | TRUE         | TRUE           | FALSE
3   | Falcons  | 32    | TRUE         | FALSE          | TRUE

Predictions:
Obs     | Tree1 | Tree2 | Tree3 | MajorityVote
Sample1 | FALSE | FALSE | TRUE  | FALSE
Sample2 | TRUE  | FALSE | TRUE  | TRUE
Sample3 | TRUE  | TRUE  | TRUE  | TRUE
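The MajorityVote column above is just a count of tree votes; a quick check:

```python
# each test sample's final class is the majority of the three tree votes
tree_preds = {
    "Sample1": [False, False, True],
    "Sample2": [True,  False, True],
    "Sample3": [True,  True,  True],
}
majority = {s: 2 * sum(votes) > len(votes) for s, votes in tree_preds.items()}
print(majority)  # Sample1 -> False, Sample2 -> True, Sample3 -> True
```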
Note that a probability can be ascribed by counting the number of votes for each predicted target class, yielding a good probability ranking across classes. But there is a problem: if OppRk (the second-best predictor) lands in an initial group of 3 together with Opponent, it will never be used as a splitter, because Opponent separates perfectly. Note that there are 10 ways to choose 3 features out of 5, and each predictor appears in 6 of them: Opponent dominates 60% of the trees, while OppRk appears without Opponent in just 30% of them. This effect could be mitigated by also sampling the training observations used to develop each tree, giving OppRk a higher chance to be the root (not shown).
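The subset counts quoted above follow from simple combinatorics:

```python
from itertools import combinations

feats = ["Opponent", "OppRk", "SaintsAtHome", "Expert1PredWin", "Expert2PredWin"]
subsets = list(combinations(feats, 3))

print(len(subsets))                                   # C(5,3) = 10
print(sum("Opponent" in s for s in subsets))          # 6 -> 60% of trees
print(sum("OppRk" in s and "Opponent" not in s
          for s in subsets))                          # 3 -> 30% of trees
```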
Further, assume that Expert2 gives perfect predictions when the Saints lose (but not when they win). As things stand, Expert2 is lost as a predictor; but if resampling is done with replacement, there is a higher chance of using Expert2 as a predictor, because more losses might appear in a given sample.
Summary:
Data with N rows and p predictors:
1) Determine the number of trees to grow.
2) For each tree:
   - randomly sample n <= N rows with replacement;
   - create a tree with m <= p predictors selected randomly at each non-final node.
3) Combine the different tree predictions by majority voting (classification trees) or averaging (regression trees). Note that voting can be replaced by an average of probabilities, and averaging by medians.
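A compact sketch of this summary on synthetic data: B one-split trees ("stumps", for brevity; real forests grow deeper trees and re-select features at every node), each grown on a bootstrap sample of the rows with a random subset of m = 2 of the p = 3 predictors, combined by majority vote. All names and data are invented:

```python
import random

random.seed(3)
# synthetic data: 60 rows, p = 3 numeric predictors; only x0 matters
X = [[random.random() for _ in range(3)] for _ in range(60)]
y = [1 if x[0] > 0.5 else 0 for x in X]

def fit_stump(rows, labels, feats):
    """Best single split over the allowed features (minimizes errors)."""
    best = None
    for f in feats:
        for t in set(r[f] for r in rows):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            err = (min(sum(left), len(left) - sum(left))
                   + min(sum(right), len(right) - sum(right)))
            if best is None or err < best[0]:
                best = (err, f, t,
                        round(sum(left) / len(left)),
                        round(sum(right) / len(right)))
    return best[1:]  # (feature, threshold, left class, right class)

def predict(stump, x):
    f, t, left_cls, right_cls = stump
    return left_cls if x[f] <= t else right_cls

B, m = 25, 2
forest = []
for _ in range(B):
    idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap rows
    feats = random.sample(range(3), m)                       # random feature subset
    forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))

def vote(x):  # majority vote across the forest
    return 1 if 2 * sum(predict(s, x) for s in forest) > B else 0

acc = sum(vote(x) == label for x, label in zip(X, y)) / len(X)
print(acc)
```

Even though a third of the stumps never see the informative predictor, the vote over all B trees recovers it.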
Definition of Random Forests.
Decision tree forest: an ensemble (collection) of decision trees whose predictions are combined to make the overall prediction for the forest.
Similar to a TreeBoost (gradient boosting) model in that a large number of trees are grown. However, TreeBoost generates a series of trees, with the output of one tree going into the next tree in the series. In contrast, a decision tree forest grows a number of independent trees in parallel, and they do not interact until after all of them have been built.
Disadvantage: a complex model that cannot be visualized like a single tree; more of a "black box," like a neural network. It is therefore advisable to create both a single-tree and a tree-forest model: the single-tree model can be studied for an intuitive understanding of how the predictor variables relate, while the decision tree forest can be used to score data and generate highly accurate predictions.
Random Forests
1. Take a random sample of N observations with replacement ("bagging"). On average, about 2/3 of the rows are selected; the remaining 1/3 are called "out-of-bag (OOB)" observations. A new random selection is performed for each tree constructed.
2. Using the observations selected in step 1, construct a decision tree to maximum size, without pruning. As the tree is built, allow only a subset of the total set of predictor variables to be considered as possible splitters at each node, selecting that candidate set as a random subset of the available predictors. For example, if there are ten predictors, choose five randomly as candidate splitters, and perform a new random selection for each split. Some predictors (possibly the best one) will not be considered for a given split, but a predictor excluded from one split may be used for another split in the same tree.
Random Forests
No overfitting or pruning.
"Overfitting" is a problem in large single-tree models, where the model fits noise in the data, leading to poor generalization power and the need for pruning. In nearly all cases, decision tree forests do not have a problem with overfitting, and there is no need to prune the trees in the forest. Generally, the more trees in the forest, the better the fit.
Internal measure of test-set (generalization) error.
About 1/3 of the observations are excluded from each tree in the forest ("out of bag (OOB)"). Each tree has a different set of out-of-bag observations, so each OOB set constitutes an independent test sample. To measure the generalization error of the decision tree forest, the OOB set for each tree is run through that tree and the prediction error rate is computed.
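The "about 1/3" figure comes from the bootstrap itself: the chance that a given row is never drawn in n draws with replacement is (1 - 1/n)^n, which tends to 1/e ≈ 0.368. A quick check (the choice n = 1000 is arbitrary):

```python
import random

n = 1000
p_oob = (1 - 1 / n) ** n          # P(a given row is out of bag)
print(round(p_oob, 3))            # ~0.368, i.e. about 1/3

# empirical check over repeated bootstrap samples
random.seed(0)
trials = 2000
missed = sum(1 for _ in range(trials)
             if 0 not in {random.randrange(n) for _ in range(n)})
print(round(missed / trials, 3))  # close to p_oob
```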
Detour: found on the Internet: PCA and RF.
https://stats.stackexchange.com/questions/294791/how-can-preprocessing-with-pca-but-keeping-the-same-dimensionality-improve-rando?newsletter=1&nlcode=348729%7c8657
Discovery?
“PCA before random forest can be useful not for dimensionality reduction but to give you data
a shape where random forest can perform better.
I am quite sure that in general if you transform your data with PCA keeping the same
dimensionality of the original data you will have a better classification with random forest.”
Answer:
“Random forest struggles when the decision boundary is "diagonal" in the feature space
because RF has to approximate that diagonal with lots of "rectangular" splits. To the extent that
PCA re-orients the data so that splits perpendicular to the rotated & rescaled axes align well
with the decision boundary, PCA will help. But there's no reason to believe that PCA will help in
general, because not all decision boundaries are improved when rotated (e.g. a circle). And
even if you do have a diagonal decision boundary, or a boundary that would be easier to find in
a rotated space, applying PCA will only find that rotation by coincidence, because PCA has no
knowledge at all about the classification component of the task (it is not "y-aware").
Also, the following caveat applies to all projects using PCA for supervised learning: data rotated by
PCA may have little-to-no relevance to the classification objective.”
DO NOT BELIEVE EVERYTHING THAT APPEARS IN THE WEB!!!!! BE CRITICAL!!!
Further Developments.
Paluszynska (2017) focuses on providing better information
on variable importance using RF.
RF is constantly being researched and improved.
Detour: underlying idea for boosting classification models (NOT yet GB).
(Freund and Schapire, 2012, Boosting: Foundations and Algorithms, MIT Press.)
Start with a model M(X) and obtain, say, 80% accuracy or 60% R2. Then Y = M(X) + error1. Hypothesize that the error is still correlated with Y, and model it in turn:
error1 = G(X) + error2, and in general error(t-1) = Z(X) + error(t). Thus
Y = M(X) + G(X) + ... + Z(X) + error(t-k).
If we find optimal beta weights for the combined models, then
Y = b1*M(X) + b2*G(X) + ... + bt*Z(X) + error(t-k).
Boosting is a "forward stagewise ensemble method" on a single data set, iteratively reweighting observations according to the previous error, especially focusing on wrongly classified observations.
Philosophy: focus on the points that were most difficult to classify in the previous step, by reweighting observations.
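A numeric sketch of this residual recursion, with one-variable least-squares fits standing in for the models M and G (data and fits invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 3 * x1 + 2 * x2 + rng.normal(0, 0.1, n)

def ls_fit(x, target):
    """One-variable least-squares fit; returns fitted values."""
    return (x @ target) / (x @ x) * x

M = ls_fit(x1, y)              # stage 1: Y = M(X) + error1
error1 = y - M
G = ls_fit(x2, error1)         # stage 2: error1 = G(X) + error2
error2 = error1 - G

# each stage explains part of what the previous stage left over
print(round(float(np.var(y)), 1), round(float(np.var(error1)), 1),
      round(float(np.var(error2)), 2))
```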
Main idea of GB using trees (GBDT).
Let Y be the target and X the predictors, and let f0(X) be a weak model that predicts Y by just predicting the mean of Y ("weak" to avoid overfitting).
Improve on f0(X) by creating f1(X) = f0(X) + h(X). If h were a perfect model, f1(X) = Y, so h(X) = Y - f0(X) = the residuals, which are the negative gradients of the loss (or cost) function: for squared-error loss, the gradient is -(Y - f(X)), so residual fitting is gradient descent; for absolute-error loss the gradient is just the sign, -1 or 1.
Explanation of GB by way of example.
blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
Predict age in the following data set by way of trees; a continuous target calls for a regression tree. Loss function: SSE.
PersonID | Age | LikesGardening | PlaysVideoGames | LikesHats
1        | 13  | FALSE          | TRUE            | TRUE
2        | 14  | FALSE          | TRUE            | FALSE
3        | 15  | FALSE          | TRUE            | FALSE
4        | 25  | TRUE           | TRUE            | TRUE
5        | 35  | FALSE          | TRUE            | TRUE
6        | 49  | TRUE           | FALSE           | FALSE
7        | 68  | TRUE           | TRUE            | TRUE
8        | 71  | TRUE           | FALSE           | FALSE
9        | 73  | TRUE           | FALSE           | TRUE
With only 9 observations in the data, we allow the tree to have very few observations in its final nodes. We want the Videos variable in the model because we suspect it is important; but allowing few observations in the final nodes also brought in a split on Hats, which seems irrelevant, just noise leading to overfitting, because the tree searches smaller and smaller areas of the data as it progresses.
Let us go in steps and look at the results of Tree1 stopped at the first split, where the predictions are 19.25 and 57.2, and obtain the residuals.
Tree 1:
root: split on LikesGardening; F -> 19.25, T -> 57.2.
(The full tree would continue with second splits on Hats and Videos below these nodes.)
Run another tree using the Tree1 residuals as the new target.

PersonID | Age | Tree1 Prediction | Tree1 Residual
1        | 13  | 19.25            | -6.25
2        | 14  | 19.25            | -5.25
3        | 15  | 19.25            | -4.25
4        | 25  | 57.2             | -32.2
5        | 35  | 19.25            | 15.75
6        | 49  | 57.2             | -8.2
7        | 68  | 57.2             | 10.8
8        | 71  | 57.2             | 13.8
9        | 73  | 57.2             | 15.8
Tree 2:
root: split on PlaysVideoGames; F -> 7.133, T -> -3.567.
Note: Tree2 did not use LikesHats because, between Hats and VideoGames, VideoGames is preferred when using all observations, rather than in the smaller region of the data where Hats appeared in the full Tree1. Noise is thus avoided.
Tree 1 SSE = 1994; Tree 2 SSE = 1765.

PersonID | Age | Tree1 Prediction | Tree1 Residual | Tree2 Prediction | Combined Prediction | Final Residual
1        | 13  | 19.25            | -6.25          | -3.567           | 15.68               | 2.683
2        | 14  | 19.25            | -5.25          | -3.567           | 15.68               | 1.683
3        | 15  | 19.25            | -4.25          | -3.567           | 15.68               | 0.6833
4        | 25  | 57.2             | -32.2          | -3.567           | 53.63               | 28.63
5        | 35  | 19.25            | 15.75          | -3.567           | 15.68               | -19.32
6        | 49  | 57.2             | -8.2           | 7.133            | 64.33               | 15.33
7        | 68  | 57.2             | 10.8           | -3.567           | 53.63               | -14.37
8        | 71  | 57.2             | 13.8           | 7.133            | 64.33               | -6.667
9        | 73  | 57.2             | 15.8           | 7.133            | 64.33               | -8.667

Combined prediction for PersonID 1: 15.68 = 19.25 - 3.567.
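The two stages above can be reproduced in a few lines; the numbers come straight from the age table:

```python
ages      = [13, 14, 15, 25, 35, 49, 68, 71, 73]
gardening = [0,  0,  0,  1,  0,  1,  1,  1,  1]
videos    = [1,  1,  1,  1,  1,  0,  1,  0,  0]

def node_means(values, split):
    """Mean of `values` in each branch (0/1) of a one-split tree."""
    return {s: sum(v for v, b in zip(values, split) if b == s)
               / sum(1 for b in split if b == s) for s in (0, 1)}

t1 = node_means(ages, gardening)              # Tree 1: split on LikesGardening
pred1 = [t1[g] for g in gardening]            # 19.25 (F) and 57.2 (T)
resid1 = [a - p for a, p in zip(ages, pred1)]

t2 = node_means(resid1, videos)               # Tree 2: fit the residuals
combined = [p + t2[v] for p, v in zip(pred1, videos)]

print(round(t2[0], 3), round(t2[1], 3))       # 7.133 -3.567
print(round(combined[0], 2))                  # 15.68
```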
So far:
1) Started with a 'weak' model F0(x) = mean(y).
2) Fitted a second model to the residuals: h1(x) ≈ y - F0(x).
3) Combined the two previous models: F1(x) = F0(x) + h1(x).
Notice that h1(x) could be any type of model (stacking), not just a tree. Continue recursing until stage M.
The initial weak model was the mean because it is well known that the mean minimizes SSE.
Q: how to choose M, the gradient boosting hyperparameter? Usually by cross-validation.
4) Alternative to the mean: minimize absolute error instead of SSE as the loss function. More expensive, because the minimizer is the median, which is computationally costly. In this case, in Tree 1 above, use median(y) = 35 and obtain the residuals.
PersonID | Age | F0 | Residual0
1        | 13  | 35 | -22
2        | 14  | 35 | -21
3        | 15  | 35 | -20
4        | 25  | 35 | -10
5        | 35  | 35 | 0
6        | 49  | 35 | 14
7        | 68  | 35 | 33
8        | 71  | 35 | 36
9        | 73  | 35 | 38
Focus on observations 1 and 4, with residuals of -22 and -10 respectively, to understand the median case. Under the SSE loss function (standard regression tree), a reduction in residual of 1 unit drops the SSE by 43 and 19 respectively (22² - 21² = 43; 10² - 9² = 19), while for absolute loss the reduction is just 1 and 1 (22 - 21; 10 - 9).
SSE reduction will therefore focus more on the first observation (because of the 43), while absolute error focuses on all observations equally (they are all 1).
Instead of training subsequent trees on the residuals of F0, train h0 on the gradient of the loss function L(y, F0(x)) with respect to the predictions of F0(x). With absolute-error loss, subsequent h trees will consider only the sign of each residual, as opposed to SSE loss, which considers its magnitude.
Gradient of SSE: d/dŶ [(1/2)(Y - Ŷ)²] = -(Y - Ŷ), which is "minus the residual"; hence this is a gradient descent algorithm. For absolute error:
AE = |Y - Ŷ|; gradient of AE = dAE/dŶ = 1 if Ŷ > Y and -1 if Ŷ < Y, i.e., sign(Ŷ - Y).
Each h tree groups observations into final nodes, and the average gradient can be calculated in each node and scaled by a factor γm such that Fm + γm·hm minimizes the loss function in each node.
Shrinkage: at each gradient step, the magnitude is multiplied by a factor between 0 and 1 called the learning rate. Each gradient step is thus shrunken, allowing slow convergence toward the observed values; observations close to their target values end up grouped into larger nodes, which regularizes the method.
Finally, before each new tree step, row and column sampling occur to produce more diverse tree splits (similar to Random Forests).
Results for SSE and absolute error: SSE case.

h0: root split on Gardening; F -> -21.08, T -> 16.87.
h1: root split on Videos; F -> 7.133, T -> -3.567.

Age | F0    | PseudoResidual0 | h0     | gamma0 | F1    | PseudoResidual1 | h1     | gamma1 | F2
13  | 40.33 | -27.33          | -21.08 | 1      | 19.25 | -6.25           | -3.567 | 1      | 15.68
14  | 40.33 | -26.33          | -21.08 | 1      | 19.25 | -5.25           | -3.567 | 1      | 15.68
15  | 40.33 | -25.33          | -21.08 | 1      | 19.25 | -4.25           | -3.567 | 1      | 15.68
25  | 40.33 | -15.33          | 16.87  | 1      | 57.2  | -32.2           | -3.567 | 1      | 53.63
35  | 40.33 | -5.333          | -21.08 | 1      | 19.25 | 15.75           | -3.567 | 1      | 15.68
49  | 40.33 | 8.667           | 16.87  | 1      | 57.2  | -8.2            | 7.133  | 1      | 64.33
68  | 40.33 | 27.67           | 16.87  | 1      | 57.2  | 10.8            | -3.567 | 1      | 53.63
71  | 40.33 | 30.67           | 16.87  | 1      | 57.2  | 13.8            | 7.133  | 1      | 64.33
73  | 40.33 | 32.67           | 16.87  | 1      | 57.2  | 15.8            | 7.133  | 1      | 64.33

E.g., for the first observation: 40.33 is the mean age; -27.33 = 13 - 40.33; -21.08 is the h0 prediction (Gardening = F). F1 = 19.25 = 40.33 - 21.08. PseudoResidual1 = 13 - 19.25. F2 = 19.25 - 3.567 = 15.68. Gamma0 = avg(PseudoResidual0 / h0) within each set of h0 values (= 1 here); same for gamma1.
Results for SSE and absolute error: absolute error case.

h0: root split on Gardening; F -> -1, T -> 0.6.
h1: root split on Videos; F -> 0.333, T -> -0.333.

Age | F0 | PseudoResidual0 | h0   | gamma0 | F1   | PseudoResidual1 | h1      | gamma1 | F2
13  | 35 | -1              | -1   | 20.5   | 14.5 | -1              | -0.3333 | 0.75   | 14.25
14  | 35 | -1              | -1   | 20.5   | 14.5 | -1              | -0.3333 | 0.75   | 14.25
15  | 35 | -1              | -1   | 20.5   | 14.5 | 1               | -0.3333 | 0.75   | 14.25
25  | 35 | -1              | 0.6  | 55     | 68   | -1              | -0.3333 | 0.75   | 67.75
35  | 35 | -1              | -1   | 20.5   | 14.5 | 1               | -0.3333 | 0.75   | 14.25
49  | 35 | 1               | 0.6  | 55     | 68   | -1              | 0.3333  | 9      | 71
68  | 35 | 1               | 0.6  | 55     | 68   | -1              | -0.3333 | 0.75   | 67.75
71  | 35 | 1               | 0.6  | 55     | 68   | 1               | 0.3333  | 9      | 71
73  | 35 | 1               | 0.6  | 55     | 68   | 1               | 0.3333  | 9      | 71

E.g., for the 1st observation: 35 is the median age; the pseudo-residual is -1 or 1 as the raw residual is negative or positive.
F1 = 14.5 because 35 + 20.5 * (-1).
F2 = 14.25 = 14.5 + 0.75 * (-0.3333).
Predictions within leaf nodes are computed by the "mean" of the observations therein.
Gamma0 = median((age - F0) / h0) by the different values of h0: e.g., 20.5 = avg((14 - 35)/(-1), (15 - 35)/(-1)); 55 = (68 - 35) / 0.6.
Gamma1 = median((age - F1) / h1), again by the different values of h1.
Quick description of GB using trees (GBDT).
1) Create a very small tree as the initial model, a 'weak' learner (e.g., a tree with two terminal nodes, depth = 1). 'Weak' avoids overfitting and local minima. It produces a prediction, F1, for each observation. Tree1.
2) Each tree allocates a probability of the event, or a mean value, in each terminal node, according to the nature of the dependent variable or target.
3) Compute "residuals" (prediction error) for every observation (for a 0-1 target, apply the logistic transformation p/(1 - p) to linearize them).
4) Use the residuals as the new target variable and grow a second small tree on them (second stage of the process, same depth). To insure against overfitting, use a random sample without replacement ("stochastic gradient boosting"). Tree2.
5) Once the second stage is complete, the new model is the concatenation of the two trees, Tree1 and Tree2, with predictions F1 + F2 * gamma, where gamma is a multiplier or shrinkage factor (called the step size in gradient descent).
6) Iterate the procedure: compute residuals from the most recent tree, which become the target of the next model, etc.
7) In the case of a binary target variable, each tree produces at least some nodes in which the 'event' is the majority ('events' are typically more difficult to identify, since most data sets contain a very low proportion of 'events').
8) The final score for each observation is obtained by summing (with weights) the different scores (probabilities) of every tree for that observation.
Why does it work? Why "gradient" and "boosting"?
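Steps 1)-6) above can be run on the age data with a shrinkage factor: each stage fits the current residuals with a one-split tree and adds only gamma times its prediction. (The split sequence is fixed here for brevity; a real implementation searches for the best split at every stage.)

```python
ages   = [13.0, 14, 15, 25, 35, 49, 68, 71, 73]
splits = {"gardening": [0, 0, 0, 1, 0, 1, 1, 1, 1],
          "videos":    [1, 1, 1, 1, 1, 0, 1, 0, 0]}

gamma = 0.5                                  # shrinkage / step size
F = [sum(ages) / len(ages)] * len(ages)      # stage 0: the mean of Y

sse_path = []
for m in range(6):
    feat = splits["gardening"] if m % 2 == 0 else splits["videos"]
    resid = [a - f for a, f in zip(ages, F)]           # residuals = new target
    node_mean = {v: (sum(r for r, x in zip(resid, feat) if x == v)
                     / sum(1 for x in feat if x == v)) for v in (0, 1)}
    F = [f + gamma * node_mean[x] for f, x in zip(F, feat)]  # shrunken update
    sse_path.append(round(sum((a - f) ** 2 for a, f in zip(ages, F)), 1))

print(sse_path)  # SSE shrinks as the shrunken stages accumulate
```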
Comparing GBDT vs. trees on point 4 above (I).
GBDT takes a sample from the training data to create a tree at each iteration; CART does not. Below, notice the differences between GBDT with a 60% sample proportion and a generic tree with no sampling, for the fraud data set; Total_spend is the target. The predictions are similar.
IF doctor_visits < 8.5 THEN DO; /* GBDT */
_prediction_ + -1208.458663;
END;
ELSE DO;
_prediction_ + 1360.7910083;
END;
IF 8.5 <= doctor_visits THEN DO; /* GENERIC TREES*/
P_pseudo_res0 = 1378.74081896893;
END;
ELSE DO;
P_pseudo_res0 = -1290.94575707227;
END;
Comparing GBDT vs. trees on point 4 above (II).
Again, GBDT takes a sample from the training data to create a tree at each iteration; CART does not. If we allow CART to work with the same sample proportion but a different seed, the splitting variables may differ at a specific depth of tree creation.
/* GBDT */
IF doctor_visits < 8.5 THEN DO;
   _ARB_F_ + -579.8214325;
END;
ELSE DO;
   _ARB_F_ + 701.49142697;
END;

/* ORIGINAL TREES */
IF 183.5 <= member_duration THEN DO;
   P_pseudo_res0 = 1677.87318718526;
END;
ELSE DO;
   P_pseudo_res0 = -1165.32773940565;
END;

EDA of the two samples would indicate subtle differences that induce differences in the selected splitting variables.
More details
Friedman's general 2001 GB algorithm:
1) Data (Y, X), Y (N x 1), X (N x p).
2) Choose the number of iterations M.
3) Choose the loss function L(Y, F(X)) and its corresponding gradient; e.g., for the 0-1 loss function the residuals are the corresponding gradient. The function to estimate is called 'f'. The loss function is implied by the nature of Y.
4) Choose the base learner h(X, θ), say shallow trees.
Algorithm:
1: initialize f0 with a constant, usually the mean of Y.
2: for t = 1 to M do
3: compute the negative gradient gt(x), i.e., the residual from Y, as the next target.
4: fit a new base-learner function h(x, θt), i.e., a tree.
5: find the best gradient-descent step size γt (0 < γt ≤ 1) by minimizing the loss:
   γt = argmin over γ of Σ_{i=1..n} L(yi, f_{t-1}(xi) + γ·ht(xi))
6: update the function estimate:
   ft = f_{t-1}(x) + γt·ht(x, θt)
end for
(all f functions are function estimates, i.e., 'hats').
Specifics of tree gradient boosting, called TreeBoost (Friedman).
Friedman's 2001 GB algorithm for tree methods: same as the previous one, with

   ht(x) = Σ_{j=1..J} p_jt I(x ∈ N_jt),

where p_jt is the prediction of tree t in terminal node N_jt. In TreeBoost, Friedman proposes finding an optimal γ_jt in each final node instead of a unique γt at every iteration. Then

   ft(x) = f_{t-1}(x) + Σ_{j=1..J} γ_jt ht(x) I(x ∈ N_jt),

   γ_jt = argmin over γ of Σ_{xi ∈ N_jt} L(yi, f_{t-1}(xi) + γ·ht(xi)).
Parallels with stepwise (regression) methods.
Stepwise starts from the original Y and X and in later iterations turns to residuals and a reduced, orthogonalized X matrix, in which 'entered' predictors are no longer used and are orthogonalized away from the remaining predictors.
GBDT uses residuals as targets but does not orthogonalize or drop any predictors.
Stepwise stops either by statistical inference or by AIC/BIC search; GBDT runs for a fixed number of iterations.
Stepwise has no 'gamma' (shrinkage factor).
Setting.
Hypothesize the existence of a function Y = f(X, betas, error). Change of paradigm: no MLE (e.g., logistic, regression, etc.) but a loss function. Minimize the loss function itself; its expected value is called the risk. Many different loss functions are available: Gaussian, 0-1, etc.
A loss function describes the loss (or cost) associated with all possible decisions. Different decision or predictor functions tend to lead to different types of mistakes; the loss function tells us which types of mistakes we should be more concerned about. For instance, when estimating demand, the decision function could be a linear equation and the loss function squared or absolute error.
The best decision function is the one yielding the lowest expected loss, and the expected loss is itself called the risk of the estimator. The 0-1 loss assigns 0 to a correct prediction and 1 to an incorrect one.
Key details.
Friedman's 2001 GB algorithm needs:
1) A loss function, usually determined by the nature of Y (binary, continuous, ...) (no MLE).
2) A weak learner, typically a tree stump or spline, a marginally better classifier than random (but by how much?).
3) A model with T iterations:

   ŷi = Σ_{t=1..T} tree_t(Xi)

   Objective function: Σ_{i=1..n} L(yi, ŷi) + Σ_{t=1..T} Ω(Tree_t)

   Ω penalizes complexity: the number of nodes in each tree, the L2 or L1 norm of the leaf weights, or other measures. Ω is not directly optimized by GB.
The L2 error penalizes symmetrically away from 0; Huber penalizes less than OLS outside [-1, 1]; the Bernoulli and AdaBoost losses are very similar. Note that Y ∈ {-1, 1} in the 0-1 case here.
Gradient Descent.
"Gradient" descent is a method to find the minimum of a function.
Gradient: the multivariate generalization of the derivative of a function in one dimension; i.e., the gradient is the vector of partial derivatives. In one dimension, the gradient is the tangent to the function.
It is easier to work with convex and "smooth" functions.
(Figure: a convex vs. a non-convex function.)
Gradient Descent.
Let L(x1, x2) = 0.5*(x1 - 15)² + 0.5*(x2 - 25)², and solve for the (x1, x2) that minimize L by gradient descent.
Steps:
Take M = 100, starting point s0 = (0, 0), step size = 0.1.
Iterate m = 1 to M:
1. Calculate the gradient of L at s_{m-1}.
2. Step in the direction of greatest descent (the negative gradient) with step size γ.
If γ is small and M large, sM minimizes L.
Additional considerations:
- Instead of a fixed M iterations, stop when the next improvement is small.
- Use line search to choose step sizes (line search looks along the descent direction for the minimizing step).
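The example above, run literally (100 iterations, step 0.1, starting at (0, 0)):

```python
# minimize L(x1, x2) = 0.5*(x1 - 15)**2 + 0.5*(x2 - 25)**2
M, step = 100, 0.1
x1, x2 = 0.0, 0.0                    # starting point s0
for _ in range(M):
    g1, g2 = x1 - 15, x2 - 25        # gradient of L at (x1, x2)
    x1 -= step * g1                  # step along the negative gradient
    x2 -= step * g2
print(round(x1, 3), round(x2, 3))    # converges toward (15, 25)
```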
How does it work in gradient boosting?
The objective is to minimize L, starting from F0(x). For m = 1, compute the gradient of L w.r.t. F0(x). Then fit a weak learner to the gradient components; for a regression tree, obtain the average gradient in each final node. In each node, step in the direction of the average gradient, using line search to determine the step magnitude. The outcome is F1; repeat. In symbols:
Initialize the model with a constant: F0(x) = mean, median, etc.
For m = 1 to M:
- compute the pseudo-residuals;
- fit a base learner h to the residuals;
- compute the step magnitude γm (for trees, a different γ for each node);
- update Fm(x) = Fm-1(x) + γm·hm(x).
"Gradient" descent
The method of gradient descent is a first-order optimization algorithm based on taking small steps in the direction of the negative gradient at the current point, in order to find the (hopefully global) minimum of the loss function. If the maximum is sought instead, the positive gradient is used and the method is called gradient ascent.
Second-order information is not used, so the solution could be a local minimum. The method requires a starting point, and possibly many starting points to avoid local minima.
Comparing a full tree (depth = 6) to boosted-tree residuals by iteration.
Two GB versions: 1) with the raw 20% events (M1); 2) with a 50/50 mixture of events (M2). The non-GB tree (referred to as "maxdepth 6," run on the M1 data set) is the most biased. Notice that M2 stabilizes earlier than M1. X axis: iteration number; Y axis: average residual. "Tree depth 6" is obviously unaffected by iteration, since it is a single-tree run.
(Figure: average residuals by iteration by model name in gradient boosting; all means are ~0, on the order of 1E-15, with tree depth 6 at about 2.83E-15. The vertical line marks where the mean stabilizes.)
Comparing a full tree (depth = 6) to boosted-tree residuals by iteration.
Now Y = variance of the residuals. M2 has the highest variance (~0.178), followed by depth 6 (single tree, ~0.146) and then M1 (~0.122). M2 stabilizes earlier as well. In conclusion, M2 has lower bias and higher variance than M1 in this example; the difference lies in the mixture of 0-1 in the target variable.
(Figure: variance of residuals by iteration in gradient boosting; the vertical line marks where the variance stabilizes.)
Important Message
Basic information on the original data sets:
  Data set name ............... train     (# TRN obs: 3595)
  Validation data set ......... validata  (# VAL obs: 2365)
  Test data set ............... (none)    (# TST obs: 0)
  Dep variable ................ fraud
  Pct event prior TRN ......... 20.389
  Pct event prior VAL ......... 19.281
  Pct event prior TEST ........ (n/a)
TRN and VAL data sets were obtained by random sampling without replacement.
Variable        | Label
FRAUD           | Fraudulent activity yes/no
total_spend     | Total spent on opticals
doctor_visits   | Total visits to a doctor
no_claims       | Number of claims made recently
member_duration | Membership duration
optom_presc     | Number of opticals claimed
num_members     | Number of members covered
Fraud data set, original 20% fraudsters.
Study alternatives, changing the number of iterations from 3 to 50 and the depth from 1 to 10, with training and validation data sets. The original percentage of fraudsters is 20% in both data sets.
Notice there are just 5 predictors, so a maximum of 50 iterations is an exaggeration. In usual large databases, the number of iterations could reach 1000 or higher.
E.g., M5_VAL_GRAD_BOOSTING: the M5 case with the validation data set, using gradient boosting as the modeling technique; model #10 as identifier.

Requested models: names & descriptions.

Overall models:
M1: raw 20 pct, depth 1, iterations 3
M2: raw 20 pct, depth 1, iterations 10
M3: raw 20 pct, depth 5, iterations 3
M4: raw 20 pct, depth 5, iterations 10
M5: raw 20 pct, depth 10, iterations 50

Model # | Full model name         | Description
1       | 01_M1_TRN_GRAD_BOOSTING | Gradient Boosting
2       | 02_M1_VAL_GRAD_BOOSTING | Gradient Boosting
3       | 03_M2_TRN_GRAD_BOOSTING | Gradient Boosting
4       | 04_M2_VAL_GRAD_BOOSTING | Gradient Boosting
5       | 05_M3_TRN_GRAD_BOOSTING | Gradient Boosting
6       | 06_M3_VAL_GRAD_BOOSTING | Gradient Boosting
7       | 07_M4_TRN_GRAD_BOOSTING | Gradient Boosting
8       | 08_M4_VAL_GRAD_BOOSTING | Gradient Boosting
9       | 09_M5_TRN_GRAD_BOOSTING | Gradient Boosting
10      | 10_M5_VAL_GRAD_BOOSTING | Gradient Boosting
All models agree on no_claims as the first split.
Disagreement appears at depth 2: M2 does not use member_duration.
Overall conclusion for GB parameters.
While higher values of the number of iterations and of the depth imply longer (and possibly significantly longer) computer runs, constraining these parameters can have significant negative effects on model results.
In the context of thousands of predictors, computer resource availability might significantly affect model results.
Overall Ensembles.
Given a specific classification study and many different modeling techniques, create a logistic regression model with the original target variable and the predictions from the different models as inputs, without variable selection (this is not critical).
Evaluate the importance of the different models either via p-values or partial dependency plots.
Note: this is not stacking, because stacking "votes" to decide on the final classification.
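The recipe above can be sketched as follows (assumptions: synthetic data, and sklearn classifiers standing in for the slide's trees/bagging/RF/GB/logistic runs): fit the component models, then regress the original target on their predicted probabilities with a logistic model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fraud data.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

components = [
    GradientBoostingClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]
# One column per component model: its predicted probability of the event.
P = np.column_stack([m.fit(X, y).predict_proba(X)[:, 1] for m in components])

# Logistic regression of the original target on the component predictions,
# without variable selection; its coefficients (or p-values) gauge each
# component model's contribution to the ensemble.
ensemble = LogisticRegression().fit(P, y)
```

In practice one would use out-of-fold or validation predictions as the columns of P to avoid rewarding the most overfit component.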
Partial Dependency Plots (PDP).
Due to GB's (and other methods') black-box nature, these plots show the effect of a predictor X on the modeled response once all other predictors have been marginalized (integrated away); marginalized predictors are usually fixed at a constant value, typically the mean.
The graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome.
Formally, the PDP of F(x1, x2, ..., xp) on X is E(F) over all variables except X. Thus, for given values of X, the PDP is the average of the training predictions with X kept constant.
Since GB, boosting, bagging, etc. are BLACK-BOX models, use PDPs to obtain model interpretation. They are also useful for logistic models.
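The average-over-training-data version of the PDP just described can be hand-rolled in a few lines (assumptions: synthetic data and an sklearn GB model as the black box): for each grid value of the chosen predictor, clamp that column to the value for all rows and average the model's predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data and a black-box model.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average prediction with column `feature` clamped to each grid value."""
    pd_values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v  # hold the predictor fixed; others marginalized
        pd_values.append(model.predict_proba(Xv)[:, 1].mean())
    return np.array(pd_values)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pdp = partial_dependence(model, X, 0, grid)  # PDP of feature 0
```

Plotting `pdp` against `grid` gives the curve shown in the slides; sklearn's `sklearn.inspection.partial_dependence` computes the same quantity.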
Analytical problem to investigate.
Optical health care insurance fraud. Longer care typically involves higher treatment costs, and the insurance company has to set up reserves as soon as a case is opened. Sometimes doctors are involved in the fraud.
Aim: predict fraudulent charges (a classification problem). We use a battery of models and compare them, with and without a 50/50 resampling of the original training sample. Below left, original data (M1 models); right, 50/50 training (M2 models).
Basic information on the original data sets:
  Data set name ................ train
  Num_observations ............. 3595
  Validation data set .......... validata
  Num_observations ............. 2365
  Test data set ................
  Num_observations ............. 0
  Dep variable ................. fraud
  Pct Event Prior TRN .......... 20.389
  Pct Event Prior VAL .......... 19.281
  Pct Event Prior TEST .........
Basic information on the 50/50 data sets:
  Data set name ................ sampled50_50
  Num_observations ............. 1133
  Validation data set .......... validata50_50
  Num_observations ............. 4827
  Test data set ................
  Num_observations ............. 0
  Dep variable ................. fraud
  Pct Event Prior TRN .......... 50.838
  Pct Event Prior VAL .......... 12.699
  Pct Event Prior TEST .........
Requested Models: Names & Descriptions.

Overall Models:
  M1  Raw 20pct
  M2  50/50 prior for TRN

Model #  Full Model Name                   Model Description
  1      01_M1ENSEMBLE_TRN_LOGISTIC_NONE   Logistic TRN NONE
  2      02_M1ENSEMBLE_VAL_LOGISTIC_NONE   Logistic VAL NONE
  3      03_M1_TRN_BAGGING                 Bagging TRN Bagging
  4      04_M1_TRN_GRAD_BOOSTING           Gradient Boosting
  5      05_M1_TRN_LOGISTIC_STEPWISE       Logistic TRN STEPWISE
  6      06_M1_TRN_RFORESTS                Random Forests
  7      07_M1_TRN_TREES                   Trees TRN Trees
  8      08_M1_VAL_BAGGING                 Bagging VAL Bagging
  9      09_M1_VAL_GRAD_BOOSTING           Gradient Boosting
 10      10_M1_VAL_LOGISTIC_STEPWISE       Logistic VAL STEPWISE
 11      11_M1_VAL_RFORESTS                Random Forests
 12      12_M1_VAL_TREES                   Trees VAL Trees
 13      13_M2ENSEMBLE_TRN_LOGISTIC_NONE   Logistic TRN NONE
 14      14_M2ENSEMBLE_VAL_LOGISTIC_NONE   Logistic VAL NONE
 15      15_M2_TRN_BAGGING                 Bagging TRN Bagging
 16      16_M2_TRN_GRAD_BOOSTING           Gradient Boosting
 17      17_M2_TRN_LOGISTIC_STEPWISE       Logistic TRN STEPWISE
 18      18_M2_TRN_RFORESTS                Random Forests
 19      19_M2_TRN_TREES                   Trees TRN Trees
 20      20_M2_VAL_BAGGING                 Bagging VAL Bagging
 21      21_M2_VAL_GRAD_BOOSTING           Gradient Boosting
 22      22_M2_VAL_LOGISTIC_STEPWISE       Logistic VAL STEPWISE
 23      23_M2_VAL_RFORESTS                Random Forests
 24      24_M2_VAL_TREES                   Trees VAL Trees

E.g., 08_M1_VAL_BAGGING: 8th model, M1 data set case, validation data, using
bagging as the modeling technique.
For models other than the trees themselves, we modeled the posterior probabilities via an interval-valued target variable. For simplicity, just the first 4 levels of the trees are shown.
Notation: M5_GB_TRN_TREES: model M5, tree simulation of the gradient boosting run. BG: bagging, RF: random forests, LG: logistic.
Intention: obtain a general idea of the tree representation for comparison to a standard tree model.
Next page: small detail for BG (bagging), GB (gradient boosting) and the trees themselves (notice the level difference). Later, a graphical comparison of variables and splits at each tree level.
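This tree representation amounts to a surrogate tree, which can be sketched as follows (assumptions: synthetic data, and an sklearn random forest standing in for the slide's BG/RF/GB runs): fit the black-box model, then fit a shallow regression tree to its posterior probabilities so the ensemble can be read as a single tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the fraud data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

black_box = RandomForestClassifier(random_state=0).fit(X, y)
posterior = black_box.predict_proba(X)[:, 1]  # interval-valued target

# Depth-4 surrogate tree, matching the "first 4 levels" shown in the slides.
surrogate = DecisionTreeRegressor(max_depth=4, random_state=0)
surrogate.fit(X, posterior)
```

The surrogate's splits and leaf values can then be compared level by level against a standard tree fit directly on the binary target.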
Tree representations: by level and by node.
Tree representation comparisons, level 1.
Except for RF (03), all methods split at no_claims 0.5 but attain different event probabilities.
RF (M2 07) splits uniquely on Optom_presc. Notice that the split values for member_duration and
no_claims are not necessarily the same across models.
Etc. Next: how do variables behave in each model (omitting LG)?
Tree representations: by level and by split variable.
RF is alone in selecting total_spend. Notice prob < 0.4 in both nodes, compared to ~0.78 above for M2_TRN_TREES. In later levels, RF remains relatively apart from the other models.
Conclusion on tree representations.
No_claims at 0.5 is certainly the top splitter, but notice that the event probabilities diverge (because RF, GB and BG model a posterior probability, not a binary event, and thus carry information from a previous model). Later splits diverge in predictors and split values.
It is important to view each tree model independently to gauge interpretability, and to remember that the dependent variable in the models other than trees is the probability of the event that resulted from BG, RF or GB.
It is also important to view these findings in terms of variable importance.
Importance measures for tree-based methods.
Tree-based methods do not reach the top probability of 1.
Not over-fitted. Some strong over-fit.
The degree of over-fit differs from that in the classification rates (previous slide).
Note that the TRN and VAL ranks do not match; lower VAL-ranked models tend to overfit more.
M2-Ensemble has best average Validation ranking, Random Forests worst.
The two ensembles and the two gradient boosting models are the best performers.
A very interesting, almost U-shaped relationship, conditioned on the other variables in the model.
While all tree models choose no_claims as most important, the 50/50 trees (M2_TREES) selected just no_claims, while M1_TREES selected 3 additional predictors. BG, RF and GB are not similarly affected.
The M2 tree grows smaller trees and lowers misclassification from 0.5 to about 0.27; M1 from 0.2 to about 0.15.
The M1 tree achieves a wider range of posterior probabilities.
Conclusion on 50/50 resampling.
In this example, the 50/50 resampled models yielded a smaller tree with worse performance than its raw counterpart.
Actual performance (for the best models) was not affected by 50/50 versus raw modeling.
XGBoost
Developed by Chen and Guestrin (2016): XGBoost: A Scalable Tree Boosting System.
Claims: faster and better than neural networks and random forests.
Uses 2nd-order gradients of the loss function, based on a Taylor expansion, plugged into the same boosting algorithm for greater generalization. In addition, it transforms the loss function into a more sophisticated objective function containing regularization terms that penalize tree growth, with the penalty proportional to the size of the node weights, thus preventing overfitting.
More efficient than GB due to parallel computing on a single computer (about 10 times faster). The algorithm takes advantage of a decomposition of the objective function that allows it to outperform GB.
Not yet available in SAS. Available in R, Julia, Python, CLI.
Used in many champion models in recent competitions (Kaggle, etc.).
See also Foster's (2017) XGboostExplainer.
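The second-order machinery can be sketched without the library itself (an illustration of the idea, not xgboost's actual implementation): for the logistic loss, each observation contributes a gradient g_i and a hessian h_i from the Taylor expansion, and with an L2 penalty lambda on leaf weights the optimal weight of a leaf is w* = -G / (H + lambda), where G and H sum the g_i and h_i of the observations in that leaf.

```python
import numpy as np

def logistic_grad_hess(y, raw_score):
    """First and second derivatives of the log loss w.r.t. the raw margin."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    grad = p - y          # dL/df
    hess = p * (1.0 - p)  # d2L/df2
    return grad, hess

def leaf_weight(grad, hess, lam=1.0):
    """Regularized optimal leaf value: w* = -G / (H + lambda)."""
    G, H = grad.sum(), hess.sum()
    return -G / (H + lam)

y = np.array([1, 1, 0, 1, 0], dtype=float)
raw = np.zeros(5)                # start from raw score 0, i.e. p = 0.5
g, h = logistic_grad_hess(y, raw)
w = leaf_weight(g, h)            # single-leaf "tree": the base update
```

Splits are then scored by the gain in this regularized objective, which is how the penalty on node weights curbs tree growth.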
Comments on GB.
1) It is not immediately apparent what the weak classifier is for GB (e.g., by varying depth in our case). Likewise, the number of iterations is a big issue. In our simple example, the M6 GB was the best performer. Still, overall modeling benefited from ensembling all methods, as measured by AUROC, cumulative lift, or the ensemble p-values.
2) The posterior probability ranges are vastly different across methods, so classifying observations by the 0.5 threshold is too simplistic.
3) The PDPs show that different methods find distinct multivariate structures. Interestingly, the ensemble p-values show a decreasing tendency for logistic and trees, and a strong S-shaped tendency for the M6 GB, which could mean that the M6 GB alone tends to overshoot its predictions.
4) GB is relatively unaffected by the 50/50 mixture.
Comments on GB (cont.).
5) While in GB classification problems predictions lie within [0, 1], in continuous-target problems predictions can fall beyond the range of the target variable.
This is because GB models the residual at each iteration, not the original target; this can lead to surprises, such as negative predictions when Y takes only non-negative values, contrary to the original tree algorithm.
6) The shrinkage parameter and early stopping (# of trees) act as regularizers, but their combined effect is not well known and could be ineffective.
7) If shrinkage is too small and a large number of trees T is allowed, the model is large and expensive to compute, implement and understand.
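The interaction in point 6 can be explored directly (assumptions: synthetic data and sklearn's GB as a stand-in): fix a small shrinkage, track the validation loss after each iteration, and "early stop" at the minimum.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split into training and validation sets.
X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

# Small shrinkage (learning_rate) combined with a generous tree budget.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                max_depth=2, random_state=0)
gb.fit(X_trn, y_trn)

# Validation deviance after each iteration; stop where it is lowest.
val_loss = [log_loss(y_val, p) for p in gb.staged_predict_proba(X_val)]
best_iter = int(np.argmin(val_loss)) + 1
```

Rerunning this over a grid of learning rates shows the trade-off: smaller shrinkage typically pushes `best_iter` higher, i.e. a larger, more expensive model.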
Drawbacks of GB.
1) IT IS NOT MAGIC; it won't solve ALL modeling needs, though it is among the best off-the-shelf tools. You still need to look for transformations, odd issues, missing values, etc.
2) As with all tree methods, categorical variables with many levels can make it impossible to obtain a model, e.g., zip codes.
3) Memory requirements can be very large, especially with many iterations; a typical problem of ensemble methods.
4) A large number of iterations slows the speed of obtaining predictions; on-line scoring may require a trade-off between complexity and the time available. Once GB is learned, parallelization certainly helps.
5) No simple algorithm to capture interactions, because of the base learners.
6) No simple rules to determine the shrinkage gamma, the number of iterations, or the depth of the simple learner. One needs to try different combinations and possibly recalibrate over time.
7) Still, it is one of the most powerful methods available.
Un-reviewed:
CatBoost
DeepForest
gcForest
Use of tree methods for continuous target variables.
…
2.11) References
Auslender, L. (1998): Alacart: Poor Man's Classification Trees, NESUG.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984): Classification and Regression Trees, Wadsworth.
Chen, T., Guestrin, C. (2016): XGBoost: A Scalable Tree Boosting System.
Chipman, H., George, E., McCulloch, R.: BART: Bayesian Additive Regression Trees, The Annals of Applied Statistics.
Foster, D. (2017): New R package that makes XGBoost interpretable, https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
Friedman, J. (2001): Greedy function approximation: a gradient boosting machine, Annals of Statistics, 29, 1189-1232. doi:10.1214/aos/1013203451
Paluszynska, A. (2017): Structural mining and knowledge extraction from random forest with applications to The Cancer Genome Atlas project (https://www.google.com/url?q=https%3A%2F%2Frawgit.com%2FgeneticsMiNIng%2FBlackBoxOpener%2Fmaster%2FrandomForestExplainer_Master_thesis.pdf&sa=D&sntz=1&usg=AFQjCNHTJONZK24LioDeOB0KZnwLkn98fw and https://mi2datalab.github.io/randomForestExplainer/)
Quinlan, J. Ross (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
Earlier literature on combining methods:
Winkler, R.L. and Makridakis, S. (1983): The combination of forecasts. J. R. Statist. Soc. A, 146(2), 150-157.
Makridakis, S. and Winkler, R.L. (1983): Averages of forecasts: some empirical results. Management Science, 29(9), 987-996.
Bates, J.M. and Granger, C.W. (1969): The combination of forecasts. OR, 451-468.
1) Can you explain in nontechnical language the idea of maximum likelihood estimation? Of SVM?
2) Contrast GB with RF.
3) In what way is over-fitting like a glove? Like an umbrella?
4) Would ensemble models always improve on individual models?
5) Would you select variables by way of tree methods for use in linear methods later on? Yes? No? Why?
6) In tree regression, final predictions are means. Could better predictions be obtained by a regression model instead? By a logistic for a binary target? Discuss.
7) There are 9 coins, 8 of which are of equal weight, and there is one scale. How many steps until you identify the odd coin?
8) Why are manhole covers round?
9) You obtain 100% accuracy in validation of a classification model. Are you a genius? Yes, no, why?
10) If 85% of witnesses saw a blue car during the accident, and 15% saw a red car, what is the probability that the car is blue?
Counter-interview questions (you ask the interviewer).
1) How do you measure the height of a building with just a barometer? Give at least three answers.
2) Two players A and B take turns saying a positive integer from 1 to 9. The numbers are added up; whoever reaches 100 or above loses. Is there a strategy to never lose? (Aborting the game midway is acceptable, but give your reasoning.)
3) There are two jugs, one that holds 5 gallons, the other 3, and a nearby water fountain. How do you put exactly 4 gallons (less than one ounce of deviation is fine) in the 5-gallon jug?