Slide 2
Studies
2.8.c: Comparison of methods, focusing on whether raw vs. 50/50 re-sampling makes a difference.
Slide 4
Aim: study the performance of fraud models, whose original data contain 20% fraud events, by altering the percentage of events.
Three studies:
M1: 5% events
M2: 20% events (original)
M3: 50% events
The validation data set is a random sample from the original 20% data set and is the same for all three studies.
Battery of models as in the previous study, with similar graphs for evaluation.
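The slides do not show how the three training sets were built; the following is a minimal sketch of one way to do it in Python/pandas, assuming a DataFrame claims with a 0/1 fraud column (the function and variable names are hypothetical, not the author's code).

import pandas as pd

def resample_to_rate(df, target="fraud", event_rate=0.50, random_state=1):
    # Undersample one class so that events make up `event_rate` of the result.
    events = df[df[target] == 1]
    nonevents = df[df[target] == 0]
    n_events_needed = int(len(nonevents) * event_rate / (1 - event_rate))
    if n_events_needed <= len(events):
        events = events.sample(n=n_events_needed, random_state=random_state)
    else:
        n_nonevents_needed = int(len(events) * (1 - event_rate) / event_rate)
        nonevents = nonevents.sample(n=n_nonevents_needed, random_state=random_state)
    return pd.concat([events, nonevents]).sample(frac=1, random_state=random_state)

# Hypothetical usage: one common validation set plus the three training sets.
# claims = pd.read_csv("claims.csv")                     # original data, ~20% fraud
# validata = claims.sample(frac=0.4, random_state=1)     # same VAL set for all studies
# train = claims.drop(validata.index)                    # M2: raw, ~20% events
# train05 = resample_to_rate(train, event_rate=0.05)     # M1: 5% events
# train50 = resample_to_rate(train, event_rate=0.50)     # M3: 50% events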
Slide 5
Model Name   Item            Information
M1           TRN data set    train05
             TRN num obs     954
             VAL data set    validata
             VAL num obs     2365
             TST data set
             TST num obs
             Dep. Var        fraud
             TRN % Events    5.346
             VAL % Events    19.281
M2           TRN data set    train
             TRN num obs     3595
             VAL data set    validata
             VAL num obs     2365
             TST data set
             TST num obs
             Dep. Var        fraud
             TRN % Events    20.389
             VAL % Events    19.281
M3           TRN data set    train50
             TRN num obs     1133
             VAL data set    validata
             VAL num obs     2365
             TST data set
             TST num obs
             Dep. Var        fraud
             TRN % Events    50.838
             VAL % Events    19.281
Slide 6
Requested Models: Names & Descriptions

Model #   Full Model Name                 Model Description
-1        ***                             Overall Models
-10       M1                              Raw 05pct
-10       M2                              Raw 20pct
-10       M3                              50pct
1         01_M1_NSMBL_TRN_LOGISTIC_NONE   Logistic TRN NONE Ensemble
2         02_M1_NSMBL_VAL_LOGISTIC_NONE   Logistic VAL NONE Ensemble
3         03_M1_TRN_BAGGING               Bagging TRN Bagging
4         04_M1_TRN_GRAD_BOOSTING         Gradient Boosting
5         05_M1_TRN_LOGISTIC_STEPWISE     Logistic TRN STEPWISE
6         06_M1_TRN_RFORESTS              Random Forests
7         07_M1_TRN_TREES                 Trees TRN Trees
8         08_M1_VAL_BAGGING               Trees VAL Trees
9         09_M1_VAL_GRAD_BOOSTING         Gradient Boosting
10        10_M1_VAL_LOGISTIC_STEPWISE     Logistic VAL STEPWISE
11        11_M1_VAL_RFORESTS              Random Forests
12        12_M1_VAL_TREES                 Trees VAL Trees
13        13_M2_NSMBL_TRN_LOGISTIC_NONE   Logistic TRN NONE Ensemble
14        14_M2_NSMBL_VAL_LOGISTIC_NONE   Logistic VAL NONE Ensemble
15        15_M2_TRN_BAGGING               Bagging TRN Bagging
16        16_M2_TRN_GRAD_BOOSTING         Gradient Boosting
17        17_M2_TRN_LOGISTIC_STEPWISE     Logistic TRN STEPWISE

And similarly for the rest of M2 and all of M3.
Slide 8
Top split level, nodes 2 and 3. Note: three M1 models (02, 03, 05) split on member_duration, but the corresponding M2 and M3 models split on no_claims; the M1 GB stands alone. Previously, the split was on no_claims only.
Slide 9
Same info, different categorization. Omitted next levels.
Slide 10
Omitted the rest for brevity. Some conclusions:
Extreme imbalance has caused a different initial split variable and therefore a different model structure, as opposed to the more balanced data sets.
In the more balanced cases, even the splitting value has mostly not changed. The probability of the event in the resulting nodes differs because of the different initial event rates.
The difference in splitting variables is not necessarily “BAD”. Note that the sample size for the more imbalanced data sets is smaller.
Slide 12
The M1 models choose a different important variable. The 50/50 trees (M3_TREES) selected just no_claims, while M2_TREES selected three additional predictors. BG, RF and GB are not similarly affected. The M1 trees have no important variables and are the most affected by the imbalance.
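As a hedged illustration (not the code behind the slides) of how such an importance comparison can be produced with scikit-learn, assuming predictor matrices X05, X20, X50 and targets y05, y20, y50 built from the three training samples (hypothetical names):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def importances(X, y):
    # Fit tree-based learners on one training sample and return their
    # impurity-based variable importances side by side.
    models = {
        "TREES": DecisionTreeClassifier(max_depth=3, random_state=1),
        "RFORESTS": RandomForestClassifier(n_estimators=200, random_state=1),
        "GRAD_BOOSTING": GradientBoostingClassifier(random_state=1),
    }
    out = {}
    for name, model in models.items():
        model.fit(X, y)
        out[name] = pd.Series(model.feature_importances_, index=X.columns)
    return pd.DataFrame(out)

# e.g., compare importances(X05, y05), importances(X20, y20), importances(X50, y50)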
Slide 14
The 50/50 resampled model stops earlier, but its VAL misclassification is higher than the raw model's.
Slide 15
Similar results (see the previous slide).
Slide 22
TRN GB has #3 no_claims as the most important variable; the others are flat.
Slide 23
VAL GB repeats no_claims as the most important variable and adds doctor_visits and optom_presc as positive effects. The corresponding logistic does not point to doctor_visits. The corresponding M1 and M3 models are almost identical.
Slide 25
#7 VAL GB is by far the most important.
Slide 26
The curves converge at the event prior. Forests perform very well on TRN but not on VAL.
Slide 27
#28: the VAL logistic works to bring down the positive GB VAL slope.
Slide 33
The M2 VAL ensemble is best, and M1 VAL GB is the best of the single models. The M3 VAL performance of the single models is lackluster.
Slide 35
Conclusion on re-sampling.
In this example, the 50/50 resampled M3 models yielded a smaller tree with no discernible difference in performance from its M2 counterpart. The M1 trees failed to perform, while the other M1 methods performed acceptably well.
Actual performance (for the best models) was not affected by 50/50 versus raw modeling. Extreme imbalance seriously affected the raw trees, but not the other variants.
The overall winner in all cases was GB when evaluated on VAL. Models suffer when the event prior is seriously imbalanced, except for GB.
Slide 37
XGBoost
Developed by Chen and Guestrin (2016): XGBoost: A Scalable Tree Boosting System.
Claims: faster and better than neural networks and Random Forests.
Uses second-order gradients of the loss function, obtained from a Taylor expansion of the loss, plugged into the same boosting algorithm for greater generalization. In addition, it transforms the loss function into a more sophisticated objective function containing regularization terms that penalize tree growth, with the penalty proportional to the size of the node weights, thus preventing overfitting.
More efficient than GB due to parallel computing on a single computer (about 10 times faster).
The algorithm takes advantage of an advanced decomposition of the objective function that allows it to outperform GB.
Not yet available in SAS. Available in R, Julia, Python and as a CLI.
The tool behind many champion models in recent competitions (Kaggle, etc.).
See also Foster's (2017) xgboostExplainer.
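For reference, the second-order approximation and the regularization term from Chen and Guestrin (2016) can be written (up to constants) as

\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2},

where g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}) and h_i = \partial^{2}_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}) are the first and second derivatives of the loss with respect to the previous prediction, T is the number of leaves of the new tree f_t, and w is its vector of leaf weights; \gamma penalizes tree growth and \lambda the size of the leaf weights.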
Slide 39
Comments
1) It is not immediately apparent what the weak classifier should be for GB (e.g., by varying depth in our case). Likewise, the number of iterations is a big issue. In our simple example in the first study, M6 GB was the best performer. Still, overall modeling benefited from ensembling all the methods, as measured by AUROC, cumulative lift, or the ensemble p-values.
2) The posterior probability ranges are vastly different, and thus the tendency to classify observations by the 0.5 threshold is too simplistic (see the prior adjustment sketched below).
3) The PDPs show that different methods find distinct multivariate structures. Interestingly, the ensemble p-values show a decreasing tendency for logistic and trees and a strong S-shaped tendency for M6 GB (first study), which could mean that M6 GB alone tends to overshoot its predictions.
4) GB is relatively unaffected by the 50/50 mixture.
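One standard adjustment, sketched here as an illustration (it is not given in the slides): when a model is trained on a resampled set with event prior \pi_s but the population prior is \pi, its posterior estimate \hat{p} can be corrected before any threshold is applied,

p_{\text{adj}} = \frac{\hat{p}\,\pi/\pi_s}{\hat{p}\,\pi/\pi_s + (1-\hat{p})\,(1-\pi)/(1-\pi_s)}.

For the 50/50 study (\pi_s = 0.5, \pi \approx 0.19), a raw score of 0.5 corresponds to an adjusted probability of about 0.19, which is one reason a fixed 0.5 cutoff is too simplistic.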
Slide 40
Comments
5) While for GB classification problems the predictions lie within [0, 1], for continuous-target problems the predictions can fall outside the range of the target variable, which causes headaches. This is because GB models the residual at each iteration, not the original target; it can lead to surprises, such as negative predictions when Y takes only non-negative values, contrary to the original tree algorithm.
6) The shrinkage parameter and early stopping (number of trees) act as regularizers, but their combined effect is not known and could be ineffective (a sketch follows this list).
7) If the shrinkage is too small and a large number of trees T is allowed, the model becomes large and expensive to compute, implement and understand.
8) Random Forests over-fitted. A larger study should vary its parameters for better validation.
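A minimal sketch of how the two regularizers in comment 6 can be combined, using scikit-learn for illustration (an assumption; it is not the tool used in the slides):

from sklearn.ensemble import GradientBoostingClassifier

# Small shrinkage with a generous cap on the number of trees, letting
# validation-based early stopping pick the effective number of iterations.
gb = GradientBoostingClassifier(
    learning_rate=0.05,       # shrinkage: smaller values need more trees
    n_estimators=2000,        # upper bound; early stopping decides the rest
    n_iter_no_change=25,      # stop after 25 iterations without VAL improvement
    validation_fraction=0.2,  # fraction of TRN held out for early stopping
    max_depth=3,              # depth of the weak learner
    random_state=1,
)
# gb.fit(X_train, y_train)    # hypothetical training data
# gb.n_estimators_            # trees actually fitted after early stopping

Because the two settings interact, the combined effect still has to be checked empirically, e.g., over a small grid of learning_rate values.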
Slide 41
Comments
9) Model interpretation is difficult in the case of BG, RF and GB (and not trivial for the other methods either). The PDPs for the logistic regression variables show monotonic relationships, while those of the GB variables are very nonlinear. PDPs for the other methods were not created.
Slide 42
Drawbacks of GB.
1) IT IS NOT MAGIC: it won't solve ALL modeling needs, but it is the best off-the-shelf tool. You still need to look for transformations, odd issues, missing values, etc.
2) As with all tree methods, categorical variables with many levels (e.g., zip codes) can make it impossible to obtain a model.
3) Memory requirements can be very large, especially with many iterations; a typical problem of ensemble methods.
4) A large number of iterations slows down obtaining predictions; on-line scoring may require a trade-off between complexity and the time available. Once GB is learned, parallelization certainly helps.
5) There is no simple algorithm to capture interactions, because of the base learners.
6) There are no simple rules to determine gamma, the number of iterations or the depth of the simple learner. One needs to try different combinations and possibly recalibrate over time.
7) Still, it is one of the most powerful methods available.
Slide 43
Un-reviewed
Catboost
DeepForest
gcForest
Use of tree methods for continuous target variable.
Naïve-Bayes
Bootstrapping.
…
Slide 44
2.11) References
Auslender L. (1998): Alacart, poor man's classification trees, NESUG.
Breiman L., Friedman J., Olshen R., Stone C. (1984): Classification and Regression Trees, Wadsworth.
Chen T., Guestrin C. (2016): XGBoost: A Scalable Tree Boosting System.
Chipman H., George E., McCulloch R. (2010): BART: Bayesian Additive Regression Trees, The Annals of Applied Statistics.
Foster D. (2017): New R package that makes XGBoost interpretable, https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
Friedman J. (2001): Greedy function approximation: a gradient boosting machine, Annals of Statistics, 29, 1189–1232. doi:10.1214/aos/1013203451
Paluszynska A. (2017): Structural mining and knowledge extraction from random forest with applications to The Cancer Genome Atlas project, https://www.google.com/url?q=https%3A%2F%2Frawgit.com%2FgeneticsMiNIng%2FBlackBoxOpener%2Fmaster%2FrandomForestExplainer_Master_thesis.pdf&sa=D&sntz=1&usg=AFQjCNHTJONZK24LioDeOB0KZnwLkn98fw and https://mi2datalab.github.io/randomForestExplainer/
Quinlan J. R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
Slide 45
Earlier literature on combining methods:
Winkler R. L. and Makridakis S. (1983): The combination of forecasts, J. R. Statist. Soc. A, 146(2), 150-157.
Makridakis S. and Winkler R. L. (1983): Averages of forecasts: some empirical results, Management Science, 29(9), 987-996.
Bates J. M. and Granger C. W. (1969): The combination of forecasts, Operational Research Quarterly, 451-468.
Slide 47
1) Can you explain in nontechnical language the idea of maximum likelihood estimation? Of SVM (unreviewed in class)?
2) Contrast GB with RF.
3) In what way is over-fitting like a glove? Like an umbrella?
4) Would ensemble models always improve on individual models?
5) Would you select variables by way of tree methods to use later in linear methods? Yes? No? Why?
6) In Tree regression, final predictions are means. Could better
predictions be obtained by regression model instead? A logistic for a
binary target? Discuss.
7) There are 9 coins, 8 of which are of equal weight, and there’s one
scale. How many steps until you identify the odd coin?
8) Why are manhole covers round?
9) You obtain 100% accuracy in validation of classification model. Are
you a genius? Yes, no, why?
10) If 85% of witnesses saw a blue car during the accident and 15% saw a red car, what is the probability that the car is blue?
Slide 48
Counter-interview questions (you ask the interviewer).
1) How do you measure the height of a building with just a barometer? Give at least three answers.
2) Two players, A and B, take turns saying a positive integer from 1 to 9. The numbers are added up; whoever reaches 100 or above loses. Is there a strategy to never lose? (Aborting the game midway is acceptable, but give your reasoning.)
3) There are two jugs, one that holds 5 gallons and the other 3, and a nearby water fountain. How do you put exactly 4 gallons (a deviation of less than one ounce is fine) in the 5-gallon jug?