2. Leonardo Auslender – Copyright 2004, 2018 – 7/3/2018
Two studies:
2.8.b: Raw data, GB without constraints on its parameters, compared to the competing methods.
2.8.c: Comparison of methods, focusing on whether raw vs. 50/50 re-sampling makes a difference.
4.
Partial Dependence Plots (PDP).
Because GB (and other methods) are black boxes, these plots show the effect of a predictor X on the modeled response once all other predictors have been marginalized (integrated out). The marginalized predictors are usually fixed at a constant value, typically the mean.
The graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome.
Formally, the PDP of F(x1, x2, …, xp) on X is E(F) over all variables except X. Thus, for given values of X, the PDP is the average of the training predictions with X kept constant.
Since GB, Boosting, Bagging, etc. are BLACK BOX models, PDPs are used to obtain model interpretation. They are also useful for logistic models.
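The averaging just described can be sketched in a few lines. This is not from the deck: a toy `predict` function stands in for a fitted GB model, and the grid and data are made up for illustration.

```python
import numpy as np

def partial_dependence(predict, X, j, grid):
    """PDP of `predict` on feature j: for each grid value v, set column j
    of every training row to v and average the resulting predictions,
    thereby marginalizing the remaining features over the training data."""
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                      # hold feature j constant at v
        pdp.append(predict(Xv).mean())    # average over all other features
    return np.array(pdp)

# Toy black-box model: probability rises with feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
predict = lambda X: 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] + 0.3 * X[:, 1])))

grid = np.linspace(-2, 2, 9)
pdp = partial_dependence(predict, X, j=0, grid=grid)
print(pdp)  # increases with the grid: the isolated effect of feature 0
```

A jagged or non-monotone PDP curve from this procedure is exactly the kind of shape the later GB plots show.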
6.
Analytical problem to investigate.
Health-care insurance fraud among optical-care patients. Longer care typically involves higher treatment costs, and the insurance company has to set up reserves as soon as a case is opened. Sometimes doctors are involved in the fraud.
Aim: predict fraudulent charges (a classification problem); use a battery of models and compare them. Below left, the original data (M1 models). The focus is on comparisons across models (see earlier chapters for individual model analytics). For brevity's sake, the mean and median ensembles are omitted.
Model Name | Item         | Information
M1         | TRN data set | train
           | TRN num obs  | 3595
           | VAL data set | validata
           | VAL num obs  | 2365
           | TST data set |
           | TST num obs  |
           | Dep. Var     | fraud
           | TRN % Events | 20.389
           | VAL % Events | 19.281
7.
E.g., 08_M1_VAL_BAGGING: the 8th model of the M1 data set case, on the Validation partition, using Bagging as the modeling technique.
Requested Models: Names & Descriptions.

Model # | Full Model Name               | Model Description
***     | Overall Models                |
-1      | M1                            | Raw 20pct
1       | 01_M1_NSMBL_TRN_AVG           | Ensemble AVG
2       | 02_M1_NSMBL_TRN_LOGISTIC_NONE | Logistic TRN NONE Ensemble
3       | 03_M1_NSMBL_TRN_MED           | Ensemble MED
4       | 04_M1_NSMBL_VAL_AVG           | Ensemble AVG
5       | 05_M1_NSMBL_VAL_LOGISTIC_NONE | Logistic VAL NONE Ensemble
6       | 06_M1_NSMBL_VAL_MED           | Ensemble MED
7       | 07_M1_TRN_BAGGING             | Bagging
8       | 08_M1_TRN_GRAD_BOOSTING       | Gradient Boosting
9       | 09_M1_TRN_LOGISTIC_STEPWISE   | Logistic TRN STEPWISE
10      | 10_M1_TRN_RFORESTS            | Random Forests
11      | 11_M1_TRN_TREES               | Trees
12      | 12_M1_VAL_BAGGING             | Bagging
13      | 13_M1_VAL_GRAD_BOOSTING       | Gradient Boosting
14      | 14_M1_VAL_LOGISTIC_STEPWISE   | Logistic VAL STEPWISE
15      | 15_M1_VAL_RFORESTS            | Random Forests
16      | 16_M1_VAL_TREES               | Trees
8.
For models other than Trees themselves, the modeled posterior probabilities were fit via an interval-valued target variable (this includes logistic and the ensembles).
For simplicity, only the first two levels of each tree are shown.
Notation: M1_GB_TRN_TREES: data M1, tree simulation of the Gradient Boosting run (GB). BG: Bagging, RF: Random Forests, LG: logistic, NSMBL: ensemble.
Intention: obtain a general idea of the tree representation, for comparison to a standard tree model.
Next page: small detail for BG (Bagging), GB (Gradient Boosting), and Trees themselves. Later, a graphical comparison of variables and splits at each tree level.
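The idea of fitting a tree to a black box's posterior probabilities can be illustrated with a hand-rolled depth-1 surrogate. This is a sketch, not the deck's actual procedure; the data and the `best_split` helper are invented for illustration.

```python
import numpy as np

def best_split(X, p):
    """Depth-1 regression-tree surrogate: find the single (feature, threshold)
    pair that best explains black-box posterior probabilities p (minimum SSE).
    This mimics how a tree 'represents' a black-box model's predictions."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = p[X[:, j] <= t], p[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, t, sse)
    return best[0], best[1]

# Toy posterior driven mainly by a 'no_claims'-like count in column 0.
rng = np.random.default_rng(1)
X = np.column_stack([rng.integers(0, 4, 300), rng.normal(size=300)])
p = np.where(X[:, 0] >= 1, 0.6, 0.1) + rng.normal(0, 0.02, 300)

j, t = best_split(X, p)
print(j, t)  # splits on the count feature, between counts 0 and 1
```

A real implementation would recurse on each side to produce the two levels shown in the slides; the surrogate's splitter and threshold are what get compared across models.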
12.
06, the actual Tree. The top splitter is No_claims, but LG splits at 1.5. Note the different event probabilities (bar heights).
14.
RF pursues a different structure search at level 2. See the next slide as well.
25.
Conclusion on tree representations, I
No_claims at 0.5 is certainly the top splitter for most TREE models, but notice that the event probabilities diverge (because RF, GB, and BG model a posterior probability, not a binary event, and thus carry information from previous models). Later splits diverge in predictors and split values across models. LG finds a completely different structure and starts with no_claims at 1.5. Thus, for tree-based models the mere existence of a claim raises suspicion of fraud, while logistic requires a higher threshold.
Ensemble models are mixtures of models; the interpretability typical of a single model is doubtful when reality is complex.
It is important to view each tree model independently to gauge interpretability. Note that the ensemble's primary splitter comes from RF, yet RF is not the best model (it over-fits badly); it is chosen because all the methods minimize misclassification.
And it is important to view these findings in terms of variable importance and "best" model choice.
26.
Conclusion on tree representations, II
Most importantly, it looks like RF wins. Should we stop now? (Validation results are not shown, to add to the suspense.)
DO NOT RUSH YOUR CONCLUSIONS; keep on reading.
27.
Importance Measures for Tree-based Methods.
28.
All methods agree on No_claims; there is much less agreement on the other variables.
29.
For GB and BG all predictors matter; RF disparages num_members, and Trees disparages doctor_visits.
Comparing GB and RF, GB allocates more importance to all predictors (other than no_claims) than the other methods do, which implies that the structure found by RF is simpler.
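One generic way to compute such importances for any black box is permutation importance. Note this is a different measure than the tree split-based importances the deck reports; it is shown here as a minimal sketch with an invented toy model.

```python
import numpy as np

def permutation_importance(predict, X, y, j, metric, n_rep=10, seed=0):
    """Importance of feature j = average drop in `metric` when the link
    between feature j and the target is broken by shuffling column j."""
    rng = np.random.default_rng(seed)
    base = metric(y, predict(X))
    drops = []
    for _ in range(n_rep):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])            # destroy feature j's information
        drops.append(base - metric(y, predict(Xp)))
    return float(np.mean(drops))

# Toy model in which feature 0 matters and feature 1 does not.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(float)
predict = lambda X: 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))
acc = lambda y, p: float(((p > 0.5) == (y > 0.5)).mean())

imp0 = permutation_importance(predict, X, y, 0, acc)
imp1 = permutation_importance(predict, X, y, 1, acc)
print(imp0, imp1)  # large drop for feature 0, zero drop for feature 1
```

Because it only needs predictions, the same function could be run against each of the deck's models to produce comparable importance columns.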
31.
Tree methods find no_claims the most important; logistic finds most predictors important.
Validation results show the effects of over-fitting (variable doctor_visits).
34.
Partial Dependence Plots for Logistic and Gradient Boosting Non-Ensemble Models.
35.
Most important variable; similar shapes in both cases. Note the "logistic"-like shape of one and the jagged shape of the other, plus the flatness at probability 0.8 for values >= 5.
36.
Num_members was eliminated from the logistic stepwise model. GB's jagged relationship here suggests a strong interaction effect with other predictors.
37.
Pair-wise PDPs for GB, for some variables.
38.
Fraud is concentrated on just one or two claims combined with lower membership time (the two most important variables).
39.
Fraud is concentrated on a smaller number of members and a higher number of claims.
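The pair-wise contours come from the same marginalization as the one-dimensional PDPs, only with two features fixed jointly. A minimal sketch, again with an invented `predict` standing in for the fitted GB model:

```python
import numpy as np

def pdp_2d(predict, X, j, k, grid_j, grid_k):
    """Pair-wise PDP: average prediction over the training rows with
    features j and k jointly fixed at each point of a 2-D grid."""
    Z = np.empty((len(grid_j), len(grid_k)))
    for a, vj in enumerate(grid_j):
        for b, vk in enumerate(grid_k):
            Xv = X.copy()
            Xv[:, j] = vj
            Xv[:, k] = vk
            Z[a, b] = predict(Xv).mean()
    return Z          # contour-plot Z against (grid_j, grid_k)

# Toy surface: event probability high when feature 0 is high and feature 1 low,
# loosely mimicking the high-claims / low-duration corner in the slides.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
predict = lambda X: 1 / (1 + np.exp(-(X[:, 0] - X[:, 1] + 0.2 * X[:, 2])))

Z = pdp_2d(predict, X, 0, 1, np.linspace(-2, 2, 5), np.linspace(-2, 2, 5))
print(Z.shape)  # (5, 5); Z[-1, 0] is the high-j / low-k corner
```

Feeding `Z` to any contour-plotting routine reproduces the kind of concentration-at-low-ranges picture the slides describe for No_claims and Member_duration.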
41.
Model #4 (RF) seems best at fitting the event probability once the other predictors' effects are marginalized away for TRN, but the VAL results point to GB instead.
43.
Conclusions on PDPs
1) From the ensemble PDPs, it is obvious that RF fails in validation. The ensemble's power rests strongly on GB, and on logistic with a downward slope.
2) Individual-variable PDPs show a uniform relationship for the variables in logistic, while GB shows fuzzy and nonlinear structures.
3) The contour plots for pairs of variables (GB) allow focusing on the ranges that matter. For instance, No_claims and Member_duration concentrate the important information at the low ends of their respective ranges.
4) Still, it is not possible (at present) to obtain simple interpretable graphs that capture the full complexity of GB models. Logistic models are easier to understand, though not fully easy.
45.
Tree-based methods do not necessarily reach a top probability of 1 or a bottom probability of 0.
51.
Note that the TRN and VAL ranks do not match; models ranked lower in VAL tend to over-fit more.
GOF ranks (TRN), one rank per GOF measure (1 = best):

Model Name                    | AUROC | Avg Sq Error | Cum Lift 3rd bin | Cum Resp Rate 3rd bin | Gini | Rsquare Cramer-Tjur | Unw. Mean | Unw. Median
01_M1_NSMBL_TRN_LOGISTIC_NONE |   1   |      1       |        1         |           1           |  1   |          1          |   1.00    |    1.00
03_M1_TRN_BAGGING             |   4   |      4       |        4         |           4           |  4   |          4          |   4.00    |    4.00
04_M1_TRN_GRAD_BOOSTING       |   3   |      2       |        3         |           3           |  3   |          3          |   2.83    |    3.00
05_M1_TRN_LOGISTIC_STEPWISE   |   6   |      6       |        6         |           6           |  6   |          6          |   6.00    |    6.00
06_M1_TRN_RFORESTS            |   2   |      3       |        2         |           2           |  2   |          5          |   2.67    |    2.00
07_M1_TRN_TREES               |   5   |      5       |        5         |           5           |  5   |          2          |   4.50    |    5.00

GOF ranks (VAL):

Model Name                    | AUROC | Avg Sq Error | Cum Lift 3rd bin | Cum Resp Rate 3rd bin | Gini | Rsquare Cramer-Tjur | Unw. Mean | Unw. Median
02_M1_NSMBL_VAL_LOGISTIC_NONE |   1   |      1       |        2         |           2           |  1   |          1          |   1.33    |    1.00
08_M1_VAL_BAGGING             |   5   |      6       |        5         |           5           |  5   |          5          |   5.17    |    5.00
09_M1_VAL_GRAD_BOOSTING       |   2   |      2       |        1         |           1           |  2   |          2          |   1.67    |    2.00
10_M1_VAL_LOGISTIC_STEPWISE   |   3   |      3       |        4         |           4           |  3   |          4          |   3.50    |    3.50
11_M1_VAL_RFORESTS            |   6   |      5       |        6         |           6           |  6   |          6          |   5.83    |    6.00
12_M1_VAL_TREES               |   4   |      4       |        3         |           3           |  4   |          3          |   3.50    |    3.50
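The Unw. Mean and Unw. Median columns are just row-wise aggregates of the per-measure ranks. The sketch below recomputes them from the VAL table (model names shortened for readability):

```python
import numpy as np

# VAL ranks from the GOF table: rows = models, cols = the six GOF measures
# (AUROC, Avg Sq Error, Cum Lift 3rd bin, Cum Resp Rate 3rd bin, Gini, R2 C-T).
models = ["NSMBL_LOGISTIC", "BAGGING", "GRAD_BOOSTING",
          "LOGISTIC_STEPWISE", "RFORESTS", "TREES"]
ranks = np.array([[1, 1, 2, 2, 1, 1],
                  [5, 6, 5, 5, 5, 5],
                  [2, 2, 1, 1, 2, 2],
                  [3, 3, 4, 4, 3, 4],
                  [6, 5, 6, 6, 6, 6],
                  [4, 4, 3, 3, 4, 3]])

unw_mean = ranks.mean(axis=1)          # unweighted mean rank across measures
unw_median = np.median(ranks, axis=1)  # unweighted median rank
for m, mu, md in zip(models, unw_mean, unw_median):
    print(f"{m:18s} mean={mu:.2f} median={md:.2f}")
```

The ensemble's 1.33 mean and GB's 1.67 fall out directly, which is the basis for the "best overall" vs. "best single model" claim on the next slide.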
52.
Based on this methodology, the ensemble is the winner, and GB is the single best model. Alternative methods for selecting the best model are user-dependent.
61.
Conclusions
At least with the RF defaults used in this presentation, RF has badly over-fitted. The best overall model is the ensemble, and the best single model is Gradient Boosting.
The user should decide which metric to use to judge goodness; here, a simple unweighted ranking of 5 measures was used.
Since there was no financial information, the models could not be measured in terms of profits. The K-S chart (not recommended) shows different cut-off points per model.
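The K-S cut-off mentioned above is the score at which the cumulative distributions of events and non-events are farthest apart. A minimal sketch on simulated scores (the data here are synthetic, not the deck's):

```python
import numpy as np

def ks_statistic(y, p):
    """Kolmogorov-Smirnov separation: the maximum gap between the cumulative
    score distributions of events and non-events. The score at the argmax is
    the model-specific cut-off point the K-S chart displays."""
    order = np.argsort(p)
    y = y[order]
    cum_event = np.cumsum(y) / y.sum()          # CDF of scores among events
    cum_non = np.cumsum(1 - y) / (1 - y).sum()  # CDF among non-events
    gap = np.abs(cum_event - cum_non)
    i = int(gap.argmax())
    return float(gap[i]), float(p[order][i])

# Synthetic scores that are informative about the event indicator.
rng = np.random.default_rng(4)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)

ks, cut = ks_statistic(y, p)
print(ks, cut)  # positive KS separation; cut-off varies model by model
```

Because the argmax depends on each model's score distribution, two models with similar KS values can still imply very different operating cut-offs, which is why the slide flags the chart as unsuited for model selection.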