Leonardo Auslender – Copyright 2004 / Copyright 2018
10/7/2019
Two studies:
2.8.b: Raw data, GB without constraints on its parameters, compared to competing methods.
2.8.c: Comparison of methods, focusing on whether raw vs. 50/50 re-sampling makes a difference.
Partial Dependency Plots (PDP).
Due to GB's (and other methods') black-box nature, we need tools to study model structure: these plots show the mean model score at each value of predictor X, with that value matched to the entire data set of the remaining predictors.
Graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome.
Formally, the PDP of F(x1, x2, …, xp) on X is E(F) over all variables except X. Thus, for a given value of X, the PDP is the average of the training-set predictions with X held constant.
Since GB, Boosting, Bagging, etc. are BLACK BOX models, PDPs are used to obtain model interpretation. They are also useful for logistic models.
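The definition above can be sketched in a few lines: for each grid value of the predictor of interest, substitute that value into every row, score the model, and average. All names here (partial_dependence, DummyModel, x1, x2) are illustrative, not from this deck.

```python
# Minimal sketch of computing a PDP "by hand".
import numpy as np
import pandas as pd

def partial_dependence(model, X, feature, grid):
    """Average model score at each fixed value of `feature`."""
    averages = []
    for v in grid:
        X_mod = X.copy()
        X_mod[feature] = v                    # hold the feature constant at v
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

class DummyModel:
    """Stand-in scorer: logistic in x1 only, so the PDP traces a sigmoid."""
    def predict_proba(self, X):
        p = 1.0 / (1.0 + np.exp(-X["x1"].to_numpy()))
        return np.column_stack([1.0 - p, p])

X = pd.DataFrame({"x1": [0.2, -1.0, 0.5], "x2": [1.0, 2.0, 3.0]})
pdp = partial_dependence(DummyModel(), X, "x1", grid=[-2.0, 0.0, 2.0])
# pdp[1] == 0.5, since sigmoid(0) = 0.5 for every row
```

The same loop works for any scorer with a predict_proba-style interface, which is why PDPs apply equally to GB, Bagging, or logistic models.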
Modifications of Partial Dependency Plots (PDP).
In a PDP, each value of X (or pair of X's) is matched to every observation of the complementary X's, scores (predicted Y's) are obtained and then averaged for each X value. It is known that when predictors are correlated, PDPs are not informative. Ergo, partial out effects as in the following possible options:
1) Obtain Q1 and Q3 in addition to the average value, to verify the stability of the average.
2) As in linear regression, create models from selected variables but with fully orthogonalized predictors (method proposed by yours truly). Could be called Partialized PDP, or PPDP.
3) When obtaining 3-D PDPs for pairs of variables, obtain Marginal PDPs, i.e., the average probability at each var2 point along the var1 range. The reason is that 3-D plots typically extrapolate into low-density areas ➔ misleading local curves are possible.
4) I(ndividual) C(onditional) E(xpectation) plots: PDP curves for individual observations or groups of observations. Grouping is user-defined: quantiles of posterior probability, a clustering solution, levels of a given variable, specific individuals, etc.
Analytical problem to investigate.
Optical health care insurance fraud. Longer care typically involves higher treatment costs, and the insurance company has to set up reserves as soon as a case is opened. Sometimes doctors are involved in fraud.
Aim: predict fraudulent charges ➔ classification problem; use a battery of models and compare them. Below, the original data (M1 models). Focus is on comparisons across models (see earlier chapters for individual model analytics). For brevity's sake, mean and median ensembles are omitted.
Model M1 — data summary:

  Item            Information
  TRN data set    train
  TRN num obs     3595
  VAL data set    validata
  VAL num obs     2365
  TST data set    —
  TST num obs     —
  Dep. Var        fraud
  TRN % Events    20.389
  VAL % Events    19.281
E.g., 08_M1_VAL_BAGGING: the 8th model of the M1 data set case, on Validation data, using Bagging as the modeling technique.
Requested Models: Names & Descriptions

  #  Full Model Name                 Model Description
 --  Overall Models                  M1, 20 pct prior
  1  01_M1_GB_TRN_TREES              Tree repr. of Gradient Boosting
  2  02_M1_LG_TRN_TREES              Tree repr. of Logistic STEPWISE
  3  03_M1_NSMBL_LG_TRN_TREES        Tree repr. of Logistic NONE ensemble
  4  04_M1_TRN_BAGGING               Bagging, TRN
  5  05_M1_TRN_GRAD_BOOSTING         Gradient Boosting, TRN
  6  06_M1_TRN_LOGISTIC_NONE_NSMBL   Logistic NONE ensemble, TRN
  7  07_M1_TRN_LOGISTIC_STEPWISE     Logistic STEPWISE, TRN
  8  08_M1_TRN_NSMBL_AVG             Ensemble AVG, TRN
  9  09_M1_TRN_NSMBL_MED             Ensemble MED, TRN
 10  10_M1_TRN_RFORESTS              Random Forests, TRN
 11  11_M1_TRN_TREES                 Trees, TRN
 12  12_M1_VAL_BAGGING               Bagging, VAL
 13  13_M1_VAL_GRAD_BOOSTING         Gradient Boosting, VAL
 14  14_M1_VAL_LOGISTIC_NONE_NSMBL   Logistic NONE ensemble, VAL
 15  15_M1_VAL_LOGISTIC_STEPWISE     Logistic STEPWISE, VAL
 16  16_M1_VAL_NSMBL_AVG             Ensemble AVG, VAL
 17  17_M1_VAL_NSMBL_MED             Ensemble MED, VAL
 18  18_M1_VAL_RFORESTS              Random Forests, VAL
 19  19_M1_VAL_TREES                 Trees, VAL
For models other than Trees themselves, posterior probabilities were modeled via an interval-valued target variable (this includes logistic and ensembles).
For simplicity, just the first 2 levels of the trees are shown.
Notation: M1_GB_TRN_TREES: data M1, Tree simulation of a Gradient Boosting run (GB). BG: Bagging, RF: Random Forests, LG: logistic, NSMBL: ensemble.
Intention: obtain a general idea of the tree representation for comparison to the standard tree model.
Next page: small detail for BG (Bagging), GB (Gradient Boosting) and Trees themselves. Later, a graphical comparison of variables + splits at each tree level.
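The tree-representation idea can be sketched as follows: fit a shallow tree to the black-box model's predicted probabilities rather than to the 0/1 events. Below, a single best-split stump in plain NumPy on made-up data (the function name, data, and numbers are all illustrative; the deck's runs use 2-level trees).

```python
# Sketch: find the one split of x that best fits predicted probabilities p.
import numpy as np

def best_split(x, p):
    """Threshold on x that most reduces squared error of p (regression stump)."""
    order = np.argsort(x)
    x_s, p_s = x[order], p[order]
    best = (np.inf, None)
    for i in range(1, len(x_s)):
        if x_s[i] == x_s[i - 1]:
            continue                          # no split between equal values
        left, right = p_s[:i], p_s[i:]
        sse = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if sse < best[0]:
            best = (sse, (x_s[i - 1] + x_s[i]) / 2.0)
    return best[1]

# Illustrative data: probability jumps once no_claims exceeds 0
no_claims = np.array([0, 0, 0, 1, 1, 2], dtype=float)
prob = np.array([0.05, 0.10, 0.08, 0.60, 0.70, 0.80])
split = best_split(no_claims, prob)   # recovers the 0.5 threshold
```

Applied recursively to each side, this yields the 2-level tree representations compared in the following slides.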
06: the actual Tree. Top splitter is No_claims, but LG splits at 1.5. Note the different event probabilities (bar heights).
RF pursues a different structure search at level 2. See the next slide as well.
Conclusion on tree representations, I
No_claims at 0.5 is certainly the top splitter for most TREE models, but notice that event probabilities diverge (because RF, GB and BG model posterior probability, not a binary event, and thus carry information from previous models). Later splits diverge in predictors and split values across models. LG finds a completely different structure and starts with no_claims at 1.5. Thus, for tree-based models, the existence of a claim raises suspicion of fraud, while for logistic a higher threshold is required.
Ensemble models are a mixture of models ➔ the typical interpretability of a single model is doubtful when reality is complex.
It is important to view each tree model independently to gauge interpretability. Note that the ensemble's primary splitter comes from RF, but RF is not the best model (it over-fits badly); it is chosen because all methods minimize misclassification.
And it is important to view these findings in terms of variable importance and "best" model choice.
Conclusion on tree representations, II
Most importantly, it looks like RF wins; should we stop now? (Validation results not shown, to add to the suspense.)
DO NOT RUSH YOUR CONCLUSIONS, and keep on reading.
Importance Measures for Tree-based Methods.
All methods agree on No_claims, but not so much on the other variables.
For GB and BG all predictors matter; RF disparages num_members, and Trees doctor_visits. Comparing GB and RF: GB allocates more importance to all predictors (other than no_claims) than the other methods do, which implies that the structure found by RF is simpler.
Most important variable; similar shapes in both cases. Note the "logistic"-like shape of one and the jagged shape of the other, plus the flatness at probability 0.8 for values >= 5.
Num_members was eliminated from the logistic stepwise model. GB's jagged relationship ➔ there is a strong interaction effect with other predictors.
Pair-wise PDPs for some variables.
Fraud is concentrated at lower membership time. TRN Stepwise Logistic (left), GB (right); correlation = 0.02846. Similar but not identical.
Fraud is concentrated at a smaller number of members and a higher number of claims. GB.
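A pair-wise PDP like this one extends the single-variable loop to a grid over two predictors, averaging the model scores at each grid cell. DummyModel and all data below are illustrative stand-ins, not the deck's fraud data.

```python
# Sketch of a pair-wise (2-D) partial dependence surface.
import numpy as np
import pandas as pd

def partial_dependence_2d(model, X, f1, f2, grid1, grid2):
    """Average model score over the data at each (v1, v2) grid cell."""
    surface = np.empty((len(grid1), len(grid2)))
    for i, v1 in enumerate(grid1):
        for j, v2 in enumerate(grid2):
            X_mod = X.copy()
            X_mod[f1], X_mod[f2] = v1, v2    # fix both features at the cell
            surface[i, j] = model.predict_proba(X_mod)[:, 1].mean()
    return surface

class DummyModel:
    """Illustrative scorer: risk rises with claims, falls with members."""
    def predict_proba(self, X):
        z = (X["no_claims"] - X["num_members"] + 0.1 * X["x3"]).to_numpy()
        p = 1.0 / (1.0 + np.exp(-z))
        return np.column_stack([1.0 - p, p])

X = pd.DataFrame({"no_claims": [0.0, 0.0], "num_members": [0.0, 0.0],
                  "x3": [-1.0, 1.0]})
surface = partial_dependence_2d(DummyModel(), X, "no_claims", "num_members",
                                grid1=[0.0, 1.0], grid2=[0.0, 1.0])
# surface rises along the no_claims axis and falls along num_members
```

The surface matrix is what a contour plot displays; the marginal PDPs of option 3 earlier are its row or column averages.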
Model #4 (RF) seems best at fitting the event probability once the other predictors' effects are marginalized away for TRN, but the VAL results point to GB instead.
Conclusions on PDPs
1) From the ensemble PDPs, it is obvious that RF fails in validation. All the ensemble power rests strongly on GB, and on logistic with a downward slope.
2) Individual-variable PDPs show a uniform relationship for the variables in logistic, while GB shows fuzzy and nonlinear structures.
3) The contour plots for pairs of variables (GB) allow us to focus on ranges of importance. For instance, No_claims and Member_duration concentrate important information at the low ends of their respective ranges.
4) Still, it is not possible (at present) to obtain simple interpretable graphs that capture the full complexity of GB models. Logistic models are easier to understand, though not fully easy.
Tree-based methods do not necessarily reach a top probability of 1 or a lowest of 0.
Note that TRN and VAL ranks do not match. Models ranked lower on VAL tend to overfit more.
GOF ranks (TRN): each model's rank on nine GOF measures, plus the unweighted mean and median of those ranks.

Model Name                     AUROC  ASE  Class  CumLift  CumResp  Gini  P-R  Prec  Tjur | Unw.  Unw.
                                            Rate  3rd bin  Rate 3rd       AUC  Rate   R2  | Mean  Median
02_M1_TRN_BAGGING                  6    8      8        6        6     6    8     4     8 | 6.67       6
03_M1_TRN_GRAD_BOOSTING            3    3      7        5        5     3    3     2     6 | 4.11       3
04_M1_TRN_LOGISTIC_NONE_NSMBL      1    1      4        2        2     1    1     5     1 | 2.00       1
05_M1_TRN_LOGISTIC_STEPWISE        8    7      3        8        8     8    6     8     7 | 7.00       8
06_M1_TRN_NSMBL_AVG                4    5      1        3        3     4    5     7     4 | 4.00       4
07_M1_TRN_NSMBL_MED                5    4      2        4        4     5    4     6     5 | 4.33       4
08_M1_TRN_RFORESTS                 2    2      5        1        1     2    2     1     3 | 2.11       2
09_M1_TRN_TREES                    7    6      6        7        7     7    7     3     2 | 5.78       7

(ASE = Avg Square Error; CumLift 3rd bin = Cumulative Lift, 3rd bin; CumResp Rate 3rd = Cumulative Response Rate, 3rd bin; Tjur R2 = Cramer/Tjur R-square.)
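The mean and median columns are just unweighted aggregates of each model's per-measure ranks; a quick check using two rows copied from the TRN table:

```python
# Verify the unweighted mean/median rank columns of the GOF table.
import numpy as np

ranks = {  # nine GOF-measure ranks per model, from the TRN table
    "03_M1_TRN_GRAD_BOOSTING": [3, 3, 7, 5, 5, 3, 3, 2, 6],
    "08_M1_TRN_RFORESTS":      [2, 2, 5, 1, 1, 2, 2, 1, 3],
}
summary = {m: (round(float(np.mean(r)), 2), float(np.median(r)))
           for m, r in ranks.items()}
# 03_M1_TRN_GRAD_BOOSTING -> (4.11, 3.0); 08_M1_TRN_RFORESTS -> (2.11, 2.0)
```

Unweighted aggregation treats all nine measures as equally important; a user who cares mainly about, say, lift could weight the ranks instead.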
GOF ranks (VAL): same nine GOF measures, with unweighted mean and median of the ranks.

Model Name                     AUROC  ASE  Class  CumLift  CumResp  Gini  P-R  Prec  Tjur | Unw.  Unw.
                                            Rate  3rd bin  Rate 3rd       AUC  Rate   R2  | Mean  Median
10_M1_VAL_BAGGING                  5    6      7        5        5     5    6     3     7 | 5.44       5
11_M1_VAL_GRAD_BOOSTING            2    2      5        1        1     2    2     1     2 | 2.00       2
12_M1_VAL_LOGISTIC_NONE_NSMBL      1    1      4        2        2     1    1     4     1 | 1.89       1
13_M1_VAL_LOGISTIC_STEPWISE        6    5      3        6        6     6    5     7     4 | 5.33       6
14_M1_VAL_NSMBL_AVG                3    3      1        3        3     3    3     6     6 | 3.44       3
15_M1_VAL_NSMBL_MED                4    4      2        4        4     4    4     5     5 | 4.00       4
16_M1_VAL_RFORESTS                 8    8      8        8        8     8    8     8     8 | 8.00       8
17_M1_VAL_TREES                    7    7      6        7        7     7    7     2     3 | 5.89       7
Based on this methodology, the ensemble is the winner, and GB is the single best model. Alternative selection methods for best models are user-dependent; below is just one approach.
Conclusions
At least with the present defaults of RF in this presentation, it has badly over-fitted. The best overall model is the ensemble, and the best single model is given by Gradient Boosting.
The user should decide which metric to use for judging goodness. Here, a simple unweighted ranking of nine measures was used.
Since there was no financial information, models could not be measured in terms of profits. The K-S chart (not recommended) shows different cut-off points per model.