2. Leonardo Auslender – Copyright 2004, 2018 – 7/3/2018
Two studies:
2.8.b: Raw data, GB without constraints on its parameters, compared to the competing methods.
2.8.c: Comparison of methods, focusing on whether raw vs. 50/50 re-sampling makes a difference.
4.
Partial Dependence Plots (PDP).
Because GB (and other methods) are black boxes, these plots show the effect of a predictor X on the modeled response once all other predictors have been marginalized (integrated out). The marginalized predictors are usually fixed at a constant value, typically the mean.
The graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome.
Formally, the PDP of F(x1, x2, …, xp) on X is E(F) over all variables except X. Thus, for given values of X, the PDP is the average of the training predictions with X kept constant.
Since GB, Boosting, Bagging, etc. are BLACK BOX models, PDPs are used to obtain model interpretation. They are also useful for logistic models.
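The averaging just described can be sketched in a few lines. This is not from the deck: a toy `predict` function stands in for a fitted GB model, and the grid and data are made up for illustration.

```python
import numpy as np

def partial_dependence(predict, X, j, grid):
    """PDP of `predict` on feature j: for each grid value v, set column j
    of every training row to v and average the resulting predictions,
    thereby marginalizing the remaining features over the training data."""
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                      # hold feature j constant at v
        pdp.append(predict(Xv).mean())    # average over all other features
    return np.array(pdp)

# Toy black-box model: probability rises with feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
predict = lambda X: 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] + 0.3 * X[:, 1])))

grid = np.linspace(-2, 2, 9)
pdp = partial_dependence(predict, X, j=0, grid=grid)
print(pdp)  # increases with the grid: the isolated effect of feature 0
```

A jagged or non-monotone PDP curve from this procedure is exactly the kind of shape the later GB plots show.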
6.
Analytical problem to investigate.
Health-care insurance fraud among optical-care patients. Longer care typically involves higher treatment costs, and the insurance company has to set up reserves as soon as a case is opened. Sometimes doctors are involved in the fraud.
Aim: predict fraudulent charges (a classification problem); use a battery of models and compare them. Below left, the original data (M1 models). The focus is on comparisons across models (see earlier chapters for individual model analytics). For brevity's sake, the mean and median ensembles are omitted.
Model Name | Item         | Information
M1         | TRN data set | train
           | TRN num obs  | 3595
           | VAL data set | validata
           | VAL num obs  | 2365
           | TST data set |
           | TST num obs  |
           | Dep. Var     | fraud
           | TRN % Events | 20.389
           | VAL % Events | 19.281
7.
E.g., 08_M1_VAL_BAGGING: the 8th model of the M1 data set case, on the Validation partition, using Bagging as the modeling technique.
Requested Models: Names & Descriptions.

Model # | Full Model Name               | Model Description
***     | Overall Models                |
-1      | M1                            | Raw 20pct
1       | 01_M1_NSMBL_TRN_AVG           | Ensemble AVG
2       | 02_M1_NSMBL_TRN_LOGISTIC_NONE | Logistic TRN NONE Ensemble
3       | 03_M1_NSMBL_TRN_MED           | Ensemble MED
4       | 04_M1_NSMBL_VAL_AVG           | Ensemble AVG
5       | 05_M1_NSMBL_VAL_LOGISTIC_NONE | Logistic VAL NONE Ensemble
6       | 06_M1_NSMBL_VAL_MED           | Ensemble MED
7       | 07_M1_TRN_BAGGING             | Bagging
8       | 08_M1_TRN_GRAD_BOOSTING       | Gradient Boosting
9       | 09_M1_TRN_LOGISTIC_STEPWISE   | Logistic TRN STEPWISE
10      | 10_M1_TRN_RFORESTS            | Random Forests
11      | 11_M1_TRN_TREES               | Trees
12      | 12_M1_VAL_BAGGING             | Bagging
13      | 13_M1_VAL_GRAD_BOOSTING       | Gradient Boosting
14      | 14_M1_VAL_LOGISTIC_STEPWISE   | Logistic VAL STEPWISE
15      | 15_M1_VAL_RFORESTS            | Random Forests
16      | 16_M1_VAL_TREES               | Trees
8.
For models other than Trees themselves, the modeled posterior probabilities were fit via an interval-valued target variable (this includes logistic and the ensembles).
For simplicity, only the first two levels of each tree are shown.
Notation: M1_GB_TRN_TREES: data M1, tree simulation of the Gradient Boosting run (GB). BG: Bagging, RF: Random Forests, LG: logistic, NSMBL: ensemble.
Intention: obtain a general idea of the tree representation, for comparison to a standard tree model.
Next page: small detail for BG (Bagging), GB (Gradient Boosting), and Trees themselves. Later, a graphical comparison of variables and splits at each tree level.
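The idea of fitting a tree to a black box's posterior probabilities can be illustrated with a hand-rolled depth-1 surrogate. This is a sketch, not the deck's actual procedure; the data and the `best_split` helper are invented for illustration.

```python
import numpy as np

def best_split(X, p):
    """Depth-1 regression-tree surrogate: find the single (feature, threshold)
    pair that best explains black-box posterior probabilities p (minimum SSE).
    This mimics how a tree 'represents' a black-box model's predictions."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = p[X[:, j] <= t], p[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, t, sse)
    return best[0], best[1]

# Toy posterior driven mainly by a 'no_claims'-like count in column 0.
rng = np.random.default_rng(1)
X = np.column_stack([rng.integers(0, 4, 300), rng.normal(size=300)])
p = np.where(X[:, 0] >= 1, 0.6, 0.1) + rng.normal(0, 0.02, 300)

j, t = best_split(X, p)
print(j, t)  # splits on the count feature, between counts 0 and 1
```

A real implementation would recurse on each side to produce the two levels shown in the slides; the surrogate's splitter and threshold are what get compared across models.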
12.
06, the actual Tree. The top splitter is No_claims, but LG splits at 1.5. Note the different event probabilities (bar heights).
14.
RF pursues a different structure search at level 2. See the next slide as well.
25.
Conclusion on tree representations, I
No_claims at 0.5 is certainly the top splitter for most TREE models, but notice that the event probabilities diverge (because RF, GB, and BG model a posterior probability, not a binary event, and thus carry information from previous models). Later splits diverge in predictors and split values across models. LG finds a completely different structure and starts with no_claims at 1.5. Thus, for tree-based models the mere existence of a claim raises suspicion of fraud, while logistic requires a higher threshold.
Ensemble models are mixtures of models; the interpretability typical of a single model is doubtful when reality is complex.
It is important to view each tree model independently to gauge interpretability. Note that the ensemble's primary splitter comes from RF, yet RF is not the best model (it over-fits badly); it is chosen because all the methods minimize misclassification.
And it is important to view these findings in terms of variable importance and "best" model choice.
26.
Conclusion on tree representations, II
Most importantly, it looks like RF wins. Should we stop now? (Validation results are not shown, to add to the suspense.)
DO NOT RUSH YOUR CONCLUSIONS; keep on reading.
27.
Importance Measures for Tree-based Methods.
28.
All methods agree on No_claims; there is much less agreement on the other variables.
29.
For GB and BG all predictors matter; RF disparages num_members, and Trees disparages doctor_visits.
Comparing GB and RF, GB allocates more importance to all predictors (other than no_claims) than the other methods do, which implies that the structure found by RF is simpler.
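One generic way to compute such importances for any black box is permutation importance. Note this is a different measure than the tree split-based importances the deck reports; it is shown here as a minimal sketch with an invented toy model.

```python
import numpy as np

def permutation_importance(predict, X, y, j, metric, n_rep=10, seed=0):
    """Importance of feature j = average drop in `metric` when the link
    between feature j and the target is broken by shuffling column j."""
    rng = np.random.default_rng(seed)
    base = metric(y, predict(X))
    drops = []
    for _ in range(n_rep):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])            # destroy feature j's information
        drops.append(base - metric(y, predict(Xp)))
    return float(np.mean(drops))

# Toy model in which feature 0 matters and feature 1 does not.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(float)
predict = lambda X: 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))
acc = lambda y, p: float(((p > 0.5) == (y > 0.5)).mean())

imp0 = permutation_importance(predict, X, y, 0, acc)
imp1 = permutation_importance(predict, X, y, 1, acc)
print(imp0, imp1)  # large drop for feature 0, zero drop for feature 1
```

Because it only needs predictions, the same function could be run against each of the deck's models to produce comparable importance columns.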
31.
Tree methods find no_claims the most important; logistic finds most predictors important.
Validation results show the effects of over-fitting (variable doctor_visits).
34.
Partial Dependence Plots for Logistic and Gradient Boosting Non-Ensemble Models.
35.
Most important variable; similar shapes in both cases. Note the "logistic"-like shape of one and the jagged shape of the other, plus the flatness at probability 0.8 for values >= 5.
36.
Num_members was eliminated from the logistic stepwise model. GB's jagged relationship here suggests a strong interaction effect with other predictors.
37.
Pair-wise PDPs for GB, for some variables.
38.
Fraud is concentrated on just one or two claims combined with lower membership time (the two most important variables).
39.
Fraud is concentrated on a smaller number of members and a higher number of claims.
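The pair-wise contours come from the same marginalization as the one-dimensional PDPs, only with two features fixed jointly. A minimal sketch, again with an invented `predict` standing in for the fitted GB model:

```python
import numpy as np

def pdp_2d(predict, X, j, k, grid_j, grid_k):
    """Pair-wise PDP: average prediction over the training rows with
    features j and k jointly fixed at each point of a 2-D grid."""
    Z = np.empty((len(grid_j), len(grid_k)))
    for a, vj in enumerate(grid_j):
        for b, vk in enumerate(grid_k):
            Xv = X.copy()
            Xv[:, j] = vj
            Xv[:, k] = vk
            Z[a, b] = predict(Xv).mean()
    return Z          # contour-plot Z against (grid_j, grid_k)

# Toy surface: event probability high when feature 0 is high and feature 1 low,
# loosely mimicking the high-claims / low-duration corner in the slides.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
predict = lambda X: 1 / (1 + np.exp(-(X[:, 0] - X[:, 1] + 0.2 * X[:, 2])))

Z = pdp_2d(predict, X, 0, 1, np.linspace(-2, 2, 5), np.linspace(-2, 2, 5))
print(Z.shape)  # (5, 5); Z[-1, 0] is the high-j / low-k corner
```

Feeding `Z` to any contour-plotting routine reproduces the kind of concentration-at-low-ranges picture the slides describe for No_claims and Member_duration.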
41.
Model #4 (RF) seems best at fitting the event probability once the other predictors' effects are marginalized away for TRN, but the VAL results point to GB instead.
43.
Conclusions on PDPs
1) From the ensemble PDPs, it is obvious that RF fails in validation. The ensemble's power rests strongly on GB, and on logistic with a downward slope.
2) Individual-variable PDPs show a uniform relationship for the variables in logistic, while GB shows fuzzy and nonlinear structures.
3) The contour plots for pairs of variables (GB) allow focusing on the ranges that matter. For instance, No_claims and Member_duration concentrate the important information at the low ends of their respective ranges.
4) Still, it is not possible (at present) to obtain simple interpretable graphs that capture the full complexity of GB models. Logistic models are easier to understand, though not fully easy.
45.
Tree-based methods do not necessarily reach a top probability of 1 or a bottom probability of 0.
51.
Note that the TRN and VAL ranks do not match; models ranked lower in VAL tend to over-fit more.
GOF ranks (TRN), one rank per GOF measure (1 = best):

Model Name                    | AUROC | Avg Sq Error | Cum Lift 3rd bin | Cum Resp Rate 3rd bin | Gini | Rsquare Cramer-Tjur | Unw. Mean | Unw. Median
01_M1_NSMBL_TRN_LOGISTIC_NONE |   1   |      1       |        1         |           1           |  1   |          1          |   1.00    |    1.00
03_M1_TRN_BAGGING             |   4   |      4       |        4         |           4           |  4   |          4          |   4.00    |    4.00
04_M1_TRN_GRAD_BOOSTING       |   3   |      2       |        3         |           3           |  3   |          3          |   2.83    |    3.00
05_M1_TRN_LOGISTIC_STEPWISE   |   6   |      6       |        6         |           6           |  6   |          6          |   6.00    |    6.00
06_M1_TRN_RFORESTS            |   2   |      3       |        2         |           2           |  2   |          5          |   2.67    |    2.00
07_M1_TRN_TREES               |   5   |      5       |        5         |           5           |  5   |          2          |   4.50    |    5.00

GOF ranks (VAL):

Model Name                    | AUROC | Avg Sq Error | Cum Lift 3rd bin | Cum Resp Rate 3rd bin | Gini | Rsquare Cramer-Tjur | Unw. Mean | Unw. Median
02_M1_NSMBL_VAL_LOGISTIC_NONE |   1   |      1       |        2         |           2           |  1   |          1          |   1.33    |    1.00
08_M1_VAL_BAGGING             |   5   |      6       |        5         |           5           |  5   |          5          |   5.17    |    5.00
09_M1_VAL_GRAD_BOOSTING       |   2   |      2       |        1         |           1           |  2   |          2          |   1.67    |    2.00
10_M1_VAL_LOGISTIC_STEPWISE   |   3   |      3       |        4         |           4           |  3   |          4          |   3.50    |    3.50
11_M1_VAL_RFORESTS            |   6   |      5       |        6         |           6           |  6   |          6          |   5.83    |    6.00
12_M1_VAL_TREES               |   4   |      4       |        3         |           3           |  4   |          3          |   3.50    |    3.50
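The Unw. Mean and Unw. Median columns are just row-wise aggregates of the per-measure ranks. The sketch below recomputes them from the VAL table (model names shortened for readability):

```python
import numpy as np

# VAL ranks from the GOF table: rows = models, cols = the six GOF measures
# (AUROC, Avg Sq Error, Cum Lift 3rd bin, Cum Resp Rate 3rd bin, Gini, R2 C-T).
models = ["NSMBL_LOGISTIC", "BAGGING", "GRAD_BOOSTING",
          "LOGISTIC_STEPWISE", "RFORESTS", "TREES"]
ranks = np.array([[1, 1, 2, 2, 1, 1],
                  [5, 6, 5, 5, 5, 5],
                  [2, 2, 1, 1, 2, 2],
                  [3, 3, 4, 4, 3, 4],
                  [6, 5, 6, 6, 6, 6],
                  [4, 4, 3, 3, 4, 3]])

unw_mean = ranks.mean(axis=1)          # unweighted mean rank across measures
unw_median = np.median(ranks, axis=1)  # unweighted median rank
for m, mu, md in zip(models, unw_mean, unw_median):
    print(f"{m:18s} mean={mu:.2f} median={md:.2f}")
```

The ensemble's 1.33 mean and GB's 1.67 fall out directly, which is the basis for the "best overall" vs. "best single model" claim on the next slide.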
52.
Based on this methodology, the ensemble is the winner, and GB is the single best model. Alternative methods for selecting the best model are user-dependent.
61.
Conclusions
At least with the RF defaults used in this presentation, RF has badly over-fitted. The best overall model is the ensemble, and the best single model is Gradient Boosting.
The user should decide which metric to use to judge goodness; here, a simple unweighted ranking of 5 measures was used.
Since there was no financial information, the models could not be measured in terms of profits. The K-S chart (not recommended) shows different cut-off points per model.
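The K-S cut-off mentioned above is the score at which the cumulative distributions of events and non-events are farthest apart. A minimal sketch on simulated scores (the data here are synthetic, not the deck's):

```python
import numpy as np

def ks_statistic(y, p):
    """Kolmogorov-Smirnov separation: the maximum gap between the cumulative
    score distributions of events and non-events. The score at the argmax is
    the model-specific cut-off point the K-S chart displays."""
    order = np.argsort(p)
    y = y[order]
    cum_event = np.cumsum(y) / y.sum()          # CDF of scores among events
    cum_non = np.cumsum(1 - y) / (1 - y).sum()  # CDF among non-events
    gap = np.abs(cum_event - cum_non)
    i = int(gap.argmax())
    return float(gap[i]), float(p[order][i])

# Synthetic scores that are informative about the event indicator.
rng = np.random.default_rng(4)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)

ks, cut = ks_statistic(y, p)
print(ks, cut)  # positive KS separation; cut-off varies model by model
```

Because the argmax depends on each model's score distribution, two models with similar KS values can still imply very different operating cut-offs, which is why the slide flags the chart as unsuited for model selection.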