Leonardo Auslender – Copyright 2004 / Copyright 2018
10/7/2019
Two studies:
2.8.b: Raw data, GB without constraints on its parameters, compared to competing methods.
2.8.c: Comparison of methods, focusing on whether raw vs. 50/50 re-sampling makes a difference.
Partial Dependency Plots (PDP).
Due to GB's (and other methods') black-box nature, we need tools to study model structure: these plots show the mean model score at each value of predictor X, with that value matched to the entire data set of the remaining predictors.
Graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome.
Formally, the PDP of F(x1, x2, …, xp) on X is E(F) over all variables except X. Thus, for a given value of X, the PDP is the average of the training-set predictions with X held constant.
Since GB, Boosting, Bagging, etc. are BLACK BOX models, PDPs are used to obtain model interpretation. They are also useful for logistic models.
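The definition above can be sketched in a few lines: for each grid value of the predictor of interest, substitute that value into every row, score the model, and average. All names here (partial_dependence, DummyModel, x1, x2) are illustrative, not from this deck.

```python
# Minimal sketch of computing a PDP "by hand".
import numpy as np
import pandas as pd

def partial_dependence(model, X, feature, grid):
    """Average model score at each fixed value of `feature`."""
    averages = []
    for v in grid:
        X_mod = X.copy()
        X_mod[feature] = v                    # hold the feature constant at v
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

class DummyModel:
    """Stand-in scorer: logistic in x1 only, so the PDP traces a sigmoid."""
    def predict_proba(self, X):
        p = 1.0 / (1.0 + np.exp(-X["x1"].to_numpy()))
        return np.column_stack([1.0 - p, p])

X = pd.DataFrame({"x1": [0.2, -1.0, 0.5], "x2": [1.0, 2.0, 3.0]})
pdp = partial_dependence(DummyModel(), X, "x1", grid=[-2.0, 0.0, 2.0])
# pdp[1] == 0.5, since sigmoid(0) = 0.5 for every row
```

The same loop works for any scorer with a predict_proba-style interface, which is why PDPs apply equally to GB, Bagging, or logistic models.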
Modifications of Partial Dependency Plots (PDP).
In a PDP, each value of X (or pair of X's) is matched to every observation of the complementary X's, scores (predicted Y's) are obtained and then averaged for each X value. It is known that when predictors are correlated, PDPs are not informative. Ergo, partial out effects as in the following possible options:
1) Obtain Q1 and Q3 in addition to the average value, to verify the stability of the average.
2) As in linear regression, create models from selected variables but with fully orthogonalized predictors (method proposed by yours truly). Could be called Partialized PDP, or PPDP.
3) When obtaining 3-D PDPs for pairs of variables, obtain Marginal PDPs, i.e., the average probability at each var2 point along the var1 range. The reason is that 3-D plots typically extrapolate into low-density areas ➔ misleading local curves are possible.
4) I(ndividual) C(onditional) E(xpectation) plots: PDP curves for individual observations or groups of observations. Grouping is user-defined: quantiles of posterior probability, a clustering solution, levels of a given variable, specific individuals, etc.
Analytical problem to investigate.
Optical health care insurance fraud. Longer care typically involves higher treatment costs, and the insurance company has to set up reserves as soon as a case is opened. Sometimes doctors are involved in fraud.
Aim: predict fraudulent charges ➔ classification problem; use a battery of models and compare them. Below, the original data (M1 models). Focus is on comparisons across models (see earlier chapters for individual model analytics). For brevity's sake, mean and median ensembles are omitted.
Model M1 — data summary:

  Item            Information
  TRN data set    train
  TRN num obs     3595
  VAL data set    validata
  VAL num obs     2365
  TST data set    —
  TST num obs     —
  Dep. Var        fraud
  TRN % Events    20.389
  VAL % Events    19.281
E.g., 08_M1_VAL_BAGGING: the 8th model of the M1 data set case, on Validation data, using Bagging as the modeling technique.
Requested Models: Names & Descriptions

  #  Full Model Name                 Model Description
 --  Overall Models                  M1, 20 pct prior
  1  01_M1_GB_TRN_TREES              Tree repr. of Gradient Boosting
  2  02_M1_LG_TRN_TREES              Tree repr. of Logistic STEPWISE
  3  03_M1_NSMBL_LG_TRN_TREES        Tree repr. of Logistic NONE ensemble
  4  04_M1_TRN_BAGGING               Bagging, TRN
  5  05_M1_TRN_GRAD_BOOSTING         Gradient Boosting, TRN
  6  06_M1_TRN_LOGISTIC_NONE_NSMBL   Logistic NONE ensemble, TRN
  7  07_M1_TRN_LOGISTIC_STEPWISE     Logistic STEPWISE, TRN
  8  08_M1_TRN_NSMBL_AVG             Ensemble AVG, TRN
  9  09_M1_TRN_NSMBL_MED             Ensemble MED, TRN
 10  10_M1_TRN_RFORESTS              Random Forests, TRN
 11  11_M1_TRN_TREES                 Trees, TRN
 12  12_M1_VAL_BAGGING               Bagging, VAL
 13  13_M1_VAL_GRAD_BOOSTING         Gradient Boosting, VAL
 14  14_M1_VAL_LOGISTIC_NONE_NSMBL   Logistic NONE ensemble, VAL
 15  15_M1_VAL_LOGISTIC_STEPWISE     Logistic STEPWISE, VAL
 16  16_M1_VAL_NSMBL_AVG             Ensemble AVG, VAL
 17  17_M1_VAL_NSMBL_MED             Ensemble MED, VAL
 18  18_M1_VAL_RFORESTS              Random Forests, VAL
 19  19_M1_VAL_TREES                 Trees, VAL
For models other than Trees themselves, posterior probabilities were modeled via an interval-valued target variable (this includes logistic and ensembles).
For simplicity, just the first 2 levels of the trees are shown.
Notation: M1_GB_TRN_TREES: data M1, Tree simulation of a Gradient Boosting run (GB). BG: Bagging, RF: Random Forests, LG: logistic, NSMBL: ensemble.
Intention: obtain a general idea of the tree representation for comparison to the standard tree model.
Next page: small detail for BG (Bagging), GB (Gradient Boosting) and Trees themselves. Later, a graphical comparison of variables + splits at each tree level.
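The tree-representation idea can be sketched as follows: fit a shallow tree to the black-box model's predicted probabilities rather than to the 0/1 events. Below, a single best-split stump in plain NumPy on made-up data (the function name, data, and numbers are all illustrative; the deck's runs use 2-level trees).

```python
# Sketch: find the one split of x that best fits predicted probabilities p.
import numpy as np

def best_split(x, p):
    """Threshold on x that most reduces squared error of p (regression stump)."""
    order = np.argsort(x)
    x_s, p_s = x[order], p[order]
    best = (np.inf, None)
    for i in range(1, len(x_s)):
        if x_s[i] == x_s[i - 1]:
            continue                          # no split between equal values
        left, right = p_s[:i], p_s[i:]
        sse = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if sse < best[0]:
            best = (sse, (x_s[i - 1] + x_s[i]) / 2.0)
    return best[1]

# Illustrative data: probability jumps once no_claims exceeds 0
no_claims = np.array([0, 0, 0, 1, 1, 2], dtype=float)
prob = np.array([0.05, 0.10, 0.08, 0.60, 0.70, 0.80])
split = best_split(no_claims, prob)   # recovers the 0.5 threshold
```

Applied recursively to each side, this yields the 2-level tree representations compared in the following slides.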
06: the actual Tree. Top splitter is No_claims, but LG splits at 1.5. Note the different event probabilities (bar heights).
RF pursues a different structure search at level 2. See the next slide as well.
Conclusion on tree representations, I
No_claims at 0.5 is certainly the top splitter for most TREE models, but notice that event probabilities diverge (because RF, GB and BG model posterior probability, not a binary event, and thus carry information from previous models). Later splits diverge in predictors and split values across models. LG finds a completely different structure and starts with no_claims at 1.5. Thus, for tree-based models, the existence of a claim raises suspicion of fraud, while for logistic a higher threshold is required.
Ensemble models are a mixture of models ➔ the typical interpretability of a single model is doubtful when reality is complex.
It is important to view each tree model independently to gauge interpretability. Note that the ensemble's primary splitter comes from RF, but RF is not the best model (it over-fits badly); it is chosen because all methods minimize misclassification.
And it is important to view these findings in terms of variable importance and "best" model choice.
Conclusion on tree representations, II
Most importantly, it looks like RF wins; should we stop now? (Validation results not shown, to add to the suspense.)
DO NOT RUSH YOUR CONCLUSIONS, and keep on reading.
Importance Measures for Tree-based Methods.
All methods agree on No_claims, but not so much on the other variables.
For GB and BG all predictors matter; RF disparages num_members, and Trees doctor_visits. Comparing GB and RF: GB allocates more importance to all predictors (other than no_claims) than the other methods do, which implies that the structure found by RF is simpler.
Most important variable; similar shapes in both cases. Note the "logistic"-like shape of one and the jagged shape of the other, plus the flatness at probability 0.8 for values >= 5.
Num_members was eliminated from the logistic stepwise model. GB's jagged relationship ➔ there is a strong interaction effect with other predictors.
Pair-wise PDPs for some variables.
Fraud is concentrated at lower membership time. TRN Stepwise Logistic (left), GB (right); correlation = 0.02846. Similar but not identical.
Fraud is concentrated at a smaller number of members and a higher number of claims. GB.
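A pair-wise PDP like this one extends the single-variable loop to a grid over two predictors, averaging the model scores at each grid cell. DummyModel and all data below are illustrative stand-ins, not the deck's fraud data.

```python
# Sketch of a pair-wise (2-D) partial dependence surface.
import numpy as np
import pandas as pd

def partial_dependence_2d(model, X, f1, f2, grid1, grid2):
    """Average model score over the data at each (v1, v2) grid cell."""
    surface = np.empty((len(grid1), len(grid2)))
    for i, v1 in enumerate(grid1):
        for j, v2 in enumerate(grid2):
            X_mod = X.copy()
            X_mod[f1], X_mod[f2] = v1, v2    # fix both features at the cell
            surface[i, j] = model.predict_proba(X_mod)[:, 1].mean()
    return surface

class DummyModel:
    """Illustrative scorer: risk rises with claims, falls with members."""
    def predict_proba(self, X):
        z = (X["no_claims"] - X["num_members"] + 0.1 * X["x3"]).to_numpy()
        p = 1.0 / (1.0 + np.exp(-z))
        return np.column_stack([1.0 - p, p])

X = pd.DataFrame({"no_claims": [0.0, 0.0], "num_members": [0.0, 0.0],
                  "x3": [-1.0, 1.0]})
surface = partial_dependence_2d(DummyModel(), X, "no_claims", "num_members",
                                grid1=[0.0, 1.0], grid2=[0.0, 1.0])
# surface rises along the no_claims axis and falls along num_members
```

The surface matrix is what a contour plot displays; the marginal PDPs of option 3 earlier are its row or column averages.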
Model #4 (RF) seems best at fitting the event probability once the other predictors' effects are marginalized away for TRN, but the VAL results point to GB instead.
Conclusions on PDPs
1) From the ensemble PDPs, it is obvious that RF fails in validation. All the ensemble power rests strongly on GB, and on logistic with a downward slope.
2) Individual-variable PDPs show a uniform relationship for the variables in logistic, while GB shows fuzzy and nonlinear structures.
3) The contour plots for pairs of variables (GB) allow us to focus on ranges of importance. For instance, No_claims and Member_duration concentrate important information at the low ends of their respective ranges.
4) Still, it is not possible (at present) to obtain simple interpretable graphs that capture the full complexity of GB models. Logistic models are easier to understand, though not fully easy.
Tree-based methods do not necessarily reach a top probability of 1 or a lowest of 0.
Note that TRN and VAL ranks do not match. Models ranked lower on VAL tend to overfit more.
GOF ranks (TRN): each model's rank on nine GOF measures, plus the unweighted mean and median of those ranks.

Model Name                     AUROC  ASE  Class  CumLift  CumResp  Gini  P-R  Prec  Tjur | Unw.  Unw.
                                            Rate  3rd bin  Rate 3rd       AUC  Rate   R2  | Mean  Median
02_M1_TRN_BAGGING                  6    8      8        6        6     6    8     4     8 | 6.67       6
03_M1_TRN_GRAD_BOOSTING            3    3      7        5        5     3    3     2     6 | 4.11       3
04_M1_TRN_LOGISTIC_NONE_NSMBL      1    1      4        2        2     1    1     5     1 | 2.00       1
05_M1_TRN_LOGISTIC_STEPWISE        8    7      3        8        8     8    6     8     7 | 7.00       8
06_M1_TRN_NSMBL_AVG                4    5      1        3        3     4    5     7     4 | 4.00       4
07_M1_TRN_NSMBL_MED                5    4      2        4        4     5    4     6     5 | 4.33       4
08_M1_TRN_RFORESTS                 2    2      5        1        1     2    2     1     3 | 2.11       2
09_M1_TRN_TREES                    7    6      6        7        7     7    7     3     2 | 5.78       7

(ASE = Avg Square Error; CumLift 3rd bin = Cumulative Lift, 3rd bin; CumResp Rate 3rd = Cumulative Response Rate, 3rd bin; Tjur R2 = Cramer/Tjur R-square.)
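The mean and median columns are just unweighted aggregates of each model's per-measure ranks; a quick check using two rows copied from the TRN table:

```python
# Verify the unweighted mean/median rank columns of the GOF table.
import numpy as np

ranks = {  # nine GOF-measure ranks per model, from the TRN table
    "03_M1_TRN_GRAD_BOOSTING": [3, 3, 7, 5, 5, 3, 3, 2, 6],
    "08_M1_TRN_RFORESTS":      [2, 2, 5, 1, 1, 2, 2, 1, 3],
}
summary = {m: (round(float(np.mean(r)), 2), float(np.median(r)))
           for m, r in ranks.items()}
# 03_M1_TRN_GRAD_BOOSTING -> (4.11, 3.0); 08_M1_TRN_RFORESTS -> (2.11, 2.0)
```

Unweighted aggregation treats all nine measures as equally important; a user who cares mainly about, say, lift could weight the ranks instead.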
GOF ranks (VAL): same nine GOF measures, with unweighted mean and median of the ranks.

Model Name                     AUROC  ASE  Class  CumLift  CumResp  Gini  P-R  Prec  Tjur | Unw.  Unw.
                                            Rate  3rd bin  Rate 3rd       AUC  Rate   R2  | Mean  Median
10_M1_VAL_BAGGING                  5    6      7        5        5     5    6     3     7 | 5.44       5
11_M1_VAL_GRAD_BOOSTING            2    2      5        1        1     2    2     1     2 | 2.00       2
12_M1_VAL_LOGISTIC_NONE_NSMBL      1    1      4        2        2     1    1     4     1 | 1.89       1
13_M1_VAL_LOGISTIC_STEPWISE        6    5      3        6        6     6    5     7     4 | 5.33       6
14_M1_VAL_NSMBL_AVG                3    3      1        3        3     3    3     6     6 | 3.44       3
15_M1_VAL_NSMBL_MED                4    4      2        4        4     4    4     5     5 | 4.00       4
16_M1_VAL_RFORESTS                 8    8      8        8        8     8    8     8     8 | 8.00       8
17_M1_VAL_TREES                    7    7      6        7        7     7    7     2     3 | 5.89       7
Based on this methodology, the ensemble is the winner, and GB is the single best model. Alternative selection methods for best models are user-dependent; below is just one approach.
Conclusions
At least with the present defaults of RF in this presentation, it has badly over-fitted. The best overall model is the ensemble, and the best single model is given by Gradient Boosting.
The user should decide which metric to use for judging goodness. Here, a simple unweighted ranking of nine measures was used.
Since there was no financial information, models could not be measured in terms of profits. The K-S chart (not recommended) shows different cut-off points per model.