Slide 2
Studies
2.8.c: Comparison of methods, focusing on whether raw vs. 50/50 re-sampling makes a difference.
Slide 4
Aim: study the performance of fraud models, whose original data contain 20% fraud events, by altering the percentage of events.
Three studies:
M1: 5% events
M2: 20% events (original)
M3: 50% events
The validation data set is a random sample from the original 20% data set and is the same for all three studies.
Battery of models as in the previous study, with similar graphs for evaluation.
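The slides do not show how the three training sets were built; the following is a minimal sketch of one way to do it in Python/pandas, assuming a DataFrame claims with a 0/1 fraud column (the function and variable names are hypothetical, not the author's code).

import pandas as pd

def resample_to_rate(df, target="fraud", event_rate=0.50, random_state=1):
    # Undersample one class so that events make up `event_rate` of the result.
    events = df[df[target] == 1]
    nonevents = df[df[target] == 0]
    n_events_needed = int(len(nonevents) * event_rate / (1 - event_rate))
    if n_events_needed <= len(events):
        events = events.sample(n=n_events_needed, random_state=random_state)
    else:
        n_nonevents_needed = int(len(events) * (1 - event_rate) / event_rate)
        nonevents = nonevents.sample(n=n_nonevents_needed, random_state=random_state)
    return pd.concat([events, nonevents]).sample(frac=1, random_state=random_state)

# Hypothetical usage: one common validation set plus the three training sets.
# claims = pd.read_csv("claims.csv")                     # original data, ~20% fraud
# validata = claims.sample(frac=0.4, random_state=1)     # same VAL set for all studies
# train = claims.drop(validata.index)                    # M2: raw, ~20% events
# train05 = resample_to_rate(train, event_rate=0.05)     # M1: 5% events
# train50 = resample_to_rate(train, event_rate=0.50)     # M3: 50% events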
Slide 5
Model Name   Item            Information
M1           TRN data set    train05
             TRN num obs     954
             VAL data set    validata
             VAL num obs     2365
             TST data set
             TST num obs
             Dep. Var        fraud
             TRN % Events    5.346
             VAL % Events    19.281
M2           TRN data set    train
             TRN num obs     3595
             VAL data set    validata
             VAL num obs     2365
             TST data set
             TST num obs
             Dep. Var        fraud
             TRN % Events    20.389
             VAL % Events    19.281
M3           TRN data set    train50
             TRN num obs     1133
             VAL data set    validata
             VAL num obs     2365
             TST data set
             TST num obs
             Dep. Var        fraud
             TRN % Events    50.838
             VAL % Events    19.281
Slide 6
Requested Models: Names & Descriptions

Model #   Full Model Name                 Model Description
-1        ***                             Overall Models
-10       M1                              Raw 05pct
-10       M2                              Raw 20pct
-10       M3                              50pct
1         01_M1_NSMBL_TRN_LOGISTIC_NONE   Logistic TRN NONE Ensemble
2         02_M1_NSMBL_VAL_LOGISTIC_NONE   Logistic VAL NONE Ensemble
3         03_M1_TRN_BAGGING               Bagging TRN Bagging
4         04_M1_TRN_GRAD_BOOSTING         Gradient Boosting
5         05_M1_TRN_LOGISTIC_STEPWISE     Logistic TRN STEPWISE
6         06_M1_TRN_RFORESTS              Random Forests
7         07_M1_TRN_TREES                 Trees TRN Trees
8         08_M1_VAL_BAGGING               Trees VAL Trees
9         09_M1_VAL_GRAD_BOOSTING         Gradient Boosting
10        10_M1_VAL_LOGISTIC_STEPWISE     Logistic VAL STEPWISE
11        11_M1_VAL_RFORESTS              Random Forests
12        12_M1_VAL_TREES                 Trees VAL Trees
13        13_M2_NSMBL_TRN_LOGISTIC_NONE   Logistic TRN NONE Ensemble
14        14_M2_NSMBL_VAL_LOGISTIC_NONE   Logistic VAL NONE Ensemble
15        15_M2_TRN_BAGGING               Bagging TRN Bagging
16        16_M2_TRN_GRAD_BOOSTING         Gradient Boosting
17        17_M2_TRN_LOGISTIC_STEPWISE     Logistic TRN STEPWISE

And similarly for the rest of M2 and all of M3.
Slide 8
Top split level, nodes 2 and 3. Note: three M1 models (02, 03, 05) split on member_duration, but the corresponding M2 and M3 models split on no_claims; the M1 GB stands alone. Previously, the split was on no_claims only.
Slide 9
Same info, different categorization. Omitted next levels.
Slide 10
Omitted the rest for brevity. Some conclusions:
Extreme imbalance has caused a different initial split variable and therefore a different model structure, as opposed to the more balanced data sets.
In the more balanced cases, even the splitting value has mostly not changed. The probability of the event in the resulting nodes differs because of the different initial event rates.
The difference in splitting variables is not necessarily “BAD”. Note that the sample size for the more imbalanced data sets is smaller.
Slide 12
The M1 models choose a different important variable. The 50/50 trees (M3_TREES) selected just no_claims, while M2_TREES selected three additional predictors. BG, RF and GB are not similarly affected. The M1 trees have no important variables and are the most affected by the imbalance.
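As a hedged illustration (not the code behind the slides) of how such an importance comparison can be produced with scikit-learn, assuming predictor matrices X05, X20, X50 and targets y05, y20, y50 built from the three training samples (hypothetical names):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def importances(X, y):
    # Fit tree-based learners on one training sample and return their
    # impurity-based variable importances side by side.
    models = {
        "TREES": DecisionTreeClassifier(max_depth=3, random_state=1),
        "RFORESTS": RandomForestClassifier(n_estimators=200, random_state=1),
        "GRAD_BOOSTING": GradientBoostingClassifier(random_state=1),
    }
    out = {}
    for name, model in models.items():
        model.fit(X, y)
        out[name] = pd.Series(model.feature_importances_, index=X.columns)
    return pd.DataFrame(out)

# e.g., compare importances(X05, y05), importances(X20, y20), importances(X50, y50)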
Slide 14
The 50/50 resampled model stops earlier, but its VAL misclassification is higher than the raw model's.
Slide 15
Similar results (see the previous slide).
Slide 22
TRN GB has #3 no_claims as the most important variable; the others are flat.
Slide 23
VAL GB repeats no_claims as the most important variable and adds doctor_visits and optom_presc as positive effects. The corresponding logistic does not point to doctor_visits. The corresponding M1 and M3 models are almost identical.
Slide 25
#7 VAL GB is by far the most important.
Slide 26
The curves converge at the event prior. Forests perform very well on TRN but not on VAL.
Slide 27
#28: the VAL logistic works to bring down the positive GB VAL slope.
Slide 33
The M2 VAL ensemble is best, and M1 VAL GB is the best of the single models. The M3 VAL performance of the single models is lackluster.
Slide 35
Conclusion on re-sampling.
In this example, the 50/50 resampled M3 models yielded a smaller tree with no discernible difference in performance from its M2 counterpart. The M1 trees failed to perform, while the other M1 methods performed acceptably well.
Actual performance (for the best models) was not affected by 50/50 versus raw modeling. Extreme imbalance seriously affected the raw trees, but not the other variants.
The overall winner in all cases was GB when evaluated on VAL. Models suffer when the event prior is seriously imbalanced, except for GB.
Slide 37
XGBoost
Developed by Chen and Guestrin (2016): XGBoost: A Scalable Tree Boosting System.
Claims: faster and better than neural networks and Random Forests.
Uses second-order gradients of the loss function, obtained from a Taylor expansion of the loss, plugged into the same boosting algorithm for greater generalization. In addition, it transforms the loss function into a more sophisticated objective function containing regularization terms that penalize tree growth, with the penalty proportional to the size of the node weights, thus preventing overfitting.
More efficient than GB due to parallel computing on a single computer (about 10 times faster).
The algorithm takes advantage of an advanced decomposition of the objective function that allows it to outperform GB.
Not yet available in SAS. Available in R, Julia, Python and as a CLI.
The tool behind many champion models in recent competitions (Kaggle, etc.).
See also Foster's (2017) xgboostExplainer.
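For reference, the second-order approximation and the regularization term from Chen and Guestrin (2016) can be written (up to constants) as

\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2},

where g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}) and h_i = \partial^{2}_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}) are the first and second derivatives of the loss with respect to the previous prediction, T is the number of leaves of the new tree f_t, and w is its vector of leaf weights; \gamma penalizes tree growth and \lambda the size of the leaf weights.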
Slide 39
Comments
1) It is not immediately apparent what the weak classifier should be for GB (e.g., by varying depth in our case). Likewise, the number of iterations is a big issue. In our simple example in the first study, M6 GB was the best performer. Still, overall modeling benefited from ensembling all the methods, as measured by AUROC, cumulative lift, or the ensemble p-values.
2) The posterior probability ranges are vastly different, and thus the tendency to classify observations by the 0.5 threshold is too simplistic (see the prior adjustment sketched below).
3) The PDPs show that different methods find distinct multivariate structures. Interestingly, the ensemble p-values show a decreasing tendency for logistic and trees and a strong S-shaped tendency for M6 GB (first study), which could mean that M6 GB alone tends to overshoot its predictions.
4) GB is relatively unaffected by the 50/50 mixture.
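One standard adjustment, sketched here as an illustration (it is not given in the slides): when a model is trained on a resampled set with event prior \pi_s but the population prior is \pi, its posterior estimate \hat{p} can be corrected before any threshold is applied,

p_{\text{adj}} = \frac{\hat{p}\,\pi/\pi_s}{\hat{p}\,\pi/\pi_s + (1-\hat{p})\,(1-\pi)/(1-\pi_s)}.

For the 50/50 study (\pi_s = 0.5, \pi \approx 0.19), a raw score of 0.5 corresponds to an adjusted probability of about 0.19, which is one reason a fixed 0.5 cutoff is too simplistic.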
Slide 40
Comments
5) While for GB classification problems the predictions lie within [0, 1], for continuous-target problems the predictions can fall outside the range of the target variable, which causes headaches. This is because GB models the residual at each iteration, not the original target; it can lead to surprises, such as negative predictions when Y takes only non-negative values, contrary to the original tree algorithm.
6) The shrinkage parameter and early stopping (number of trees) act as regularizers, but their combined effect is not known and could be ineffective (a sketch follows this list).
7) If the shrinkage is too small and a large number of trees T is allowed, the model becomes large and expensive to compute, implement and understand.
8) Random Forests over-fitted. A larger study should vary its parameters for better validation.
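A minimal sketch of how the two regularizers in comment 6 can be combined, using scikit-learn for illustration (an assumption; it is not the tool used in the slides):

from sklearn.ensemble import GradientBoostingClassifier

# Small shrinkage with a generous cap on the number of trees, letting
# validation-based early stopping pick the effective number of iterations.
gb = GradientBoostingClassifier(
    learning_rate=0.05,       # shrinkage: smaller values need more trees
    n_estimators=2000,        # upper bound; early stopping decides the rest
    n_iter_no_change=25,      # stop after 25 iterations without VAL improvement
    validation_fraction=0.2,  # fraction of TRN held out for early stopping
    max_depth=3,              # depth of the weak learner
    random_state=1,
)
# gb.fit(X_train, y_train)    # hypothetical training data
# gb.n_estimators_            # trees actually fitted after early stopping

Because the two settings interact, the combined effect still has to be checked empirically, e.g., over a small grid of learning_rate values.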
Slide 41
Comments
9) Model interpretation is difficult in the case of BG, RF and GB (and not trivial for the other methods either). The PDPs for the logistic regression variables show monotonic relationships, while those of the GB variables are very nonlinear. PDPs for the other methods were not created.
Slide 42
Drawbacks of GB.
1) IT IS NOT MAGIC: it won't solve ALL modeling needs, but it is the best off-the-shelf tool. You still need to look for transformations, odd issues, missing values, etc.
2) As with all tree methods, categorical variables with many levels (e.g., zip codes) can make it impossible to obtain a model.
3) Memory requirements can be very large, especially with many iterations; a typical problem of ensemble methods.
4) A large number of iterations slows down obtaining predictions; on-line scoring may require a trade-off between complexity and the time available. Once GB is learned, parallelization certainly helps.
5) There is no simple algorithm to capture interactions, because of the base learners.
6) There are no simple rules to determine gamma, the number of iterations or the depth of the simple learner. One needs to try different combinations and possibly recalibrate over time.
7) Still, it is one of the most powerful methods available.
Slide 43
Un-reviewed
Catboost
DeepForest
gcForest
Use of tree methods for continuous target variable.
Naïve-Bayes
Bootstrapping.
…
Slide 44
2.11) References
Auslender L. (1998): Alacart, poor man's classification trees, NESUG.
Breiman L., Friedman J., Olshen R., Stone C. (1984): Classification and Regression Trees, Wadsworth.
Chen T., Guestrin C. (2016): XGBoost: A Scalable Tree Boosting System.
Chipman H., George E., McCulloch R. (2010): BART: Bayesian Additive Regression Trees, The Annals of Applied Statistics.
Foster D. (2017): New R package that makes XGBoost interpretable, https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
Friedman J. (2001): Greedy function approximation: a gradient boosting machine, Annals of Statistics, 29, 1189–1232. doi:10.1214/aos/1013203451
Paluszynska A. (2017): Structural mining and knowledge extraction from random forest with applications to The Cancer Genome Atlas project, https://www.google.com/url?q=https%3A%2F%2Frawgit.com%2FgeneticsMiNIng%2FBlackBoxOpener%2Fmaster%2FrandomForestExplainer_Master_thesis.pdf&sa=D&sntz=1&usg=AFQjCNHTJONZK24LioDeOB0KZnwLkn98fw and https://mi2datalab.github.io/randomForestExplainer/
Quinlan J. R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
Slide 45
Earlier literature on combining methods:
Winkler R. L. and Makridakis S. (1983): The combination of forecasts, J. R. Statist. Soc. A, 146(2), 150-157.
Makridakis S. and Winkler R. L. (1983): Averages of forecasts: some empirical results, Management Science, 29(9), 987-996.
Bates J. M. and Granger C. W. (1969): The combination of forecasts, Operational Research Quarterly, 451-468.
Slide 47
1) Can you explain in nontechnical language the idea of maximum likelihood estimation? Of SVM (unreviewed in class)?
2) Contrast GB with RF.
3) In what way is over-fitting like a glove? Like an umbrella?
4) Would ensemble models always improve on individual models?
5) Would you select variables by way of tree methods to use later in linear methods? Yes? No? Why?
6) In Tree regression, final predictions are means. Could better
predictions be obtained by regression model instead? A logistic for a
binary target? Discuss.
7) There are 9 coins, 8 of which are of equal weight, and there’s one
scale. How many steps until you identify the odd coin?
8) Why are manhole covers round?
9) You obtain 100% accuracy in validation of classification model. Are
you a genius? Yes, no, why?
10) If 85% of witnesses saw a blue car during the accident and 15% saw a red car, what is the probability that the car is blue?
Slide 48
Counter-interview questions (you ask the interviewer).
1) How do you measure the height of a building with just a barometer? Give at least three answers.
2) Two players, A and B, take turns saying a positive integer from 1 to 9. The numbers are added up; whoever reaches 100 or above loses. Is there a strategy to never lose? (Aborting the game midway is acceptable, but give your reasoning.)
3) There are two jugs, one that holds 5 gallons and the other 3, and a nearby water fountain. How do you put exactly 4 gallons (a deviation of less than one ounce is fine) in the 5-gallon jug?