Leonardo Auslender Copyright 2004
Leonardo Auslender – Copyright 2018 1
7/1/2018
Ensemble models and
Gradient Boosting, part 3.
Leonardo Auslender
Independent Statistical Consultant
Leonardo ‘dot’ Auslender ‘at’
Gmail ‘dot’ com.
Copyright 2018.
Studies
2.8.c: Comparison of methods, focusing on whether raw vs. 50/50
re-sampling makes a difference.
Aim: study the performance of fraud models, originally with 20%
fraud events, by altering the percentage of events.
3 studies:
M1: 5% events
M2: 20% events (original)
M3: 50% events.
The validation data set is a random sample from the original 20% data
set for all three studies.
Battery of models as in the previous study, with similar graphs for
evaluation.
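The three training sets can be produced by re-sampling the original data. A minimal sketch, assuming a pandas DataFrame with a 0/1 fraud column (the column name, rates, and function name are assumptions, not from the deck):

```python
import pandas as pd

def resample_events(df, target="fraud", event_rate=0.50, seed=42):
    # Keep all events; downsample non-events until events make up
    # roughly `event_rate` of the returned training data.
    events = df[df[target] == 1]
    nonevents = df[df[target] == 0]
    n_non = int(round(len(events) * (1 - event_rate) / event_rate))
    nonevents = nonevents.sample(n=min(n_non, len(nonevents)),
                                 random_state=seed)
    out = pd.concat([events, nonevents])
    return out.sample(frac=1, random_state=seed)  # shuffle rows
```

For the 5% set (train05) one would instead downsample the events; the validation set is drawn once from the original 20% data and reused across M1–M3.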
Model  Item          Information
M1     TRN data set  train05
       TRN num obs   954
       VAL data set  validata
       VAL num obs   2365
       TST data set
       TST num obs
       Dep. Var      fraud
       TRN % Events  5.346
       VAL % Events  19.281
M2     TRN data set  train
       TRN num obs   3595
       VAL data set  validata
       VAL num obs   2365
       TST data set
       TST num obs
       Dep. Var      fraud
       TRN % Events  20.389
       VAL % Events  19.281
M3     TRN data set  train50
       TRN num obs   1133
       VAL data set  validata
       VAL num obs   2365
       TST data set
       TST num obs
       Dep. Var      fraud
       TRN % Events  50.838
       VAL % Events  19.281
Requested Models: Names & Descriptions.

Model #  Full Model Name                Model Description
         Overall Models:
         M1                             Raw 05pct
         M2                             Raw 20pct
         M3                             50pct
 1       01_M1_NSMBL_TRN_LOGISTIC_NONE  Logistic TRN NONE Ensemble
 2       02_M1_NSMBL_VAL_LOGISTIC_NONE  Logistic VAL NONE Ensemble
 3       03_M1_TRN_BAGGING              Bagging TRN Bagging
 4       04_M1_TRN_GRAD_BOOSTING        Gradient Boosting
 5       05_M1_TRN_LOGISTIC_STEPWISE    Logistic TRN STEPWISE
 6       06_M1_TRN_RFORESTS             Random Forests
 7       07_M1_TRN_TREES                Trees TRN Trees
 8       08_M1_VAL_BAGGING              Bagging VAL Bagging
 9       09_M1_VAL_GRAD_BOOSTING        Gradient Boosting
10       10_M1_VAL_LOGISTIC_STEPWISE    Logistic VAL STEPWISE
11       11_M1_VAL_RFORESTS             Random Forests
12       12_M1_VAL_TREES                Trees VAL Trees
13       13_M2_NSMBL_TRN_LOGISTIC_NONE  Logistic TRN NONE Ensemble
14       14_M2_NSMBL_VAL_LOGISTIC_NONE  Logistic VAL NONE Ensemble
15       15_M2_TRN_BAGGING              Bagging TRN Bagging
16       16_M2_TRN_GRAD_BOOSTING        Gradient Boosting
17       17_M2_TRN_LOGISTIC_STEPWISE    Logistic TRN STEPWISE

And similarly for the rest of M2 and all of M3.
Top split level, nodes 2 and 3. Note: three M1 models (02, 03, 05) split on member_duration,
but the corresponding M2 and M3 models split on no_claims; M1 GB is the lone exception.
Previously, the first split was just on No_claims.
Same information, different categorization. Next levels omitted.
Rest omitted for brevity. Some conclusions:
Extreme imbalance caused a different initial split variable, and
therefore a different model structure, compared with the more balanced
data sets.
In the more balanced cases, even the splitting value has mostly not
changed. The probability of event in the resulting nodes differs because
of the different initial event rates.
The difference in splitting variables is not necessarily “BAD”. Note that
the sample size for the more imbalanced data sets is smaller.
M1 models choose a different important variable. Among the trees, M3_TREES (50/50) selected
just no_claims, while M2_TREES selected three additional predictors. BG, RF and GB are not
similarly affected. M1 trees have no important variables, and are the most affected by imbalance.
The 50/50 model stops earlier, but its validation misclassification is higher than for the raw model.
Similar results (see previous slide)
Logistic shows a monotonically increasing relationship, while GB is more jagged though
still increasing. Just one variable shown. The unbalanced M1 case is seriously affected in
comparison. No_claims clearly has a positive effect on the probability of fraud.
M1 logistic suffers due to event imbalance.
No_claims and Optom_presc positively associated with prob Fraud (Training).
No_claims and Optom_presc positively associated with prob Fraud (Validation).
TRN GB has No_claims (# 3) as most important; the others are flat.
VAL GB repeats No_claims as most important, and adds doctor_visits and optom_presc
as positive effects. The corresponding logistic does not point to doctor_visits.
The corresponding M1 and M3 are almost identical.
# 7 VAL GB is by far the most important.
Confluence of curves at the point of the event prior. Forests perform
very well at TRN but not at VAL.
# 28: VAL logistic works to bring down the positive GB VAL slope.
Class imbalance shifts the curve down, with a flatter slope.
Val results for RF in ensemble models are flat.
Goodness of Fit and Model Selection.
GOF ranks, VAL.
Ranks across six GOF measures (AUROC; ASE = Avg Square Error;
Lift3 = Cum Lift, 3rd bin; Resp3 = Cum Resp Rate, 3rd bin; Gini;
R2 = R-square Cramer/Tjur), plus unweighted mean and median rank.

Model Name                     AUROC  ASE  Lift3  Resp3  Gini  R2  Unw.Mean  Unw.Median
02_M1_NSMBL_VAL_LOGISTIC_NONE      6    3      6      6     6   3      5.00        6.00
08_M1_VAL_BAGGING                 14   13     15     15    14  14     14.17       14.00
09_M1_VAL_GRAD_BOOSTING            5    4      1      1     5   4      3.33        4.00
10_M1_VAL_LOGISTIC_STEPWISE       16   11     16     16    16  17     15.33       16.00
11_M1_VAL_RFORESTS                13   16     14     14    13  15     14.17       14.00
12_M1_VAL_TREES                   18   18     18     18    18  18     18.00       18.00
14_M2_NSMBL_VAL_LOGISTIC_NONE      1    1      2      2     1   1      1.33        1.00
20_M2_VAL_BAGGING                  9    7      9      9     9  12      9.17        9.00
21_M2_VAL_GRAD_BOOSTING            3    5      3      3     3   5      3.67        3.00
22_M2_VAL_LOGISTIC_STEPWISE       10    8     11     11    10  10     10.00       10.00
23_M2_VAL_RFORESTS                17   12     17     17    17  16     16.00       17.00
24_M2_VAL_TREES                   12   10     10     10    12   8     10.33       10.00
26_M3_NSMBL_VAL_LOGISTIC_NONE      2    2      4      4     2   2      2.67        2.00
32_M3_VAL_BAGGING                  8   14      8      8     8   7      8.83        8.00
33_M3_VAL_GRAD_BOOSTING            4    6      5      5     4   6      5.00        5.00
34_M3_VAL_LOGISTIC_STEPWISE       11    9     12     12    11  11     11.00       11.00
35_M3_VAL_RFORESTS                 7   15      7      7     7   9      8.67        7.00
36_M3_VAL_TREES                   15   17     13     13    15  13     14.33       14.00
The M2 VAL ensemble is best, and M1 VAL GB is the best single model.
M3 VAL performance of the single models is lackluster.
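The "Cum Lift 3rd bin" and "Cum Resp Rate 3rd bin" measures used in the ranking can be sketched as follows (a minimal sketch; deciles and the function name are assumptions):

```python
import numpy as np

def cum_lift(y, p_hat, bins=10, k=3):
    # Rank observations by score, take the top k of `bins` equal groups,
    # and compare their event rate to the overall event rate.
    y = np.asarray(y, dtype=float)
    order = np.argsort(-np.asarray(p_hat))   # highest scores first
    top = y[order][: int(len(y) * k / bins)]
    return top.mean() / y.mean()
```

The cumulative response rate at the 3rd bin is the numerator alone, top.mean().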
Conclusion on re-sampling.
In this example, the 50/50 resampled M3 models yielded a smaller
tree with no discernible difference in performance from its M2
counterpart. M1 trees failed to perform, while the other M1 methods
performed acceptably well.
Actual performance (for the best models) was not affected by 50/50
versus raw modeling. Extreme imbalance seriously affected raw trees,
but not the other variants.
The overall winner in all cases, evaluated at VAL, was GB. Models
suffer when the event prior is seriously imbalanced, except for GB.
XGBoost
Developed by Chen and Guestrin (2016): XGBoost: A Scalable Tree
Boosting System.
Claims: faster and better than neural networks and Random Forests.
Uses second-order gradients of the loss function, based on a Taylor expansion, plugged
into the same algorithm for greater generalization. In addition, it transforms the loss
function into a more sophisticated objective containing regularization terms that
penalize tree growth, with the penalty proportional to the size of the node weights,
thus preventing overfitting.
More efficient than GB due to parallel computing on a single machine (about 10 times faster).
The algorithm takes advantage of an advanced decomposition of the objective function that
allows it to outperform GB.
Not yet available in SAS. Available in R, Julia, Python, CLI.
Used in many champion models in recent competitions (Kaggle, etc.).
See also Foster’s (2017) xgboostExplainer.
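The second-order machinery shows up in XGBoost's closed-form leaf weight: with gradients g_i and hessians h_i of the loss at the current predictions, and an L2 penalty lambda on leaf weights, the optimal leaf weight is -sum(g) / (sum(h) + lambda). A minimal numeric sketch for logistic loss (the data are made up):

```python
import numpy as np

# Logistic loss: g_i = p_i - y_i (gradient), h_i = p_i * (1 - p_i) (hessian)
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.4, 0.3, 0.6, 0.5, 0.2])      # current predicted probabilities
g = p - y
h = p * (1 - p)

lam = 1.0                                     # L2 penalty on leaf weights
w_star = -g.sum() / (h.sum() + lam)           # optimal leaf weight (Newton step)
gain = 0.5 * g.sum() ** 2 / (h.sum() + lam)   # loss reduction credited to this leaf
```

Larger lambda shrinks leaf weights toward zero, which is the tree-growth penalty mentioned above.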
Comments
1) It is not immediately apparent what the weak classifier for GB
should be (e.g., set by varying depth in our case). Likewise, the number
of iterations is a big issue. In our simple example in the first study,
M6 GB was the best performer. Still, overall modeling benefited from
ensembling all methods, as measured by AUROC, Cum Lift, or ensemble p-values.
2) The posterior probability ranges are vastly different, and thus the
tendency to classify observations by the .5 threshold is too simplistic.
3) The PDPs show that different methods find distinct multivariate
structures. Interestingly, the ensemble p-values show a decreasing
tendency for logistic and trees, and a strong S-shaped tendency
for M6 GB (first study), which could mean that M6 GB alone tends
to overshoot its predictions.
4) GB is relatively unaffected by the 50/50 mixture.
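On comment 2: instead of the fixed 0.5 cutoff, one can classify against the event prior, or flag a fixed top fraction of observations ranked by score. A sketch (the function names are made up):

```python
import numpy as np

def classify_by_prior(p_hat, prior):
    # Flag observations whose score exceeds the training event rate.
    return (np.asarray(p_hat) >= prior).astype(int)

def classify_top_q(p_hat, q):
    # Flag the top q fraction of observations ranked by score,
    # matching the expected number of events instead of a fixed cutoff.
    p_hat = np.asarray(p_hat)
    cutoff = np.quantile(p_hat, 1 - q)
    return (p_hat >= cutoff).astype(int)
```

Either choice makes classifications comparable across models whose posterior ranges differ.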
Comments (cont.)
5) While for GB classification problems predictions are within [0, 1], for
continuous-target problems predictions can go beyond the range of the
target variable, which causes headaches.
This is because GB models the residual at each iteration, not the
original target; this can lead to surprises, such as negative predictions
when Y takes only non-negative values, contrary to the original
tree algorithm.
6) The shrinkage parameter and early stopping (# of trees) act as regularizers,
but their combined effect is not known and could be ineffective.
7) If shrinkage is too small and a large T is allowed, the model is large and
expensive to compute, implement and understand.
8) Random Forests over-fitted. A larger study should incorporate changes
in its parameters for better validation.
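The out-of-range behavior in comment 5 can be seen in a tiny hand-rolled example: two squared-loss boosting stages whose leaf corrections, each fit to residuals, stack on a feature combination never seen in training (the data are made up for illustration):

```python
# Three training points over two binary features; all targets non-negative.
train = [((0, 0), 0.0), ((0, 1), 2.0), ((1, 0), 8.0)]

F0 = sum(y for _, y in train) / len(train)     # initial fit: mean of y

# Stage 1: stump on feature 1; each leaf predicts the mean residual in it.
r1 = {x: y - F0 for x, y in train}
leaf1 = {0: (r1[(0, 0)] + r1[(0, 1)]) / 2,     # f1 == 0 leaf
         1: r1[(1, 0)]}                         # f1 == 1 leaf

# Stage 2: stump on feature 2, fit to the remaining residuals.
r2 = {x: y - (F0 + leaf1[x[0]]) for x, y in train}
leaf2 = {0: (r2[(0, 0)] + r2[(1, 0)]) / 2,     # f2 == 0 leaf
         1: r2[(0, 1)]}                         # f2 == 1 leaf

def predict(x):
    # Boosted prediction: initial fit plus both stage corrections.
    return F0 + leaf1[x[0]] + leaf2[x[1]]

# The unseen combination (1, 1) receives both large positive corrections,
# so its prediction lands above max(y) = 8.
print(predict((1, 1)))
```

A single tree can never leave the range of y (leaves average observed targets); the boosted sum can, because nothing constrains the stacked corrections.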
Comments (cont.)
9) Model interpretation is difficult in the case of BG, RF and GB (and not
trivial for the other methods either). PDPs for logistic regression variables
show monotonic relationships, while those of GB variables are very
nonlinear. PDPs for the other methods were not created.
Drawbacks of GB.
1) IT IS NOT MAGIC: it won't solve ALL modeling needs, but it is the
best off-the-shelf tool. We still need to look for transformations,
odd issues, missing values, etc.
2) As with all tree methods, categorical variables with many levels can
make it impossible to obtain a model, e.g., zip codes.
3) Memory requirements can be very large, especially with many
iterations: a typical problem of ensemble methods.
4) A large number of iterations means slow prediction, so on-line
scoring may require a trade-off between complexity and the time
available. Once GB is learned, parallelization certainly helps.
5) No simple algorithm to capture interactions, because of the base-
learners.
6) No simple rules to determine gamma (shrinkage), the # of iterations,
or the depth of the simple learner. One needs to try different
combinations and possibly recalibrate over time.
7) Still, one of the most powerful methods available.
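On drawback 6: a common workaround is a small grid search over shrinkage, number of iterations, and weak-learner depth, scored on held-out folds. A sketch using scikit-learn's gradient boosting (an assumption for illustration; the deck's models were not built this way, and the data here are synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced binary data standing in for the fraud set.
X, y = make_classification(n_samples=300, weights=[0.8], random_state=0)

grid = {
    "learning_rate": [0.05, 0.1],   # shrinkage (gamma above)
    "n_estimators": [50, 100],      # number of boosting iterations
    "max_depth": [1, 2],            # depth of the weak learner
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_)
```

Scoring by AUROC on cross-validation folds mirrors the GOF ranking used earlier in the deck.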
Un-reviewed
Catboost
DeepForest
gcForest
Use of tree methods for continuous target variable.
Naïve-Bayes
Bootstrapping.
…
2.11) References
Auslender, L. (1998): Alacart, poor man’s classification trees, NESUG.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984): Classification and Regression Trees, Wadsworth.
Chen, T., Guestrin, C. (2016): XGBoost: A Scalable Tree Boosting System.
Chipman, H., George, E., McCulloch, R.: BART: Bayesian Additive Regression Trees, The Annals of
Applied Statistics.
Foster, D. (2017): New R package that makes XGBoost interpretable, https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
Friedman, J. (2001): Greedy function approximation: a gradient boosting machine, The Annals of
Statistics, 29, 1189–1232. doi:10.1214/aos/1013203451
Paluszynska, A. (2017): Structural mining and knowledge extraction from random forest with
applications to The Cancer Genome Atlas project
(https://www.google.com/url?q=https%3A%2F%2Frawgit.com%2FgeneticsMiNIng%2FBlackBoxOpener%2Fmaster%2FrandomForestExplainer_Master_thesis.pdf&sa=D&sntz=1&usg=AFQjCNHTJONZK24LioDeOB0KZnwLkn98fw and https://mi2datalab.github.io/randomForestExplainer/)
Quinlan, J. Ross (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
Earlier literature on combining methods:
Winkler, R.L. and Makridakis, S. (1983): The combination of
forecasts. J. R. Statist. Soc. A, 146(2), 150–157.
Makridakis, S. and Winkler, R.L. (1983): Averages of
forecasts: some empirical results. Management Science,
29(9), 987–996.
Bates, J.M. and Granger, C.W. (1969): The combination of
forecasts. Operational Research Quarterly, 20, 451–468.
1) Can you explain in nontechnical language the idea of
maximum likelihood estimation? Of SVM (unreviewed in class)?
2) Contrast GB with RF.
3) In what way is over-fitting like a glove? Like an umbrella?
4) Would ensemble models always improve on individual models?
5) Would you select variables by way of tree methods to use later in
linear methods? Yes? No? Why?
6) In tree regression, final predictions are means. Could better
predictions be obtained by a regression model instead? A logistic for a
binary target? Discuss.
7) There are 9 coins, 8 of which are of equal weight, and there is one
scale. How many steps until you identify the odd coin?
8) Why are manhole covers round?
9) You obtain 100% accuracy in validation of a classification model. Are
you a genius? Yes, no, why?
10) If 85% of witnesses saw a blue car during an accident, and 15% saw a red
car, what is the probability that the car is blue?
Counter-interview questions (you ask the interviewer).
1) How do you measure the height of a building with just a
barometer? Give at least three answers.
2) Two players A and B take turns saying a positive integer
from 1 to 9. The numbers are added up; whoever reaches 100
or above loses. Is there a strategy to never lose? (Aborting a
game midway is acceptable, but give your reasoning.)
3) There are two jugs, one that holds 5 gallons, the other 3,
and a nearby water fountain. How do you put exactly 4 gallons
(less than one ounce deviation is fine) in the 5-gallon jug?
For now.
More Related Content

Similar to 4_2_Ensemble models and grad boost part 3.pdf

Predicting the Present with Google Trends
Predicting the Present with Google TrendsPredicting the Present with Google Trends
Predicting the Present with Google Trends
Robert Ressmann
 
Econometrics beat dave giles' blog ardl modelling in e_views 9
Econometrics beat  dave giles' blog  ardl modelling in e_views 9Econometrics beat  dave giles' blog  ardl modelling in e_views 9
Econometrics beat dave giles' blog ardl modelling in e_views 9
b1mit
 
Google Predicting The Present
Google Predicting The PresentGoogle Predicting The Present
Google Predicting The Present
Sinan Ata
 
MEMS and Sensors for Automotive 2017 Report by Yole Developpement
MEMS and Sensors for Automotive 2017 Report by Yole Developpement	MEMS and Sensors for Automotive 2017 Report by Yole Developpement
MEMS and Sensors for Automotive 2017 Report by Yole Developpement
Yole Developpement
 
Quantitative Analysis for ManagementThirteenth EditionLesson.docx
Quantitative Analysis for ManagementThirteenth EditionLesson.docxQuantitative Analysis for ManagementThirteenth EditionLesson.docx
Quantitative Analysis for ManagementThirteenth EditionLesson.docx
hildredzr1di
 

Similar to 4_2_Ensemble models and grad boost part 3.pdf (6)

Predicting the Present with Google Trends
Predicting the Present with Google TrendsPredicting the Present with Google Trends
Predicting the Present with Google Trends
 
Econometrics beat dave giles' blog ardl modelling in e_views 9
Econometrics beat  dave giles' blog  ardl modelling in e_views 9Econometrics beat  dave giles' blog  ardl modelling in e_views 9
Econometrics beat dave giles' blog ardl modelling in e_views 9
 
Google Predicting The Present
Google Predicting The PresentGoogle Predicting The Present
Google Predicting The Present
 
MEMS and Sensors for Automotive 2017 Report by Yole Developpement
MEMS and Sensors for Automotive 2017 Report by Yole Developpement	MEMS and Sensors for Automotive 2017 Report by Yole Developpement
MEMS and Sensors for Automotive 2017 Report by Yole Developpement
 
GRLtmoldguage
GRLtmoldguageGRLtmoldguage
GRLtmoldguage
 
Quantitative Analysis for ManagementThirteenth EditionLesson.docx
Quantitative Analysis for ManagementThirteenth EditionLesson.docxQuantitative Analysis for ManagementThirteenth EditionLesson.docx
Quantitative Analysis for ManagementThirteenth EditionLesson.docx
 

More from Leonardo Auslender

4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf
Leonardo Auslender
 

More from Leonardo Auslender (20)

1 UMI.pdf
1 UMI.pdf1 UMI.pdf
1 UMI.pdf
 
Ensembles.pdf
Ensembles.pdfEnsembles.pdf
Ensembles.pdf
 
Suppression Enhancement.pdf
Suppression Enhancement.pdfSuppression Enhancement.pdf
Suppression Enhancement.pdf
 
4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf
 
4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf
 
4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf
 
4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdf4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdf
 
4_1_Tree World.pdf
4_1_Tree World.pdf4_1_Tree World.pdf
4_1_Tree World.pdf
 
Classification methods and assessment.pdf
Classification methods and assessment.pdfClassification methods and assessment.pdf
Classification methods and assessment.pdf
 
Linear Regression.pdf
Linear Regression.pdfLinear Regression.pdf
Linear Regression.pdf
 
4 MEDA.pdf
4 MEDA.pdf4 MEDA.pdf
4 MEDA.pdf
 
2 UEDA.pdf
2 UEDA.pdf2 UEDA.pdf
2 UEDA.pdf
 
3 BEDA.pdf
3 BEDA.pdf3 BEDA.pdf
3 BEDA.pdf
 
1 EDA.pdf
1 EDA.pdf1 EDA.pdf
1 EDA.pdf
 
0 Statistics Intro.pdf
0 Statistics Intro.pdf0 Statistics Intro.pdf
0 Statistics Intro.pdf
 
0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf
 
4 2 ensemble models and grad boost part 1 2019-10-07
4 2 ensemble models and grad boost part 1 2019-10-074 2 ensemble models and grad boost part 1 2019-10-07
4 2 ensemble models and grad boost part 1 2019-10-07
 
4 meda
4 meda4 meda
4 meda
 
3 beda
3 beda3 beda
3 beda
 
2 ueda
2 ueda2 ueda
2 ueda
 

Recently uploaded

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
yulianti213969
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
siskavia95
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptxClient Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Stephen266013
 

Recently uploaded (20)

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...
Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...
Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...
 
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptxClient Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
 

4_2_Ensemble models and grad boost part 3.pdf

  • 1. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 1 7/1/2018 Ensemble models and Gradient Boosting, part 3. Leonardo Auslender Independent Statistical Consultant Leonardo ‘dot’ Auslender ‘at’ Gmail ‘dot’ com. Copyright 2018.
  • 2. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 2 7/1/2018 Studies 2.8.c: Comparison of methods but focusing on whether raw vs 50/50 re-sampling makes a difference.
  • 3. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 3 7/1/2018
  • 4. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 4 7/1/2018 Aim: study performance of Fraud models with original 20% fraud events by altering percentage of events. 3 studies: M1 5% events M2 20% events, original M3 50% Events. Validation data set is random sample from original 20% data set for all three studies. Battery of models as in previous study, similar graphs for evaluation.
  • 5. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 5 7/1/2018 Model Name Item Information M1 TRN data set train05 . TRN num obs 954 VAL data set validata . VAL num obs 2365 TST data set . TST num obs Dep. Var fraud TRN % Events 5.346 VAL % Events 19.281 M2 TRN data set train . TRN num obs 3595 VAL data set validata . VAL num obs 2365 TST data set . TST num obs Dep. Var fraud TRN % Events 20.389 VAL % Events 19.281 M3 TRN data set train50 . TRN num obs 1133 VAL data set validata . VAL num obs 2365 TST data set M3 . TST num obs 1 2 Dep. Var fraud 1 TRN % Events 50.838 1 VAL % Events 19.281 1
  • 6. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 6 7/1/2018 Requested Models: Names & Descriptions. Model # Full Model Name Model Description *** Overall Models -1 M1 Raw 05pct -10 M2 Raw 20pct -10 M3 50pct -10 01_M1_NSMBL_TRN_LOGISTIC_NONE Logistic TRN NONE Ensemble 1 02_M1_NSMBL_VAL_LOGISTIC_NONE Logistic VAL NONE Ensemble 2 03_M1_TRN_BAGGING Bagging TRN Bagging 3 04_M1_TRN_GRAD_BOOSTING Gradient Boosting 4 05_M1_TRN_LOGISTIC_STEPWISE Logistic TRN STEPWISE 5 06_M1_TRN_RFORESTS Random Forests 6 07_M1_TRN_TREES Trees TRN Trees 7 08_M1_VAL_BAGGING Trees VAL Trees 8 09_M1_VAL_GRAD_BOOSTING Gradient Boosting 9 10_M1_VAL_LOGISTIC_STEPWISE Logistic VAL STEPWISE 10 11_M1_VAL_RFORESTS Random Forests 11 12_M1_VAL_TREES Trees VAL Trees 12 13_M2_NSMBL_TRN_LOGISTIC_NONE Logistic TRN NONE Ensemble 13 14_M2_NSMBL_VAL_LOGISTIC_NONE Logistic VAL NONE Ensemble 14 15_M2_TRN_BAGGING Bagging TRN Bagging 15 16_M2_TRN_GRAD_BOOSTING Gradient Boosting 16 17_M2_TRN_LOGISTIC_STEPWISE Logistic TRN STEPWISE 17 And similarly for rest of M2 and all of M3.
  • 7. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 7 7/1/2018
  • 8. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 8 7/1/2018 Top split level, nodes 2 and 3. Note: 3 M1 (02, 03, 05) models split on member_duration, but corresponding M2 and M3 on no_claims. Lonely M1 GB. Previously, just No_claims.
  • 9. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 9 7/1/2018 Same info, different categorization. Omitted next levels.
  • 10. Leonardo Auslender Copyright 2004 Leonardo Auslender – Copyright 2018 10 7/1/2018 Omitted rest for brevity. Some conclusions: Extreme imbalance has caused different initial split variable and therefore different model structure as opposed to more balanced data sets. In the more balanced cases, even the splitting value has mostly not changed. The probability of event in resulting nodes is different due to different initial event rates. The difference in splitting variables is not necessarily “BAD”. Note that the sample size for more imbalanced data sets is smaller.
• 12. M1 models choose different important variables. Among the 50/50 trees, M3_TREES selected just no_claims, while M2_TREES selected three additional predictors. BG, RF, and GB are not similarly affected. M1 trees have no important variables; they are the most affected by imbalance.
• 14. The 50/50 model stops earlier, but its validation misclassification is higher than for the raw model.
• 15. Similar results (see previous slide).
• 17. Logistic shows a monotonically increasing relationship, while GB is more jagged though still increasing. Just one variable shown. The unbalanced M1 case is seriously affected by comparison. No_claims clearly has a positive effect on the probability of fraud.
• 18. M1 logistic suffers due to event imbalance.
• 20. No_claims and Optom_presc are positively associated with the probability of fraud (Training).
• 21. No_claims and Optom_presc are positively associated with the probability of fraud (Validation).
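The partial dependence plots (PDPs) behind these slides can be computed directly: clamp one feature to each grid value, average the model's predicted probability, and read off the direction of the effect. A minimal sketch on synthetic data (the positive-effect feature here is a stand-in, not the deck's no_claims variable):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
# Synthetic target: feature 0 has a positive effect, feature 1 a negative one.
logit = 1.5 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average predicted probability with `feature` clamped to each grid value."""
    pd_vals = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        pd_vals.append(model.predict_proba(Xv)[:, 1].mean())
    return np.array(pd_vals)

grid = np.linspace(-2, 2, 9)
curve = partial_dependence(model, X, feature=0, grid=grid)
```

For a logistic model a positive coefficient forces a monotonically increasing PDP, which is exactly the contrast the deck draws against GB's more jagged curves.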
• 22. TRN GB (# 3) has No_claims as most important; the others are flat.
• 23. VAL GB repeats No_claims as most important, and adds doctor_visits and optom_presc as positive effects. The corresponding logistic does not point to doctor_visits. The corresponding M1 and M3 are almost identical.
• 25. VAL GB (# 7) is by far the most important.
• 26. Confluence of curves at the point of the event prior. Forests perform very well at TRN but not at VAL.
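Why do the curves converge at the event prior? The cumulative response rate at 100% depth is just the overall event rate, regardless of how good the model is. A small sketch with one informative and one uninformative scorer (all names and score distributions are illustrative assumptions):

```python
import numpy as np

def cum_response_rate(y_true, scores, depths):
    """Cumulative event rate among the top fraction (depth) of scored cases."""
    order = np.argsort(-scores)              # best-scored observations first
    y_sorted = np.asarray(y_true)[order]
    n = len(y_sorted)
    rates = []
    for d in depths:
        k = max(1, int(round(d * n)))
        rates.append(y_sorted[:k].mean())
    return np.array(rates)

rng = np.random.default_rng(2)
y = (rng.random(5000) < 0.20).astype(float)            # ~20% prior, as in VAL
good_scores = y + rng.normal(scale=0.8, size=5000)     # informative model
bad_scores = rng.normal(size=5000)                     # uninformative model

depths = np.array([0.1, 0.3, 0.5, 1.0])
good_curve = cum_response_rate(y, good_scores, depths)
bad_curve = cum_response_rate(y, bad_scores, depths)
```

At depth 1.0 both curves hit `y.mean()` exactly: every model's cumulative response rate collapses to the prior once the whole file is scored.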
• 27. VAL logistic (# 28) works to bring down the positive GB VAL slope.
• 29. Class imbalance shifts the curve down and flattens its slope.
• 30. VAL results for RF in ensemble models are flat.
• 31. Goodness of fit and model selection.
• 32. GOF ranks, VAL. Rank columns: AUROC, Avg Square Error (ASE), Cum Lift 3rd bin, Cum Resp Rate 3rd, Gini, R-square Cramer-Tjur; then unweighted mean and median of the six ranks.

Model Name                     | AUROC | ASE | Lift3 | Resp3 | Gini | Rsq | Unw. Mean | Unw. Median
-------------------------------|-------|-----|-------|-------|------|-----|-----------|------------
02_M1_NSMBL_VAL_LOGISTIC_NONE  |   6   |  3  |   6   |   6   |  6   |  3  |   5.00    |    6.00
08_M1_VAL_BAGGING              |  14   | 13  |  15   |  15   | 14   | 14  |  14.17    |   14.00
09_M1_VAL_GRAD_BOOSTING        |   5   |  4  |   1   |   1   |  5   |  4  |   3.33    |    4.00
10_M1_VAL_LOGISTIC_STEPWISE    |  16   | 11  |  16   |  16   | 16   | 17  |  15.33    |   16.00
11_M1_VAL_RFORESTS             |  13   | 16  |  14   |  14   | 13   | 15  |  14.17    |   14.00
12_M1_VAL_TREES                |  18   | 18  |  18   |  18   | 18   | 18  |  18.00    |   18.00
14_M2_NSMBL_VAL_LOGISTIC_NONE  |   1   |  1  |   2   |   2   |  1   |  1  |   1.33    |    1.00
20_M2_VAL_BAGGING              |   9   |  7  |   9   |   9   |  9   | 12  |   9.17    |    9.00
21_M2_VAL_GRAD_BOOSTING        |   3   |  5  |   3   |   3   |  3   |  5  |   3.67    |    3.00
22_M2_VAL_LOGISTIC_STEPWISE    |  10   |  8  |  11   |  11   | 10   | 10  |  10.00    |   10.00
23_M2_VAL_RFORESTS             |  17   | 12  |  17   |  17   | 17   | 16  |  16.00    |   17.00
24_M2_VAL_TREES                |  12   | 10  |  10   |  10   | 12   |  8  |  10.33    |   10.00
26_M3_NSMBL_VAL_LOGISTIC_NONE  |   2   |  2  |   4   |   4   |  2   |  2  |   2.67    |    2.00
32_M3_VAL_BAGGING              |   8   | 14  |   8   |   8   |  8   |  7  |   8.83    |    8.00
33_M3_VAL_GRAD_BOOSTING        |   4   |  6  |   5   |   5   |  4   |  6  |   5.00    |    5.00
34_M3_VAL_LOGISTIC_STEPWISE    |  11   |  9  |  12   |  12   | 11   | 11  |  11.00    |   11.00
35_M3_VAL_RFORESTS             |   7   | 15  |   7   |   7   |  7   |  9  |   8.67    |    7.00
36_M3_VAL_TREES                |  15   | 17  |  13   |  13   | 15   | 13  |  14.33    |   14.00
• 33. The M2 VAL ensemble is best, and M1 VAL GB is best among single models. M3 VAL performance of single models is lackluster.
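The unweighted mean and median columns in the GOF table are simple rank aggregations across the six measures. A one-row check using the 09_M1_VAL_GRAD_BOOSTING ranks from the table:

```python
import statistics

# Ranks of 09_M1_VAL_GRAD_BOOSTING across the six GOF measures
# (AUROC, ASE, Cum Lift 3rd bin, Cum Resp Rate 3rd, Gini, R-square):
ranks = [5, 4, 1, 1, 5, 4]

unw_mean = sum(ranks) / len(ranks)       # 3.33 in the table
unw_median = statistics.median(ranks)    # 4.00 in the table
print(unw_mean, unw_median)
```

Averaging ranks rather than raw scores keeps measures on incompatible scales (AUROC vs ASE vs lift) comparable before picking an overall winner.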
• 35. Conclusion on re-sampling.
- In this example, the 50/50 resampled M3 models yielded a smaller tree with no discernible performance difference from its M2 counterpart.
- M1 trees failed to perform, while the other M1 methods performed acceptably well.
- Actual performance (for the best models) was not affected by 50/50 versus raw modeling.
- Extreme imbalance seriously affected raw trees, but not the other variants.
- The overall winner in all cases, evaluated at VAL, was GB.
- Models suffer when the event prior is seriously imbalanced, except for GB.
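One practical detail when a model is trained on a 50/50 sample but scored against validation data at the original prior (as in these studies): predicted probabilities can be mapped back with the standard case-control prior correction. This is a generic sketch of that correction, not a step the deck itself performs; `s1` and `s0` denote the sampling fractions of events and non-events.

```python
def correct_prior(p, s1, s0):
    """Map a probability from a re-sampled (case-control) model back to the
    original prior: true odds = model odds * (s0 / s1)."""
    odds = (p / (1.0 - p)) * (s0 / s1)
    return odds / (1.0 + odds)

# Original data: 20% events. Building a 50/50 sample keeps all events
# (s1 = 1) and a matching number of non-events (s0 = 0.2 / 0.8 = 0.25).
# A "coin-flip" score of 0.5 on the balanced sample maps back to the
# original 20% prior:
p_corrected = correct_prior(0.5, s1=1.0, s0=0.25)
```

This only shifts the intercept (calibration); rank-based measures such as AUROC and lift are unchanged, which is consistent with the conclusion that 50/50 versus raw modeling did not change best-model performance here.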
• 37. XGBoost. Developed by Chen and Guestrin (2016), "XGBoost: A Scalable Tree Boosting System." Claims to be faster and better than neural networks and Random Forests. It uses second-order gradients of the loss function, obtained from a Taylor expansion and plugged into the same boosting algorithm, for greater generalization. In addition, it transforms the loss into a more sophisticated objective function containing regularization terms that penalize tree growth, with a penalty proportional to the size of the leaf weights, thus preventing overfitting. More efficient than GB due to parallel computing on a single computer (about 10 times faster). The algorithm takes advantage of an advanced decomposition of the objective function that allows it to outperform GB. Not yet available in SAS; available in R, Julia, Python, and CLI. A tool used in many champion models in recent competitions (Kaggle, etc.). See also Foster's (2017) xgboostExplainer.
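The "second-order plus regularization" machinery can be written down compactly. Per the Chen and Guestrin paper, with per-observation gradients g_i and Hessians h_i of the loss, the optimal leaf weight is w* = -G / (H + lambda) with G = sum(g), H = sum(h), and a split's gain subtracts a per-leaf penalty gamma. A pure-Python sketch of those two formulas (the function names are mine, not XGBoost's API):

```python
def leaf_weight(gs, hs, lam):
    """Optimal leaf weight under the regularized second-order objective."""
    G, H = sum(gs), sum(hs)
    return -G / (H + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Objective reduction of a split; gamma is the per-leaf complexity penalty."""
    def score(gs, hs):
        G, H = sum(gs), sum(hs)
        return G * G / (H + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

# Squared-error loss: g_i = prediction - y_i, h_i = 1. Starting from a
# prediction of 0 for y = [1, 2, 3], the unregularized (lam = 0) leaf
# weight is simply the mean of y; lambda shrinks it toward zero.
g = [0.0 - y for y in [1.0, 2.0, 3.0]]
h = [1.0, 1.0, 1.0]
w_unreg = leaf_weight(g, h, lam=0.0)    # 2.0, the mean
w_reg = leaf_weight(g, h, lam=1.0)      # 1.5, shrunk toward zero
```

The shrinkage of leaf weights by lambda, and the pruning effect of gamma on low-gain splits, are exactly the regularization terms the slide credits with preventing overfitting.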
• 39. Comments
1) It is not immediately apparent what the weak classifier should be for GB (e.g., by varying depth, as in our case). Likewise, the number of iterations is a big issue. In our simple example in the first study, M6 GB was the best performer. Still, overall modeling benefited from ensembling all methods, as measured by AUROC, Cum Lift, or ensemble p-values.
2) The posterior probability ranges are vastly different across methods, so classifying observations by the .5 threshold is too simplistic.
3) The PDPs show that different methods find distinct multivariate structures. Interestingly, the ensemble p-values show a decreasing tendency for logistic and trees and a strong S-shaped tendency for M6 GB (first study), which could mean that M6 GB alone tends to overshoot its predictions.
4) GB is relatively unaffected by the 50/50 mixture.
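Comment 2 deserves a concrete illustration: when a model's posterior probabilities sit well below 0.5 (as under an imbalanced prior), a fixed 0.5 cutoff classifies nothing as an event, so the cutoff should be chosen against a criterion instead. A sketch with toy scores (the data and the F1 criterion are illustrative choices, not the deck's):

```python
import numpy as np

def best_threshold(y_true, scores, criterion):
    """Scan candidate cutoffs and keep the one maximizing `criterion`."""
    y_true = np.asarray(y_true)
    best_t, best_val = 0.5, -np.inf
    for t in np.unique(scores):
        pred = (scores >= t).astype(int)
        v = criterion(y_true, pred)
        if v > best_val:
            best_t, best_val = t, v
    return best_t, best_val

def f1(y, pred):
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy imbalanced sample: the model ranks the two events highest, but all
# probabilities sit below 0.5, so a fixed 0.5 cutoff predicts no events.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([.02, .03, .04, .05, .06, .07, .08, .10, .30, .40])
t, val = best_threshold(y, scores, f1)
```

Any cost function (expected fraud loss, precision at fixed depth, etc.) can replace F1; the point is that the cutoff is a decision parameter, not a constant.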
• 40. Comments
5) While GB classification predictions lie within [0, 1], for continuous-target problems predictions can fall beyond the range of the target variable → headaches. This is because GB models the residual at each iteration, not the original target; this can lead to surprises, such as negative predictions when Y takes only non-negative values, contrary to the original tree algorithm.
6) The shrinkage parameter and early stopping (# trees) act as regularizers, but their combined effect is not known and could be ineffective.
7) If shrinkage is too small and a large T is allowed, the model is large: expensive to compute, implement, and understand.
8) Random Forests over-fitted. A larger study should incorporate changes in its parameters for better validation.
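Comment 5's mechanism, each tree fitting the current residuals and the final prediction being an additive sum of those fits, is easy to see in a from-scratch sketch for squared loss (a teaching toy on synthetic data, not the deck's software; `nu` is the shrinkage of comment 6):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

def fit_gb(X, y, n_trees=100, nu=0.1, depth=2):
    """Plain gradient boosting for squared loss: each tree fits residuals."""
    pred = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_trees):
        resid = y - pred                  # negative gradient of squared loss
        t = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, resid)
        pred += nu * t.predict(X)         # shrinkage nu acts as a regularizer
        trees.append(t)
    return trees, pred

trees, pred = fit_gb(X, y)
mse_gb = float(np.mean((y - pred) ** 2))
mse_mean = float(np.mean((y - y.mean()) ** 2))
```

Because the final prediction is an accumulated sum of residual fits rather than a leaf mean of the raw target, nothing constrains it to the observed range of Y, which is exactly how out-of-range (e.g., negative) predictions arise for continuous targets.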
• 41. Comments
9) Model interpretation is difficult in the case of BG, RF, and GB (and not trivial for the other methods either). PDPs for logistic regression variables show monotonic relationships, while those of GB variables are very nonlinear. PDPs for the other methods were not created.
• 42. Drawbacks of GB.
1) IT IS NOT MAGIC: it won't solve ALL modeling needs, though it is the best off-the-shelf tool. You still need to look for transformations, odd issues, missing values, etc.
2) As with all tree methods, categorical variables with many levels (e.g., zip codes) can make it impossible to obtain a model.
3) Memory requirements can be very large, especially with many iterations, a typical problem of ensemble methods.
4) A large number of iterations → slow prediction; on-line scoring may require a trade-off between complexity and time available. Once GB is learned, parallelization certainly helps.
5) There is no simple algorithm to capture interactions, because of the base-learners.
6) There are no simple rules to determine gamma, the number of iterations, or the depth of the simple learner. You need to try different combinations and possibly recalibrate over time.
7) Still, it is one of the most powerful methods available.
• 43. Un-reviewed:
- CatBoost
- DeepForest (gcForest)
- Use of tree methods for continuous target variables
- Naïve-Bayes
- Bootstrapping
- …
• 44. 2.11) References
Auslender, L. (1998): Alacart, poor man's classification trees, NESUG.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984): Classification and Regression Trees, Wadsworth.
Chen, T., Guestrin, C. (2016): XGBoost: A Scalable Tree Boosting System.
Chipman, H., George, E., McCulloch, R.: BART: Bayesian Additive Regression Trees, The Annals of Applied Statistics.
Foster, D. (2017): New R package that makes XGBoost interpretable, https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
Friedman, J. (2001): Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189-1232. doi:10.1214/aos/1013203451
Paluszynska, A. (2017): Structural mining and knowledge extraction from random forest with applications to The Cancer Genome Atlas project (https://www.google.com/url?q=https%3A%2F%2Frawgit.com%2FgeneticsMiNIng%2FBlackBoxOpener%2Fmaster%2FrandomForestExplainer_Master_thesis.pdf&sa=D&sntz=1&usg=AFQjCNHTJONZK24LioDeOB0KZnwLkn98fw and https://mi2datalab.github.io/randomForestExplainer/)
Quinlan, J.R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
• 45. Earlier literature on combining methods:
Winkler, R.L. and Makridakis, S. (1983): The combination of forecasts. J. R. Statist. Soc. A, 146(2), 150-157.
Makridakis, S. and Winkler, R.L. (1983): Averages of forecasts: some empirical results. Management Science, 29(9), 987-996.
Bates, J.M. and Granger, C.W.J. (1969): The combination of forecasts. OR, 451-468.
• 47. Interview questions.
1) Can you explain in nontechnical language the idea of maximum likelihood estimation? Of SVM (unreviewed in class)?
2) Contrast GB with RF.
3) In what way is over-fitting like a glove? Like an umbrella?
4) Would ensemble models always improve on individual models?
5) Would you select variables by way of tree methods to use in linear methods later on? Yes? No? Why?
6) In tree regression, final predictions are means. Could better predictions be obtained by a regression model instead? A logistic for a binary target? Discuss.
7) There are 9 coins, 8 of which are of equal weight, and there's one scale. How many steps until you identify the odd coin?
8) Why are manhole covers round?
9) You obtain 100% accuracy in validation of a classification model. Are you a genius? Yes, no, why?
10) If 85% of witnesses saw a blue car during an accident, and 15% saw a red car, what is the probability that the car is blue?
• 48. Counter-interview questions (you ask the interviewer).
1) How do you measure the height of a building with just a barometer? Give at least three answers.
2) Two players, A and B, take turns saying a positive integer from 1 to 9. The numbers are added; whoever reaches 100 or above loses. Is there a strategy to never lose? (Aborting a game midway is acceptable, but give reasoning.)
3) There are two jugs, one that holds 5 gallons, the other 3, and a nearby water fountain. How do you put exactly 4 gallons (less than one ounce deviation is fine) in the 5-gallon jug?
• 49. …for now.