Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)

Dr. Trevor Hastie of Stanford University discusses the data science behind Gradient Boosted Regression and Classification


Transcript

  • 1. SLDM III © Hastie & Tibshirani, October 9, 2013: Random Forests and Boosting
    Ensemble Learners
    • Classification Trees (done)
    • Bagging: Averaging Trees
    • Random Forests: Cleverer Averaging of Trees
    • Boosting: Cleverest Averaging of Trees
    • Details in Chapters 10, 15 and 16
    Methods for dramatically improving the performance of weak learners such as trees. Classification trees are adaptive and robust, but do not make good prediction models. The techniques discussed here enhance their performance considerably.
  • 2. Properties of Trees
    Strengths:
    • Can handle huge datasets
    • Can handle mixed predictors: quantitative and qualitative
    • Easily ignore redundant variables
    • Handle missing data elegantly through surrogate splits
    • Small trees are easy to interpret
    Weaknesses:
    • Large trees are hard to interpret
    • Prediction performance is often poor
  • 3. [Figure: pruned classification tree for the SPAM data. The root split is on ch$ (frequency of "$"); later splits use predictors such as remove, hp, ch!, george, free, CAPMAX, business, receive, edu, our, CAPAVE and 1999, with terminal nodes labeled email or spam.]
  • 4. ROC Curve for Pruned Tree on SPAM Data
    [Figure: ROC curve (sensitivity versus specificity) for the pruned tree; TREE error 8.7%.]
    Overall error rate on test data: 8.7%. The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if Pr(+1|X) > c0.
    Sensitivity: proportion of true spam identified.
    Specificity: proportion of true email identified.
    We may want specificity to be high, and suffer some spam: Specificity 95% implies Sensitivity 79%.
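    The sensitivity/specificity trade-off above can be traced out directly from the tree's class probabilities. Below is a minimal sketch in R (not from the slides) using rpart; the data frames spam_train and spam_test and the factor response is_spam are hypothetical stand-ins for the SPAM data.

      library(rpart)

      fit    <- rpart(is_spam ~ ., data = spam_train, method = "class")
      p_spam <- predict(fit, newdata = spam_test, type = "prob")[, "spam"]

      roc_point <- function(c0) {
        call_spam <- p_spam > c0                      # classify as spam when Pr(spam | x) > c0
        truth     <- spam_test$is_spam == "spam"
        c(sensitivity = mean(call_spam[truth]),       # proportion of true spam identified
          specificity = mean(!call_spam[!truth]))     # proportion of true email identified
      }

      roc <- t(sapply(seq(0, 1, by = 0.01), roc_point))
      plot(roc[, "specificity"], roc[, "sensitivity"], type = "l",
           xlab = "Specificity", ylab = "Sensitivity")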
  • 5. Toy Classification Problem
    [Figure: scatter plot of the two classes in the plane with the Bayes decision boundary; Bayes error rate 0.25.]
    • Data X and Y, with Y taking values +1 or −1. Here X ∈ R^2.
    • The black boundary is the Bayes decision boundary, the best one can do. Here the Bayes error rate is 25%.
    • Goal: given N training pairs (Xi, Yi), produce a classifier Ĉ(X) ∈ {−1, +1}.
    • Also estimate the probability of the class labels, Pr(Y = +1|X).
  • 6. Toy Example: No Noise
    [Figure: scatter plot of the two classes with the circular Bayes decision boundary; Bayes error rate 0.]
    • Deterministic problem; the noise comes from the sampling distribution of X.
    • Use a training sample of size 200.
    • Here the Bayes error is 0%.
  • 7. Classification Tree
    [Figure: classification tree fit to the toy data, with splits on x.1 and x.2 and the misclassification counts shown at each node.]
  • 8. Decision Boundary: Tree
    [Figure: training data with the boxy decision boundary of the fitted tree; error rate 0.073.]
    • The eight-node tree is boxy, and has a test-error rate of 7.3%.
    • When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.
  • 9. Model Averaging
    Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
    • Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
    • Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
    • Random Forests (Breiman, 1999): Fancier version of bagging.
    In general: Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.
  • 10. Bagging
    • Bagging, or bootstrap aggregation, averages a noisy fitted function refit to many bootstrap samples in order to reduce its variance. Bagging is a poor man's Bayes; see pp. 282.
    • Suppose C(S, x) is a fitted classifier, such as a tree, fit to our training data S, producing a predicted class label at input point x.
    • To bag C, we draw bootstrap samples S*1, ..., S*B, each of size N, with replacement from the training data. Then
      Cbag(x) = Majority Vote { C(S*b, x) }, b = 1, ..., B.
    • Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost. (A sketch of the recipe follows.)
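    As a concrete illustration of the recipe above, here is a small hand-rolled bagging sketch in R with rpart trees (an illustration only, not the authors' code); train is a hypothetical data frame with a factor response y, and test holds the points to classify.

      library(rpart)

      bag_trees <- function(train, test, B = 100) {
        N <- nrow(train)
        votes <- replicate(B, {
          boot <- train[sample(N, N, replace = TRUE), ]          # bootstrap sample S*b of size N
          tree <- rpart(y ~ ., data = boot, method = "class",
                        control = rpart.control(cp = 0))         # grow a large tree
          as.character(predict(tree, newdata = test, type = "class"))
        })
        apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote over the B trees
      }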
  • 11. [Figure: the original classification tree and the top splits of trees grown to 11 bootstrap samples (b = 1, ..., 11); the bootstrap trees split on different variables (x.1, x.2, x.3, x.4) and thresholds.]
  • 12. Decision Boundary: Bagging
    [Figure: training data with the bagged decision boundary; error rate 0.032.]
    • Bagging reduces error by variance reduction.
    • Bagging averages many trees, and produces smoother decision boundaries.
    • Bias is slightly increased because bagged trees are slightly shallower.
  • 13. Random Forests
    • A refinement of bagged trees; very popular. See Chapter 15.
    • At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = sqrt(p) or log2(p), where p is the number of features.
    • For each tree grown on a bootstrap sample, the error rate for observations left out of that bootstrap sample is monitored. This is called the "out-of-bag" or OOB error rate.
    • Random forests try to improve on bagging by "de-correlating" the trees, which reduces the variance. (A usage sketch follows.)
    • Each tree has the same expectation, so increasing the number of trees does not alter the bias of bagging or random forests.
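    For completeness, a minimal usage sketch of the randomForest package referenced later in the deck (assumed data: a data frame spam_train with a factor response type; this is not code from the slides):

      library(randomForest)

      set.seed(1)
      p  <- ncol(spam_train) - 1                      # number of features
      rf <- randomForest(type ~ ., data = spam_train,
                         ntree = 500,                 # number of bootstrap trees
                         mtry  = floor(sqrt(p)))      # m = sqrt(p), the usual default for classification

      rf$err.rate[500, "OOB"]                         # out-of-bag error estimate after 500 trees
      varImpPlot(rf)                                  # variable importance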
  • 14. Wisdom of Crowds
    [Figure: expected number correct out of 10 versus p, the probability that an informed person is correct, for the consensus (orange) and for individuals (teal).]
    • 10 movie categories, 4 choices in each category.
    • 50 people; for each question, 15 informed and 35 guessers.
    • For informed people, Pr(correct) = p and Pr(incorrect) = (1 − p)/3 for each of the three wrong choices.
    • For guessers, each of the 4 choices has probability 1/4.
    • The consensus (orange) improves over individuals (teal) even for moderate p.
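    A quick simulation sketch of this setup (my own illustration under the stated assumptions, not from the slides):

      # 10 questions, 4 choices each; 15 informed voters with accuracy p, 35 guessers.
      consensus_correct <- function(p, n_q = 10, n_inf = 15, n_guess = 35) {
        correct <- 0
        for (q in seq_len(n_q)) {
          informed <- sample(1:4, n_inf, replace = TRUE,
                             prob = c(p, rep((1 - p) / 3, 3)))   # choice 1 is the truth
          guessers <- sample(1:4, n_guess, replace = TRUE)
          votes    <- tabulate(c(informed, guessers), nbins = 4)
          if (which.max(votes) == 1) correct <- correct + 1
        }
        correct
      }

      set.seed(1)
      mean(replicate(2000, consensus_correct(p = 0.6)))    # consensus: expected correct out of 10
      10 * ((15/50) * 0.6 + (35/50) * 0.25)                # average individual, for comparison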
  • 15. ROC Curves for TREE, SVM and Random Forest on the SPAM Data
    [Figure: ROC curves (sensitivity versus specificity) for the three classifiers. Random Forest error: 4.75%; SVM error: 6.7%; TREE error: 8.7%.]
    Random Forest dominates both other methods on the SPAM data, with 4.75% error. We used 500 trees with the default settings of the randomForest package in R.
  • 16. RF: Decorrelation
    [Figure: correlation between trees versus m, the number of randomly selected splitting variables, based on a simple experiment with random forest regression models.]
    • Var f̂RF(x) = ρ(x) σ^2(x).
    • The correlation arises because we are growing trees to the same data.
    • The correlation decreases as m decreases, where m is the number of variables sampled at each split.
  • 17. Random Forest Ensemble
    [Figure: mean squared error, squared bias and variance of the random forest ensemble as a function of m.]
    • The smaller m, the lower the variance of the random forest ensemble.
    • Small m is also associated with higher bias, because important variables can be missed by the sampling.
  • 18. Boosting
    [Figure: schematic of boosting. A classifier C1(x) is fit to the training sample; classifiers C2(x), C3(x), ..., CM(x) are then fit to successively reweighted samples.]
    • Breakthrough invention of Freund & Schapire (1997).
    • Average many trees, each grown to reweighted versions of the training data.
    • Weighting decorrelates the trees, by focusing on regions missed by past trees.
    • The final classifier is a weighted average of classifiers:
      C(x) = sign[ Σ (m=1..M) αm Cm(x) ],  with Cm(x) ∈ {−1, +1}.
  • 19. Boosting vs. Bagging
    [Figure: test error versus number of terms for Bagging and AdaBoost with 100-node trees.]
    • 2000 points from the nested-spheres problem in R^10.
    • The Bayes error rate is 0%.
    • Trees are grown best-first, without pruning.
    • The leftmost term is a single tree.
  • 20. AdaBoost (Freund & Schapire, 1996)
    1. Initialize the observation weights wi = 1/N, i = 1, 2, ..., N.
    2. For m = 1 to M repeat steps (a)–(d):
       (a) Fit a classifier Cm(x) to the training data using weights wi.
       (b) Compute the weighted error of the newest tree:
           errm = Σi wi I(yi ≠ Cm(xi)) / Σi wi.
       (c) Compute αm = log[(1 − errm)/errm].
       (d) Update the weights for i = 1, ..., N:
           wi ← wi · exp[αm · I(yi ≠ Cm(xi))],
           and renormalize so that the wi sum to 1.
    3. Output C(x) = sign[ Σ (m=1..M) αm Cm(x) ].
    (An R sketch of these steps follows.)
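    The same steps can be written out almost verbatim in R with rpart stumps as the weak learners. This is an illustrative sketch only (x is a hypothetical data frame of predictors, y a vector of labels coded −1/+1), not the original implementation:

      library(rpart)

      adaboost <- function(x, y, M = 100) {
        N    <- nrow(x)
        dat  <- data.frame(x, .y = factor(y))
        w    <- rep(1 / N, N)                                   # 1. equal starting weights
        fits <- vector("list", M); alpha <- numeric(M)
        for (m in seq_len(M)) {
          fit  <- rpart(.y ~ ., data = dat, weights = w,        # (a) weighted stump
                        control = rpart.control(maxdepth = 1))
          pred <- as.numeric(as.character(predict(fit, dat, type = "class")))
          miss <- as.numeric(y != pred)
          err  <- sum(w * miss) / sum(w)                        # (b) weighted error
          alpha[m] <- log((1 - err) / err)                      # (c) classifier weight
          w <- w * exp(alpha[m] * miss); w <- w / sum(w)        # (d) reweight and renormalize
          fits[[m]] <- fit
        }
        list(fits = fits, alpha = alpha)
      }

      predict_adaboost <- function(model, newx) {               # 3. sign of the weighted vote
        scores <- mapply(function(f, a)
            a * as.numeric(as.character(predict(f, newx, type = "class"))),
          model$fits, model$alpha)
        sign(rowSums(scores))
      }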
  • 21. Random Forests and Boosting 0.5 SLDM III c Hastie & Tibshirani - October 9, 2013 247 Boosting Stumps 0.4 Single Stump 0.3 • Boosting stumps works remarkably well on the nested-spheres problem. 0.1 0.2 400 Node Tree 0.0 Test Error • A stump is a two-node tree, after a single split. 0 100 200 Boosting Iterations 300 400
  • 22. [Figure: training error (green) and test error (red) versus number of terms for boosting.]
    • Nested spheres in 10 dimensions; the Bayes error is 0%.
    • Boosting drives the training error to zero (green curve).
    • Further iterations continue to improve the test error in many examples (red curve).
  • 23. Random Forests and Boosting 0.5 SLDM III c Hastie & Tibshirani - October 9, 2013 249 0.3 • Nested Gaussians in 10-Dimensions. • Bayes error is 25%. Bayes Error 0.2 • Boosting with stumps 0.1 • Here the test error does increase, but quite slowly. 0.0 Train and Test Error 0.4 Noisy Problems 0 100 200 300 Number of Terms 400 500 600
  • 24. Stagewise Additive Modeling
    Boosting builds an additive model
      F(x) = Σ (m=1..M) βm b(x; γm),
    where b(x; γm) is a tree and γm parametrizes its splits. We do things like that in statistics all the time!
    • GAMs: F(x) = Σj fj(xj)
    • Basis expansions: F(x) = Σ (m=1..M) θm hm(x)
    Traditionally the parameters fj, θm are fit jointly (i.e. by least squares or maximum likelihood). With boosting, the parameters (βm, γm) are fit in a stagewise fashion. This slows the process down, and overfits less quickly.
  • 25. Example: Least Squares Boosting
    1. Start with F0(x) = 0 and residual r = y; set m = 0.
    2. m ← m + 1.
    3. Fit a CART regression tree to r, giving g(x).
    4. Set fm(x) ← ε · g(x).
    5. Set Fm(x) ← Fm−1(x) + fm(x) and r ← r − fm(x), and repeat steps 2–5 many times.
    • g(x) is typically a shallow tree, e.g. constrained to have only k = 3 splits (depth).
    • Stagewise fitting (no backfitting!) slows down the (over-)fitting process.
    • 0 < ε ≤ 1 is a shrinkage parameter; it slows overfitting even further.
    (A short sketch of this loop follows.)
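    The loop above translates almost line for line into R with rpart as the tree fitter (a sketch under assumed inputs: x, a data frame of predictors, and y, a numeric response; not the authors' code):

      library(rpart)

      ls_boost <- function(x, y, M = 500, eps = 0.01, depth = 3) {
        F_hat <- rep(0, length(y))                    # F0 = 0 at the training points
        r     <- y                                    # initial residual
        trees <- vector("list", M)
        for (m in seq_len(M)) {
          dat   <- data.frame(x, .r = r)
          g     <- rpart(.r ~ ., data = dat,          # step 3: regression tree fit to the residual
                         control = rpart.control(maxdepth = depth))
          step  <- eps * predict(g, dat)              # step 4: shrink the new tree by eps
          F_hat <- F_hat + step                       # step 5: update the fit ...
          r     <- r - step                           #         ... and the residual
          trees[[m]] <- g
        }
        list(trees = trees, eps = eps)
      }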
  • 26. Least Squares Boosting and ε
    [Figure: prediction error versus number of steps m for ε = 1 and ε = 0.01, with a single tree as reference.]
    If, instead of trees, we use linear regression terms, boosting with small ε is very similar to the lasso!
  • 27. AdaBoost: Stagewise Modeling
    • AdaBoost builds an additive logistic regression model
      F(x) = log[ Pr(Y = 1|x) / Pr(Y = −1|x) ] = Σ (m=1..M) αm fm(x)
      by stagewise fitting using the loss function L(y, F(x)) = exp(−y F(x)).
    • Instead of fitting trees to residuals, the special form of the exponential loss leads to fitting trees to weighted versions of the original data. See pp. 343.
  • 28. Why Exponential Loss?
    [Figure: misclassification, exponential, binomial deviance, squared error and support-vector losses as functions of the margin y · F.]
    • It leads to a simple reweighting scheme.
    • e^(−yF(x)) is a monotone, smooth upper bound on the misclassification loss at x.
    • Its population minimizer is the logit transform:
      F*(x) = (1/2) log[ Pr(Y = 1|x) / Pr(Y = −1|x) ].
    • y · F is the margin; a positive margin means correct classification.
    • There are other, more robust loss functions, such as the binomial deviance.
  • 29. General Boosting Algorithms
    The boosting paradigm can be extended to general loss functions, not only squared-error or exponential loss.
    1. Initialize F0(x) = 0.
    2. For m = 1 to M:
       (a) Compute (βm, γm) = argmin over (β, γ) of Σ (i=1..N) L(yi, Fm−1(xi) + β b(xi; γ)).
       (b) Set Fm(x) = Fm−1(x) + βm b(x; γm).
    Sometimes we replace step (b) by
       (b*) Set Fm(x) = Fm−1(x) + ε βm b(x; γm).
    Here ε is a shrinkage factor; in the gbm package in R, the default is ε = 0.001. Shrinkage slows the stagewise model building even more, and typically leads to better performance.
  • 30. Gradient Boosting
    • A general boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.
    • Gradient Boosting builds additive tree models, for example for representing the logits in logistic regression.
    • Tree size is a parameter that determines the order of interaction (next slide).
    • Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
    • Gradient Boosting is described in detail in Section 10.10 of the text, and is implemented in Greg Ridgeway's gbm package in R. (A usage sketch follows.)
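    A minimal usage sketch of the gbm package mentioned above (my illustration, not from the slides): spam_train and spam_test with a 0/1 response spam01 are hypothetical, and the predictor name "remove" is borrowed from the importance slide.

      library(gbm)

      set.seed(1)
      fit <- gbm(spam01 ~ ., data = spam_train,
                 distribution = "bernoulli",          # logistic-type loss: the model is the logit F(x)
                 n.trees = 2500,
                 interaction.depth = 5,               # tree size controls the interaction order
                 shrinkage = 0.01,                    # the shrinkage factor epsilon
                 cv.folds = 5)

      best <- gbm.perf(fit, method = "cv")            # number of trees chosen by cross-validation
      p    <- predict(fit, newdata = spam_test, n.trees = best, type = "response")
      mean((p > 0.5) != spam_test$spam01)             # test misclassification error

      summary(fit, n.trees = best)                    # relative variable importance (cf. slide 33)
      plot(fit, i.var = "remove", n.trees = best)     # partial dependence for one predictor (cf. slide 34)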
  • 31. Spam Data
    [Figure: test error versus number of trees for Bagging, Random Forest and Gradient Boosting (5-node trees).]
  • 32. Spam Example Results
    With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model
      f(x) = log[ Pr(spam|x) / Pr(email|x) ]
    using trees with J = 6 terminal nodes. Gradient Boosting achieves a test error of 4.5%, compared to 5.3% for an additive GAM, 4.75% for Random Forests, and 8.7% for CART.
  • 33. Spam: Variable Importance
    [Figure: relative importance of the predictors; the most important include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george and edu, while many others (e.g. 3d, addresses, labs, telnet, 857, 415) contribute little.]
  • 34. Spam: Partial Dependence
    [Figure: partial dependence of the fitted log-odds on four predictors: !, remove, edu and hp.]
  • 35. California Housing Data
    [Figure: test average absolute error versus number of trees for RF (m = 2), RF (m = 6), GBM (depth = 4) and GBM (depth = 6).]
  • 36. Tree Size
    [Figure: test error versus number of terms for stumps, 10-node trees, 100-node trees and AdaBoost.]
    The tree size J determines the interaction order of the model:
      η(X) = Σj ηj(Xj) + Σ(j,k) ηjk(Xj, Xk) + Σ(j,k,l) ηjkl(Xj, Xk, Xl) + ···
  • 37. Stumps Win!
    Since the true decision boundary is the surface of a sphere, the function that describes it has the form
      f(X) = X1^2 + X2^2 + ... + Xp^2 − c = 0.
    Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.
    [Figure: the ten estimated coordinate functions f1(x1), ..., f10(x10) for the additive logistic trees.]
  • 38. Learning Ensembles
    Ensemble learning consists of two steps:
    1. Construction of a dictionary D = {T1(X), T2(X), ..., TM(X)} of basis elements (weak learners) Tm(X).
    2. Fitting a model f(X) = Σ (m ∈ D) αm Tm(X).
    Simple examples of ensembles:
    • Linear regression: the ensemble consists of the coordinate functions Tm(X) = Xm. The fitting is done by least squares.
    • Random Forests: the ensemble consists of trees grown to bootstrapped versions of the data, with additional randomization at each split. The fitting simply averages.
    • Gradient Boosting: the ensemble is grown in an adaptive fashion, but then simply averaged at the end.
  • 39. Learning Ensembles and the Lasso
    Proposed by Popescu and Friedman (2004): fit the model f(X) = Σ (m ∈ D) αm Tm(X) in step (2) using lasso regularization:
      α̂(λ) = argmin over α of  Σ (i=1..N) L[ yi, α0 + Σ (m=1..M) αm Tm(xi) ]  +  λ Σ (m=1..M) |αm|.
    Advantages (a post-processing sketch follows):
    • Often ensembles are large, involving thousands of small trees. The post-processing selects a smaller subset of these and combines them efficiently.
    • Often the post-processing improves the process that generated the ensemble.
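    A rough sketch of this post-processing idea in R (my own illustration under assumptions, not the authors' code): collect the per-tree predictions of a random forest and reweight them with the lasso via glmnet. Here x_train, y_train (a numeric response) and x_test are hypothetical.

      library(randomForest)
      library(glmnet)

      set.seed(1)
      rf <- randomForest(x_train, y_train, ntree = 500)

      # Dictionary: the prediction of each individual tree at each training point
      T_train <- predict(rf, x_train, predict.all = TRUE)$individual   # N x 500 matrix

      # Lasso over the tree predictions selects and reweights a subset of the trees
      post <- cv.glmnet(T_train, y_train, alpha = 1)
      sum(coef(post, s = "lambda.min") != 0)                           # coefficients retained (incl. intercept)

      # Predictions of the post-processed ensemble
      T_test <- predict(rf, x_test, predict.all = TRUE)$individual
      y_hat  <- predict(post, newx = T_test, s = "lambda.min")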
  • 40. Post-Processing Random Forest Ensembles
    [Figure: Spam data test error versus number of trees for Random Forest, Random Forest (5%, 6) and Gradient Boosting (5-node trees).]
  • 41. Software
    • R: free GPL statistical computing environment available from CRAN; implements the S language. Includes:
      – randomForest: implementation of Leo Breiman's algorithms.
      – rpart: Terry Therneau's implementation of classification and regression trees.
      – gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.
    • Salford Systems: commercial implementation of trees, random forests and gradient boosting.
    • Splus (Insightful): commercial version of S.
    • Weka: GPL software from the University of Waikato, New Zealand. Includes trees, random forests and many other procedures.