
- 1. SLDM III (Hastie & Tibshirani), October 9, 2013. Random Forests and Boosting
  Ensemble Learners
  • Classification Trees (done)
  • Bagging: averaging trees
  • Random Forests: cleverer averaging of trees
  • Boosting: cleverest averaging of trees
  • Details in chapters 10, 15 and 16
  Methods for dramatically improving the performance of weak learners such as trees. Classification trees are adaptive and robust, but do not make good prediction models. The techniques discussed here enhance their performance considerably.
- 2. Properties of Trees
  • ✓ Can handle huge datasets
  • ✓ Can handle mixed predictors (quantitative and qualitative)
  • ✓ Easily ignore redundant variables
  • ✓ Handle missing data elegantly through surrogate splits
  • ✓ Small trees are easy to interpret
  • ✗ Large trees are hard to interpret
  • ✗ Prediction performance is often poor
- 3. [Figure: large classification tree fit to the SPAM data (600/1536 misclassified at the root), splitting on variables such as ch$, remove, hp, george, CAPAVE, CAPMAX, free, business, edu, with terminal nodes labeled email or spam.]
- 4. ROC curve for pruned tree on SPAM data
  [Figure: ROC curve, sensitivity vs. specificity, for the pruned tree; error 8.7%.]
  Overall error rate on test data: 8.7%. The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if Pr(+1|X) > c0.
  • Sensitivity: proportion of true spam identified.
  • Specificity: proportion of true email identified.
  We may want specificity to be high, and suffer some spam: specificity 95% ⟹ sensitivity 79%.
- 5. Toy Classification Problem
  [Figure: scatter of two classes in the plane; Bayes error rate 0.25; the black curve is the Bayes decision boundary.]
  • Data X and Y, with Y taking values +1 or −1.
  • Here X ∈ ℝ².
  • The black boundary is the Bayes decision boundary, the best one can do. Here the Bayes error is 25%.
  • Goal: given N training pairs (Xᵢ, Yᵢ), produce a classifier Ĉ(X) ∈ {−1, 1}.
  • Also estimate the probability of the class labels, Pr(Y = +1|X).
- 6. Toy Example: No Noise
  [Figure: nested-spheres data in the plane; Bayes error rate 0.]
  • Deterministic problem; noise comes from the sampling distribution of X.
  • Use a training sample of size 200.
  • Here the Bayes error is 0%.
- 7. Classification Tree
  [Figure: eight-node classification tree fit to the toy data, splitting on x.1 and x.2.]
- 8. Decision Boundary: Tree
  [Figure: boxy decision boundary of the eight-node tree; error rate 0.073.]
  • The eight-node tree is boxy, and has a test-error rate of 7.3%.
  • When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.
- 9. Model Averaging
  Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
  • Bagging (Breiman, 1996): fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
  • Boosting (Freund & Schapire, 1996): fit many large or small trees to reweighted versions of the training data; classify by weighted majority vote.
  • Random Forests (Breiman, 1999): fancier version of bagging.
  In general: Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.
- 10. Bagging
  • Bagging, or bootstrap aggregation, averages a noisy fitted function refit to many bootstrap samples, to reduce its variance. Bagging is a poor man's Bayes; see p. 282.
  • Suppose C(S, x) is a fitted classifier, such as a tree, fit to our training data S, producing a predicted class label at input point x.
  • To bag C, we draw bootstrap samples S*1, ..., S*B, each of size N, with replacement from the training data. Then
      C_bag(x) = Majority Vote { C(S*b, x) }, b = 1, ..., B.
  • Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost.
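The bagging recipe on this slide can be sketched in a few lines. This is an illustrative pure-Python sketch, not the randomForest/gbm code the slides use: `fit_stump` is a toy stand-in base classifier (a mean-split stump on 1-D data) invented here so the example is self-contained; any unstable learner would do in its place.

```python
import random
from collections import Counter

def fit_stump(S):
    """Toy base classifier: split a 1-D labeled sample at its mean,
    predicting the majority label on each side. (Stand-in for a tree.)"""
    xs = [x for x, _ in S]
    t = sum(xs) / len(xs)
    left = [y for x, y in S if x <= t] or [y for _, y in S]
    right = [y for x, y in S if x > t] or [y for _, y in S]
    lmaj = Counter(left).most_common(1)[0][0]
    rmaj = Counter(right).most_common(1)[0][0]
    return lambda x: lmaj if x <= t else rmaj

def bag(S, B=25, seed=0):
    """Draw B bootstrap samples of size N with replacement,
    fit the base classifier to each, and majority-vote the fits."""
    rng = random.Random(seed)
    N = len(S)
    fits = [fit_stump([rng.choice(S) for _ in range(N)]) for _ in range(B)]
    def C_bag(x):
        return Counter(f(x) for f in fits).most_common(1)[0][0]
    return C_bag

# Small separable demo set: label -1 on the left, +1 on the right.
S = [(0.1, -1), (0.2, -1), (0.3, -1), (0.8, 1), (0.9, 1), (1.0, 1)]
C = bag(S)
```

The vote smooths out the individual stumps: even when a bootstrap sample happens to contain only one class, the remaining replicates dominate the majority.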
- 11. [Figure: the original tree and eleven trees (b = 1, ..., 11) grown on bootstrap samples; the bootstrap trees split on different variables and thresholds, e.g. x.1 < 0.395, x.2 < 0.205, x.3 < 0.985, x.4 < −1.36.]
- 12. Decision Boundary: Bagging
  [Figure: smoother decision boundary from bagging; error rate 0.032.]
  • Bagging reduces error by variance reduction.
  • Bagging averages many trees, and produces smoother decision boundaries.
  • Bias is slightly increased because bagged trees are slightly shallower.
- 13. Random Forests
  • A refinement of bagged trees; very popular. See chapter 15.
  • At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log₂ p, where p is the number of features.
  • For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" (OOB) error rate.
  • Random forests try to improve on bagging by de-correlating the trees, which reduces the variance of the average.
  • Each tree has the same expectation, so increasing the number of trees does not alter the bias of bagging or random forests.
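The defining tweak over bagging is that each split searches only a random subset of m features. A minimal sketch of that split search, assuming a Gini-impurity criterion and hypothetical helper names (`gini`, `best_random_split`) that are not from the slides:

```python
import math
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_random_split(X, y, rng, m=None):
    """Search the best (feature, threshold) over a random subset of m
    features only -- the random-forest tweak; m defaults to sqrt(p)."""
    p = len(X[0])
    m = m or max(1, round(math.sqrt(p)))
    feats = rng.sample(range(p), m)      # only these m features compete
    best, best_imp = None, float("inf")
    for j in feats:
        for t in {row[j] for row in X}:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not right:
                continue
            imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if imp < best_imp:
                best_imp, best = imp, (j, t)
    return best

# Demo: features 0 and 2 separate the classes perfectly; with m = p the
# search over all features must find one of the pure splits.
X = [[0, 5, 1, 1], [0, 6, 2, 9], [1, 5, 3, 1], [1, 6, 4, 9]]
y = [0, 0, 1, 1]
j, t = best_random_split(X, y, random.Random(0), m=4)
```

With m < p, a tree may be forced onto a weaker split, which is exactly what de-correlates the trees across the forest.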
- 14. Wisdom of Crowds
  [Figure: expected number correct out of 10 for consensus (orange) vs. individual (teal), as a function of p, the probability of an informed person being correct.]
  • 10 movie categories, 4 choices in each category.
  • 50 people; for each question, 15 informed and 35 guessers.
  • For informed people, Pr(correct) = p and Pr(each incorrect choice) = (1 − p)/3.
  • For guessers, each of the 4 choices has probability 1/4.
  Consensus (orange) improves over individuals (teal) even for moderate p.
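The slide's claim is easy to check by simulation. This Monte Carlo sketch (the setup numbers come from the slide; the function name and tie-breaking rule are mine) compares the consensus vote with a single informed voter:

```python
import random

def simulate_crowd(p, n_informed=15, n_guessers=35, n_questions=10,
                   n_choices=4, trials=2000, seed=1):
    """Monte Carlo check of consensus vs. individual accuracy.
    Informed voters are correct with probability p, with errors spread
    evenly over the other choices; guessers pick uniformly at random."""
    rng = random.Random(seed)
    consensus_correct = individual_correct = 0
    for _ in range(trials):
        for _ in range(n_questions):
            votes = [0] * n_choices          # choice 0 is "correct"
            for _ in range(n_informed):
                if rng.random() < p:
                    votes[0] += 1
                else:
                    votes[rng.randrange(1, n_choices)] += 1
            for _ in range(n_guessers):
                votes[rng.randrange(n_choices)] += 1
            # consensus = plurality vote (ties go to the lowest index)
            if votes.index(max(votes)) == 0:
                consensus_correct += 1
            # one informed individual, for comparison
            if rng.random() < p:
                individual_correct += 1
    total = trials * n_questions
    return consensus_correct / total, individual_correct / total

consensus, individual = simulate_crowd(p=0.75)
```

At p = 0.75 the informed block contributes about 11 votes to the correct choice against roughly 10 per wrong choice from everyone's noise, so the consensus is right far more often than any single informed voter.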
- 15. ROC curve for TREE, SVM and Random Forest on SPAM data
  [Figure: ROC curves for the three methods; Random Forest error 4.75%, SVM error 6.7%, TREE error 8.7%.]
  Random Forest dominates both other methods on the SPAM data, with 4.75% error. Used 500 trees with default settings for the randomForest package in R.
- 16. RF: Decorrelation
  [Figure: correlation between trees vs. m, the number of randomly selected splitting variables; the correlation rises with m.]
  • Var f̂_RF(x) = ρ(x) σ²(x).
  • The correlation arises because we are growing trees on the same data.
  • The correlation decreases as m decreases; m is the number of variables sampled at each split.
  Based on a simple experiment with random forest regression models.
- 17. Random Forest Ensemble
  [Figure: mean squared error, squared bias, and variance of the ensemble as functions of m.]
  • The smaller m, the lower the variance of the random forest ensemble.
  • Small m is also associated with higher bias, because important variables can be missed by the sampling.
- 18. Boosting
  [Figure: schematic of boosting; classifiers C₁(x), C₂(x), ..., C_M(x) fit to the training sample and successively reweighted samples.]
  • Breakthrough invention of Freund & Schapire (1997).
  • Average many trees, each grown to reweighted versions of the training data.
  • Weighting decorrelates the trees, by focusing on regions missed by past trees.
  • Final classifier is a weighted average of classifiers:
      C(x) = sign( Σ_{m=1}^{M} α_m C_m(x) ).
  Note: C_m(x) ∈ {−1, +1}.
- 19. Boosting vs Bagging
  [Figure: test error vs. number of terms for bagging and AdaBoost, 100-node trees.]
  • 2000 points from nested spheres in ℝ¹⁰.
  • Bayes error rate is 0%.
  • Trees are grown best-first without pruning.
  • The leftmost term is a single tree.
- 20. AdaBoost (Freund & Schapire, 1996)
  1. Initialize the observation weights wᵢ = 1/N, i = 1, 2, ..., N.
  2. For m = 1 to M repeat steps (a)–(d):
     (a) Fit a classifier C_m(x) to the training data using weights wᵢ.
     (b) Compute the weighted error of the newest tree:
         err_m = Σ_{i=1}^{N} wᵢ I(yᵢ ≠ C_m(xᵢ)) / Σ_{i=1}^{N} wᵢ.
     (c) Compute α_m = log[(1 − err_m)/err_m].
     (d) Update the weights for i = 1, ..., N:
         wᵢ ← wᵢ · exp[α_m · I(yᵢ ≠ C_m(xᵢ))],
         and renormalize the wᵢ to sum to 1.
  3. Output C(x) = sign( Σ_{m=1}^{M} α_m C_m(x) ).
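Steps 1–3 above translate almost line for line into code. A sketch with decision stumps as the base classifier (pure Python, written for this note; the `1e-10` guard against a zero-error stump is my addition, not part of the slide's algorithm):

```python
import math

def fit_stump(X, y, w):
    """Weighted decision stump: pick the (feature, threshold, sign)
    minimizing the weighted 0-1 error."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for s in (1, -1):
                err = sum(wi for wi, yi, row in zip(w, y, X)
                          if yi != (s if row[j] > t else -s))
                if err < best_err:
                    best_err, best = err, (j, t, s)
    return best, best_err

def stump_predict(stump, x):
    j, t, s = stump
    return s if x[j] > t else -s

def adaboost(X, y, M=10):
    N = len(X)
    w = [1.0 / N] * N                       # step 1: uniform weights
    ensemble = []
    for _ in range(M):
        stump, err = fit_stump(X, y, w)     # step 2(a)-(b)
        err = max(err, 1e-10)               # guard: perfect stump
        alpha = math.log((1 - err) / err)   # step 2(c)
        ensemble.append((alpha, stump))
        w = [wi * math.exp(alpha * (yi != stump_predict(stump, xi)))
             for wi, yi, xi in zip(w, y, X)]  # step 2(d): upweight misses
        total = sum(w)
        w = [wi / total for wi in w]        # renormalize
    def C(x):                               # step 3: weighted vote
        score = sum(a * stump_predict(s, x) for a, s in ensemble)
        return 1 if score >= 0 else -1
    return C

# Tiny 1-D demo: a threshold at x = 2 separates the classes.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [-1, -1, 1, 1]
C = adaboost(X, y, M=5)
```

Note how step (d) only touches the misclassified points: exp(α_m · 0) = 1 leaves correctly classified weights unchanged before renormalization.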
- 21. Boosting Stumps
  [Figure: test error vs. boosting iterations; a single stump, a 400-node tree, and boosted stumps.]
  • A stump is a two-node tree, after a single split.
  • Boosting stumps works remarkably well on the nested-spheres problem.
- 22. Training Error
  [Figure: train and test error vs. number of terms.]
  • Nested spheres in 10 dimensions.
  • Bayes error is 0%.
  • Boosting drives the training error to zero (green curve).
  • Further iterations continue to improve the test error in many examples (red curve).
- 23. Noisy Problems
  [Figure: train and test error vs. number of terms, with a horizontal line at the Bayes error.]
  • Nested Gaussians in 10 dimensions.
  • Bayes error is 25%.
  • Boosting with stumps.
  • Here the test error does increase, but quite slowly.
- 24. Stagewise Additive Modeling
  Boosting builds an additive model
      F(x) = Σ_{m=1}^{M} β_m b(x; γ_m).
  Here b(x; γ_m) is a tree, and γ_m parametrizes the splits. We do things like that in statistics all the time!
  • GAMs: F(x) = Σ_j f_j(x_j)
  • Basis expansions: F(x) = Σ_{m=1}^{M} θ_m h_m(x)
  Traditionally the parameters f_m, θ_m are fit jointly (e.g. by least squares or maximum likelihood). With boosting, the parameters (β_m, γ_m) are fit in a stagewise fashion. This slows the process down, and overfits less quickly.
- 25. Example: Least Squares Boosting
  1. Start with the function F₀(x) = 0 and residual r = y; set m = 0.
  2. m ← m + 1.
  3. Fit a CART regression tree to r, giving g(x).
  4. Set f_m(x) ← ε · g(x).
  5. Set F_m(x) ← F_{m−1}(x) + f_m(x), r ← r − f_m(x), and repeat steps 2–5 many times.
  • g(x) is typically a shallow tree, e.g. constrained to have only k = 3 splits (depth).
  • Stagewise fitting (no backfitting!) slows down the (over-)fitting process.
  • 0 < ε ≤ 1 is a shrinkage parameter; it slows overfitting even further.
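Steps 1–5 above can be sketched directly, using a one-split regression stump in place of a CART tree so the example stays self-contained (the helper names `fit_regression_stump` and `ls_boost` are mine, not from the slides):

```python
def fit_regression_stump(x, r):
    """Best single split on a 1-D feature, minimizing squared error;
    returns (threshold, left mean, right mean)."""
    best, best_sse = None, float("inf")
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if sse < best_sse:
            best_sse, best = sse, (t, ml, mr)
    return best

def ls_boost(x, y, M=200, eps=0.1):
    """Least squares boosting: repeatedly fit a stump to the residuals,
    add a shrunken copy to the model, and update the residuals."""
    r = list(y)                        # F0 = 0, so residuals start at y
    stumps = []
    for _ in range(M):
        t, ml, mr = fit_regression_stump(x, r)     # step 3
        stumps.append((t, ml, mr))
        for i, xi in enumerate(x):                 # steps 4-5
            r[i] -= eps * (ml if xi <= t else mr)
    def F(x0):
        return eps * sum(ml if x0 <= t else mr for t, ml, mr in stumps)
    return F

# Demo: the target is itself a step function, so boosted stumps with
# shrinkage converge geometrically to it.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 0.0, 5.0, 5.0, 5.0]
F = ls_boost(x, y, M=200, eps=0.1)
```

With ε = 0.1 each round removes only 10% of the remaining residual, which is exactly the slow, deliberately inefficient fitting the slide describes.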
- 26. Least Squares Boosting and ε
  [Figure: prediction error vs. number of steps m, for a single tree, ε = 1, and ε = 0.01.]
  • If instead of trees we use linear regression terms, boosting with small ε is very similar to the lasso!
- 27. AdaBoost: Stagewise Modeling
  • AdaBoost builds an additive logistic regression model
      F(x) = log [Pr(Y = 1|x) / Pr(Y = −1|x)] = Σ_{m=1}^{M} α_m f_m(x)
  by stagewise fitting using the loss function
      L(y, F(x)) = exp(−y F(x)).
  • Instead of fitting trees to residuals, the special form of the exponential loss leads to fitting trees to weighted versions of the original data. See p. 343.
- 28. Why Exponential Loss?
  [Figure: misclassification, exponential, binomial deviance, squared error, and support vector losses as functions of the margin y·F.]
  • e^{−yF(x)} is a monotone, smooth upper bound on misclassification loss at x.
  • It leads to a simple reweighting scheme.
  • It has the logit transform as population minimizer:
      F*(x) = ½ log [Pr(Y = 1|x) / Pr(Y = −1|x)].
  • y · F is the margin; a positive margin means correct classification.
  • Other loss functions, like binomial deviance, are more robust.
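The upper-bound claim is easy to verify numerically: for margin m ≤ 0 we have exp(−m) ≥ 1, the 0-1 loss, and for m > 0 we have exp(−m) > 0. A small check (function names are mine):

```python
import math

def misclassification(margin):
    """0-1 loss as a function of the margin m = y * F(x)."""
    return 1.0 if margin <= 0 else 0.0

def exponential(margin):
    """AdaBoost's loss exp(-y F(x)), as a function of the margin."""
    return math.exp(-margin)

# Sample both losses on a grid of margins in [-3, 3]; the exponential
# loss should sit on or above the 0-1 loss everywhere.
margins = [i / 10 for i in range(-30, 31)]
gap = [exponential(m) - misclassification(m) for m in margins]
```

The two losses touch only at margin 0, where both equal 1; everywhere else the exponential loss lies strictly above, which is why minimizing it also drives the misclassification rate down.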
- 29. General Boosting Algorithms
  The boosting paradigm can be extended to general loss functions, not only squared error or exponential.
  1. Initialize F₀(x) = 0.
  2. For m = 1 to M:
     (a) Compute (β_m, γ_m) = arg min_{β,γ} Σ_{i=1}^{N} L(yᵢ, F_{m−1}(xᵢ) + β b(xᵢ; γ)).
     (b) Set F_m(x) = F_{m−1}(x) + β_m b(x; γ_m).
  Sometimes we replace step (b) by
     (b*) Set F_m(x) = F_{m−1}(x) + ε β_m b(x; γ_m).
  Here ε is a shrinkage factor; in the gbm package in R, the default is ε = 0.001. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.
- 30. Gradient Boosting
  • A general boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.
  • Gradient boosting builds additive tree models, for example for representing the logits in logistic regression.
  • Tree size is a parameter that determines the order of interaction (next slide).
  • Gradient boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
  • Gradient boosting is described in detail in section 10.10, and is implemented in Greg Ridgeway's gbm package in R.
- 31. Spam Data
  [Figure: test error vs. number of trees for bagging, random forest, and gradient boosting with 5-node trees.]
- 32. Spam Example Results
  With 3000 training and 1500 test observations, gradient boosting fits an additive logistic model
      f(x) = log [Pr(spam|x) / Pr(email|x)]
  using trees with J = 6 terminal nodes. Gradient boosting achieves a test error of 4.5%, compared to 5.3% for an additive GAM, 4.75% for random forests, and 8.7% for CART.
- 33. Spam: Variable Importance
  [Figure: relative importance of the predictors; the most important include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george and edu, while the least important include 3d, addresses, labs, telnet, 857 and 415.]
- 34. Spam: Partial Dependence
  [Figure: partial dependence of the log-odds of spam on the predictors !, remove, edu, and hp.]
- 35. California Housing Data
  [Figure: test average absolute error vs. number of trees for RF (m = 2), RF (m = 6), GBM (depth 4), and GBM (depth 6).]
- 36. Tree Size
  [Figure: test error vs. number of terms for AdaBoost with stumps, 10-node trees, and 100-node trees.]
  The tree size J determines the interaction order of the model:
      η(X) = Σ_j η_j(X_j) + Σ_{jk} η_{jk}(X_j, X_k) + Σ_{jkl} η_{jkl}(X_j, X_k, X_l) + ···
- 37. Stumps Win!
  Since the true decision boundary is the surface of a sphere, the function that describes it has the form
      f(X) = X₁² + X₂² + ... + X_p² − c = 0.
  Boosted stumps via gradient boosting return reasonable approximations to these quadratic functions.
  [Figure: coordinate functions f₁(x₁), ..., f₁₀(x₁₀) for additive logistic trees.]
- 38. Learning Ensembles
  Ensemble learning consists of two steps:
  1. Construction of a dictionary D = {T₁(X), T₂(X), ..., T_M(X)} of basis elements (weak learners) T_m(X).
  2. Fitting a model f(X) = Σ_{m∈D} α_m T_m(X).
  Simple examples of ensembles:
  • Linear regression: the ensemble consists of the coordinate functions T_m(X) = X_m. The fitting is done by least squares.
  • Random forests: the ensemble consists of trees grown on bootstrapped versions of the data, with additional randomization at each split. The fitting simply averages.
  • Gradient boosting: the ensemble is grown in an adaptive fashion, but then simply averaged at the end.
- 39. Learning Ensembles and the Lasso
  Proposed by Popescu and Friedman (2004): fit the model f(X) = Σ_{m∈D} α_m T_m(X) in step 2 using lasso regularization:
      α(λ) = arg min_α Σ_{i=1}^{N} L[yᵢ, α₀ + Σ_{m=1}^{M} α_m T_m(xᵢ)] + λ Σ_{m=1}^{M} |α_m|.
  Advantages:
  • Often ensembles are large, involving thousands of small trees. The post-processing selects a smaller subset of these and combines them efficiently.
  • Often the post-processing improves the process that generated the ensemble.
- 40. Post-Processing Random Forest Ensembles
  [Figure: test error vs. number of trees on the spam data for a random forest, a post-processed random forest (5%, 6), and gradient boosting with 5-node trees.]
- 41. Software
  • R: free GPL statistical computing environment available from CRAN; implements the S language. Includes:
    – randomForest: implementation of Leo Breiman's algorithms.
    – rpart: Terry Therneau's implementation of classification and regression trees.
    – gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.
  • Salford Systems: commercial implementation of trees, random forests and gradient boosting.
  • Splus (Insightful): commercial version of S.
  • Weka: GPL software from the University of Waikato, New Zealand. Includes trees, random forests and many other procedures.
