Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles

This tutorial, based on a published book by Giovanni Seni, offers a hands-on intro to ensemble models, which combine multiple models into a single predictive system that’s often more accurate than the best of its components. Participants will use data sets and snippets of R code to experiment with the methods to gain a practical understanding of this breakthrough technology.

Giovanni Seni is currently a Senior Data Scientist with Intuit, where he leads the Applied Data Sciences team. As an active data mining practitioner in Silicon Valley, he has over 15 years of R&D experience in statistical pattern recognition and data mining applications. He has been a member of the technical staff at large technology companies, and a contributor at smaller organizations. He holds five US patents and has published over twenty conference and journal articles. His book with John Elder, “Ensemble Methods in Data Mining – Improving accuracy through combining predictions”, was published in February 2010 by Morgan & Claypool. Giovanni is also an adjunct faculty member at the Computer Engineering Department of Santa Clara University, where he teaches an Introduction to Pattern Recognition and Data Mining class.

Published in: Education, Technology

  1. 1. How to Create Predictive Models in R using Ensembles Giovanni Seni, Ph.D. Intuit @IntuitInc Giovanni_Seni@intuit.com Santa Clara University GSeni@scu.edu Strata - Hadoop World, New York October 28, 2013
  2. 2. Reference © 2013 G.Seni 2013 Strata Conference + Hadoop World 2
  3. 3. Overview •  Motivation, In a Nutshell & Timeline •  Predictive Learning & Decision Trees •  Ensemble Methods - Diversity & Importance Sampling –  Bagging –  Random Forest –  AdaBoost –  Gradient Boosting –  Rule Ensembles •  Summary
  4. 4. Motivation Volume 9 Issue 2
  5. 5. Motivation (2) “1′st Place Algorithm Description: … 4. Classification: Ensemble classification methods are used to combine multiple classifiers. Two separate Random Forest ensembles are created based on the shadow index (one for the shadow-covered area and one for the shadow-free area). The random forest “Out of Bag” error is used to automatically evaluate features according to their impact, resulting in 45 features selected for the shadow-free and 55 for the shadow-covered part.”
  6. 6. Motivation (3) •  “What are the best of the best techniques at winning Kaggle competitions? –  Ensembles of Decision Trees –  Deep Learning account for 90% of top 3 winners!” Jeremy Howard, Chief Scientist of Kaggle, KDD 2013 ⇒ Key common characteristics: –  Resistance to overfitting –  Universal approximations
  7. 7. Ensemble Methods in a Nutshell •  “Algorithmic” statistical procedure •  Based on combining the fitted values from a number of fitting attempts •  Loosely related to: –  Iterative procedures –  Bootstrap procedures •  Original idea: a “weak” procedure can be strengthened if it can operate “by committee” –  e.g., combining low-bias/high-variance procedures •  Accompanied by interpretation methodology
  8. 8. Timeline •  CART (Breiman, Friedman, Stone, Olshen, 1983) •  Bagging (Breiman, 1996) –  Random Forest (Ho, 1995; Breiman 2001) •  AdaBoost (Freund, Schapire, 1997) •  Boosting – a statistical view (Friedman et al., 2000) –  Gradient Boosting (Friedman, 2001) –  Stochastic Gradient Boosting (Friedman, 1999) •  Importance Sampling Learning Ensembles (ISLE) (Friedman, Popescu, 2003)
  9. 9. Timeline (2) •  Regularization – variance control techniques: –  Lasso (Tibshirani, 1996) –  LARS (Efron, 2004) –  Elastic Net (Zou, Hastie, 2005) –  GLMs via Coordinate Descent (Friedman, Hastie, Tibshirani, 2008) •  Rule Ensembles (Friedman, Popescu, 2008)
  10. 10. Overview •  Motivation, In a Nutshell & Timeline ➢  Predictive Learning & Decision Trees •  Ensemble Methods •  Summary
  11. 11. Predictive Learning Procedure Summary •  Given "training" data $D = \{y_i, x_{i1}, x_{i2}, \ldots, x_{in}\}_1^N = \{y_i, \mathbf{x}_i\}_1^N$ –  $D$ is a random sample from some unknown (joint) distribution •  Build a functional model $y = F(x_1, x_2, \ldots, x_n) = F(\mathbf{x})$ –  Offers adequate and interpretable description of how the inputs affect the outputs –  Parsimony is an important criterion: simpler models are preferred for the sake of scientific insight into the $\mathbf{x}$-$y$ relationship •  Need to specify: < model, score criterion, search strategy >
  12. 12. Predictive Learning Procedure Summary (2) •  Model: underlying functional form sought from data: $F(\mathbf{x}) = F(\mathbf{x}; \mathbf{a}) \in \mathcal{F}$, a family of functions indexed by $\mathbf{a}$ •  Score criterion: judges (lack of) quality of fitted model –  Loss function $L(y, F)$: penalizes individual errors in prediction –  Risk $R(\mathbf{a}) = E_{y,\mathbf{x}} L(y, F(\mathbf{x}; \mathbf{a}))$: the expected loss over all predictions •  Search Strategy: minimization procedure of score criterion: $\mathbf{a}^* = \arg\min_{\mathbf{a}} R(\mathbf{a})$
  13. 13. Predictive Learning Procedure Summary (3) •  “Surrogate” Score criterion: –  Training data: $\{y_i, \mathbf{x}_i\}_1^N \sim p(\mathbf{x}, y)$ –  $p(\mathbf{x}, y)$ unknown ⇒ $\mathbf{a}^*$ unknown ⇒ Use approximation: Empirical Risk •  $\hat{R}(\mathbf{a}) = \frac{1}{N} \sum_{i=1}^N L(y_i, F(\mathbf{x}_i; \mathbf{a}))$ ⇒ $\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \hat{R}(\mathbf{a})$ •  If not $N \gg n$, $R(\hat{\mathbf{a}}) \gg R(\mathbf{a}^*)$
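A minimal sketch of the empirical-risk idea on slide 13, written in Python rather than the R used by the tutorial's scripts. The helper names (`empirical_risk`, `linear_model`), the squared-error loss, and the three-point data set are illustrative assumptions, not taken from the slides.

```python
def squared_loss(y, y_hat):
    """L(y, F) = (y - F)^2, one possible loss function."""
    return (y - y_hat) ** 2


def empirical_risk(model, data, loss=squared_loss):
    """(1/N) * sum_i L(y_i, F(x_i; a)) over (y, x) training pairs."""
    return sum(loss(y, model(x)) for y, x in data) / len(data)


def linear_model(a):
    """F(x; a) = a[0] + sum_j a[j] * x[j-1], a family indexed by a."""
    return lambda x: a[0] + sum(aj * xj for aj, xj in zip(a[1:], x))


# Toy training data generated exactly by y = 1 + 2 * x.
train = [(1.0, (0.0,)), (3.0, (1.0,)), (5.0, (2.0,))]

risk_good = empirical_risk(linear_model([1.0, 2.0]), train)  # the generating a
risk_bad = empirical_risk(linear_model([0.0, 2.0]), train)   # intercept off by 1
```

Comparing `risk_good` and `risk_bad` is the search step in miniature: the parameter vector with the smaller empirical risk is preferred, although with few observations relative to the number of inputs that ranking need not carry over to the true risk.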
  14. 14. Predictive Learning Example •  A simple data set:
        Attribute-1 ($x_1$) | Attribute-2 ($x_2$) | Class ($y$)
        1.0 | 2.0 | blue
        2.0 | 1.0 | green
        … | … | …
        4.5 | 3.5 | ?
      •  What is the class of the new point? •  Many approaches… no method is universally better; try several / use committee
  15. 15. Predictive Learning Example (2) •  Ordinary Linear Regression (OLR) –  Model: $F(\mathbf{x}) = a_0 + \sum_{j=1}^n a_j x_j$; predict one class if $F(\mathbf{x}) \ge 0$, the other class otherwise ⇒ Not flexible enough [Figure: linear decision boundary in the ($x_1$, $x_2$) plane]
  16. 16. Decision Trees Overview [Figure: binary tree with splits $x_1 \ge 5$, $x_2 \ge 3$, $x_1 \ge 2$ and the corresponding partition of the ($x_1$, $x_2$) plane into regions $R_1$–$R_4$] •  Model: $\hat{y} = T_M(\mathbf{x}) = \sum_{m=1}^M \hat{c}_m I_{\hat{R}_m}(\mathbf{x})$, with $\{\hat{R}_m\}_{m=1}^M$ = sub-regions of input variable space, where $I_R(\mathbf{x}) = 1$ if $\mathbf{x} \in R$, 0 otherwise
  17. 17. Decision Trees Overview (2) •  Score criterion: –  Classification – "0-1 loss" ⇒ misclassification error (or surrogate): $\{\hat{c}_m, \hat{R}_m\}_1^M = \arg\min_{T_M = \{c_m, R_m\}_1^M} \sum_{i=1}^N I(y_i \ne T_M(\mathbf{x}_i))$ –  Regression – least squares – i.e., $L(y, \hat{y}) = (y - \hat{y})^2$: $\{\hat{c}_m, \hat{R}_m\}_1^M = \arg\min_{T_M = \{c_m, R_m\}_1^M} \sum_{i=1}^N (y_i - T_M(\mathbf{x}_i))^2$, where the minimized sum is $\hat{R}(T_M)$ •  Search: Find $\hat{T} = \arg\min_T \hat{R}(T)$ –  i.e., find best regions $R_m$ and constants $c_m$
  18. 18. Decision Trees Overview (3) •  Joint optimization with respect to $R_m$ and $c_m$ simultaneously is very difficult ⇒ use a greedy iterative procedure [Figure: tree grown one split at a time from the root region $R_0$; each split $(j_k, s_k)$, $k = 1, \ldots, 4$, divides one region into two]
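The greedy procedure on slide 18 can be sketched as follows, in Python rather than the tutorial's R. This is a toy least-squares tree under stated assumptions: the function names, the midpoint rule for candidate split points, the depth limit, and the four-point data set are all illustrative and are not taken from the slides or from `fitModel_CART.R`.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """One greedy step: the (score, j, s) minimizing squared error of x[j] < s."""
    best = None
    for j in range(len(X[0])):
        cuts = sorted({x[j] for x in X})
        for lo, hi in zip(cuts, cuts[1:]):
            s = (lo + hi) / 2.0
            left = [yi for x, yi in zip(X, y) if x[j] < s]
            right = [yi for x, yi in zip(X, y) if x[j] >= s]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, s)
    return best


def grow(X, y, depth):
    """Split recursively until depth runs out or the node is pure."""
    split = best_split(X, y) if depth > 0 and len(set(y)) > 1 else None
    if split is None:
        return sum(y) / len(y)              # leaf: region constant c_m
    _, j, s = split
    left = [(x, yi) for x, yi in zip(X, y) if x[j] < s]
    right = [(x, yi) for x, yi in zip(X, y) if x[j] >= s]
    return (j, s,
            grow([x for x, _ in left], [yi for _, yi in left], depth - 1),
            grow([x for x, _ in right], [yi for _, yi in right], depth - 1))


def predict(tree, x):
    """Follow the splits down to a leaf constant."""
    while isinstance(tree, tuple):
        j, s, left, right = tree
        tree = left if x[j] < s else right
    return tree


X = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
y = [0.0, 0.0, 10.0, 10.0]
tree = grow(X, y, depth=2)
```

Each call to `best_split` fixes one split given everything chosen above it, so earlier choices are never revisited; that is what makes the search greedy rather than jointly optimal.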
  19. 19. Decision Trees What is the “right” size of a model? [Figure: the same $y$ vs. $x$ scatter fitted with two constants ($c_1$, $c_2$) vs. three constants ($c_1$, $c_2$, $c_3$)] •  Dilemma –  If model (# of splits) is too small, then approximation is too crude (bias) ⇒ increased errors –  If model is too large, then it fits the training data too closely (overfitting, increased variance) ⇒ increased errors
  20. 20. Decision Trees What is the “right” size of a model? (2) [Figure: prediction error vs. model complexity for the training sample and the test sample; low complexity gives high bias and low variance, high complexity gives low bias and high variance; the test curve reaches its minimum at $M^*$] –  Right-sized tree, $M^*$, when test error is at a minimum –  Error on the training sample is not a useful estimator! •  If test set is not available, need alternative method
  21. 21. Decision Trees Pruning to obtain “right” size •  Two strategies –  Prepruning - stop growing a branch when information becomes unreliable •  #($R_m$) – i.e., number of data points, too small ⇒ same bound everywhere in the tree •  Next split not worthwhile ⇒ Not sufficient condition –  Postpruning - take a fully-grown tree and discard unreliable parts (i.e., not supported by test data) •  C4.5: pessimistic pruning •  CART: cost-complexity pruning (more statistically grounded)
  22. 22. Decision Trees Hands-on Exercise •  Start RStudio •  Navigate to directory: example.1.LinearBoundary •  Set working directory: use setwd() or with GUI •  Load and run “fitModel_CART.R” •  If curious, also see “gen2DdataLinear.R” •  After boosting discussion, load and run “fitModel_GBM.R” [Figure: scatter plot of the 2D data, $x_1$ vs. $x_2$, both ranging from 0.0 to 1.0]
  23. 23. Decision Trees Key Features •  Ability to deal with irrelevant inputs –  i.e., automatic variable subset selection –  Measure anything you can measure –  Score provided for selected variables ("importance") •  No data preprocessing needed -  Naturally handle all types of variables •  numeric, binary, categorical -  Invariant under monotone transformations: $\tilde{x}_j = g_j(x_j)$ •  Variable scales are irrelevant •  Immune to bad $x_j$-distributions (e.g., outliers)
  24. 24. Decision Trees Key Features (2) •  Computational scalability –  Relatively fast: $O(nN \log N)$ •  Missing value tolerant -  Moderate loss of accuracy due to missing values -  Handling via "surrogate" splits •  "Off-the-shelf" procedure -  Few tunable parameters •  Interpretable model representation -  Binary tree graphic
  25. 25. Decision Trees Limitations •  Discontinuous piecewise constant model [Figure: step-function $F(x)$ vs. $x$] –  In order to have many splits you need to have a lot of data •  In high-dimensions, you often run out of data after a few splits –  Also note error is bigger near region boundaries
  26. 26. Decision Trees Limitations (2) •  Not good for low interaction $F^*(\mathbf{x})$ –  e.g., $F^*(\mathbf{x}) = a_0 + \sum_{j=1}^n a_j x_j = \sum_{j=1}^n f_j^*(x_j)$ (no interaction, additive) is worst function for trees –  In order for $x_l$ to enter model, must split on it •  Path from root to node is a product of indicators •  Not good for $F^*(\mathbf{x})$ that has dependence on many variables -  Each split reduces training data for subsequent splits (data fragmentation)
  27. 27. Decision Trees Limitations (3) •  High variance caused by greedy search strategy (local optima) –  Errors in upper splits are propagated down to affect all splits below them ⇒ Small changes in data (sampling fluctuations) can cause big changes in tree - Very deep trees might be questionable - Pruning is important •  What to do next? –  Live with problems –  Use other methods (when possible) –  Fix-up trees: use ensembles
  28. 28. Overview •  In a Nutshell & Timeline •  Predictive Learning & Decision Trees ➢  Ensemble Methods –  In a Nutshell, Diversity & Importance Sampling –  Generic Ensemble Generation –  Bagging, RF, AdaBoost, Boosting, Rule Ensembles •  Summary
  29. 29. Ensemble Methods In a Nutshell •  Model: $F(\mathbf{x}) = c_0 + \sum_{m=1}^M c_m T_m(\mathbf{x})$ –  $\{T_m(\mathbf{x})\}_1^M$: “basis” functions (or “base learners”) –  i.e., linear model in a (very) high dimensional space of derived variables •  Learner characterization: $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$ –  $\mathbf{p}_m$: a specific set of joint parameter values – e.g., split definitions at internal nodes and predictions at terminal nodes –  $\{T(\mathbf{x}; \mathbf{p})\}_{\mathbf{p} \in P}$: function class – i.e., set of all base learners of specified family
  30. 30. Ensemble Methods In a Nutshell (2) •  Learning: two-step process; approximate solution to $\{\hat{c}_m, \hat{\mathbf{p}}_m\}_0^M = \arg\min_{\{c_m, \mathbf{p}_m\}_0^M} \sum_{i=1}^N L\left(y_i, c_0 + \sum_{m=1}^M c_m T(\mathbf{x}_i; \mathbf{p}_m)\right)$ –  Step 1: Choose points $\{\mathbf{p}_m\}_1^M$ •  i.e., select $\{T_m(\mathbf{x})\}_1^M \subset \{T(\mathbf{x}; \mathbf{p})\}_{\mathbf{p} \in P}$ –  Step 2: Determine weights $\{c_m\}_0^M$ •  e.g., via regularized LR
  31. 31. Ensemble Methods Importance Sampling (Friedman, 2003) •  How to judiciously choose the “basis” functions (i.e., $\{\mathbf{p}_m\}_1^M$)? •  Goal: find “good” $\{\mathbf{p}_m\}_1^M$ so that $F(\mathbf{x}; \{\mathbf{p}_m\}_1^M, \{c_m\}_1^M) \cong F^*(\mathbf{x})$ •  Connection with numerical integration: –  $\int_P I(\mathbf{p})\, d\mathbf{p} \approx \sum_{m=1}^M w_m I(\mathbf{p}_m)$ [Figure: two ways of placing the evaluation points, labeled “vs.”; annotation: “Accuracy improves when we choose more points from this region…”]
  32. 32. Importance Sampling Numerical Integration via Monte Carlo Methods •  $r(\mathbf{p})$ = sampling pdf of $\mathbf{p} \in P$ -- i.e., $\{\mathbf{p}_m \sim r(\mathbf{p})\}_1^M$ –  Simple approach: $r(\mathbf{p}_m)$ iid -- i.e., uniform –  In our problem: inversely related to $\mathbf{p}_m$’s “risk” •  i.e., $T(\mathbf{x}; \mathbf{p}_m)$ has high error ⇒ lack of relevance of $\mathbf{p}_m$ ⇒ low $r(\mathbf{p}_m)$ •  “Quasi” Monte Carlo: –  with/out knowledge of the other points that will be used •  i.e., single point vs. group importance –  Sequential approximation: $\mathbf{p}$’s relevance judged in the context of the (fixed) previously selected points
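The numerical-integration analogy on slides 31 and 32 can be illustrated with a small Python sketch (the tutorial's own scripts are in R). The integrand $3p^2$ on $[0, 1]$, the sampling density $r(p) = 2p$, and the function names are assumptions chosen so the exact answer (1) is known; nothing here is specific to ensembles.

```python
import math
import random


def integrand(p):
    return 3.0 * p * p            # integral over [0, 1] is exactly 1


def uniform_estimate(m, rng):
    """Simple approach: points drawn uniformly, each with weight 1/m."""
    return sum(integrand(rng.random()) for _ in range(m)) / m


def importance_estimate(m, rng):
    """Points drawn from r(p) = 2p, which favors the region where the
    integrand is large; each point is weighted by 1 / (m * r(p))."""
    total = 0.0
    for _ in range(m):
        p = math.sqrt(1.0 - rng.random())    # inverse-CDF draw, p in (0, 1]
        total += integrand(p) / (2.0 * p)
    return total / m


rng = random.Random(0)
est_uniform = uniform_estimate(20000, rng)
est_importance = importance_estimate(20000, rng)
```

Both estimators target the same integral; the importance-sampled one has lower variance here because its weighted terms $1.5p$ vary less than $3p^2$ does under uniform sampling.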
  33. 33. Ensemble Methods Importance Sampling – Characterization of •  Let $\mathbf{p}^* = \arg\min_{\mathbf{p}} \mathrm{Risk}(\mathbf{p})$ •  Narrow $r(\mathbf{p})$: –  Ensemble $\{T(\mathbf{x}; \mathbf{p}_m)\}_1^M$ of “strong” base learners - i.e., all with $\mathrm{Risk}(\mathbf{p}_m) \approx \mathrm{Risk}(\mathbf{p}^*)$ –  $T(\mathbf{x}; \mathbf{p}_m)$’s yield similar, highly correlated predictions ⇒ unexceptional performance •  Broad $r(\mathbf{p})$: –  Diverse ensemble - i.e., predictions are not highly correlated with each other –  However, many “weak” base learners - i.e., $\mathrm{Risk}(\mathbf{p}_m) \gg \mathrm{Risk}(\mathbf{p}^*)$ ⇒ poor performance
  34. 34. Ensemble Methods Approximate Process of Drawing from •  Heuristic sampling strategy: sampling around $\mathbf{p}^*$ by iteratively applying small perturbations to existing problem structure –  Generating ensemble members $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$: For $m = 1$ to $M$ { $\mathbf{p}_m = \mathrm{PERTURB}_m\{\arg\min_{\mathbf{p}} E_{\mathbf{x}y} L(y, T(\mathbf{x}; \mathbf{p}))\}$ } –  $\mathrm{PERTURB}\{\cdot\}$ is a (random) modification of any of •  Data distribution - e.g., by re-weighting the observations •  Loss function - e.g., by modifying its argument •  Search algorithm (used to find $\min_{\mathbf{p}}$) –  Width of $r(\mathbf{p})$ is controlled by degree of perturbation
  35. 35. Generic Ensemble Generation Step 1: Choose Base Learners $\{\mathbf{p}_m\}_1^M$ •  Forward Stagewise Fitting Procedure:
        $F_0(\mathbf{x}) = 0$
        For $m = 1$ to $M$ {
          // Fit a single base learner
          $\mathbf{p}_m = \arg\min_{\mathbf{p}} \sum_{i \in S_m(\eta)} L(y_i, F_{m-1}(\mathbf{x}_i) + T(\mathbf{x}_i; \mathbf{p}))$
          // Update additive expansion
          $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
          $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot T_m(\mathbf{x})$
        }
        write $\{T_m(\mathbf{x})\}_1^M$
      –  Algorithm control: $L$, $\eta$, $\upsilon$ •  $S_m(\eta)$: random sub-sample of size $\eta \le N$ ⇒ impacts ensemble "diversity" (modification of data distribution) •  $F_{m-1}(\mathbf{x}) = \upsilon \cdot \sum_{k=1}^{m-1} T_k(\mathbf{x})$: “memory” function ($0 \le \upsilon \le 1$) (modification of loss function, “sequential” approximation)
  36. 36. Generic Ensemble Generation Step 2: Choose Coefficients $\{c_m\}_0^M$ •  Given $\{T_m(\mathbf{x})\}_{m=1}^M = \{T(\mathbf{x}; \mathbf{p}_m)\}_{m=1}^M$, coefficients can be obtained by a regularized linear regression: $\{\hat{c}_m\} = \arg\min_{\{c_m\}} \sum_{i=1}^N L\left(y_i, c_0 + \sum_{m=1}^M c_m T_m(\mathbf{x}_i)\right) + \lambda \cdot P(\mathbf{c})$ –  Regularization here helps reduce bias (in addition to variance) of the model –  New iterative fast algorithms for various loss/penalty combinations •  “GLMs via Coordinate Descent” (2008)
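One way to carry out Step 2 is a lasso fit by cyclic coordinate descent; the sketch below is in Python (not the tutorial's R) and assumes squared-error loss with penalty $P(\mathbf{c}) = \sum_m |c_m|$, which is only one of the loss/penalty combinations the slide allows. The function names and the tiny matrix of base-learner outputs are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_coefficients(T, y, lam, sweeps=500):
    """Cyclic coordinate descent for
    (1/2N) * sum_i (y_i - c0 - sum_m c[m] * T[i][m])^2 + lam * sum_m |c[m]|,
    where T[i][m] is base learner m's prediction for observation i."""
    n_obs, n_learners = len(T), len(T[0])
    c0, c = 0.0, [0.0] * n_learners
    for _ in range(sweeps):
        # Unpenalized intercept: mean of the current partial residual.
        c0 = sum(y[i] - sum(c[m] * T[i][m] for m in range(n_learners))
                 for i in range(n_obs)) / n_obs
        for m in range(n_learners):
            # Correlation of learner m with the residual that excludes it.
            rho = sum(T[i][m] * (y[i] - c0 - sum(c[k] * T[i][k]
                                                  for k in range(n_learners) if k != m))
                      for i in range(n_obs)) / n_obs
            z = sum(T[i][m] ** 2 for i in range(n_obs)) / n_obs
            c[m] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return c0, c


# Two hypothetical base learners; the response depends only on the first.
T = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y = [2.0, 2.0, 0.0, 0.0]
c0_ls, c_ls = fit_coefficients(T, y, lam=0.0)        # no penalty: least squares
c0_big, c_big = fit_coefficients(T, y, lam=10.0)     # heavy penalty
```

With `lam=0.0` the fit recovers the least-squares weights, while a large `lam` drives every base-learner coefficient to zero and leaves only the intercept, which is how the penalty selects a sparse subset of the ensemble.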
  37. 37. Bagging (Breiman, 1996) •  Bagging = Bootstrap Aggregation •  $L(y, \hat{y})$: as available for single tree •  $\upsilon = 0$ ⇒ no memory •  $\eta = N/2$ •  $T_m(\mathbf{x})$ ⇒ are large un-pruned trees •  $c_0 = 0$, $\{c_m = 1/M\}_1^M$ i.e., not fit to the data (avg)
        $F_0(\mathbf{x}) = 0$
        For $m = 1$ to $M$ {
          $\mathbf{p}_m = \arg\min_{\mathbf{p}} \sum_{i \in S_m(\eta)} L(y_i, F_{m-1}(\mathbf{x}_i) + T(\mathbf{x}_i; \mathbf{p}))$
          $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
          $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot T_m(\mathbf{x})$
        }
        write $\{T_m(\mathbf{x})\}_1^M$
      –  i.e., perturbation of the data distribution only –  Potential improvements? –  R package: ipred
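A by-hand bagging sketch in Python, offered as an illustration only; it is not the content of `fitModel_Bagging_by_hand.R`, which the transcript does not show. Two simplifications are assumed: one-split stumps stand in for the large un-pruned trees on the slide, and each sample is a classic bootstrap drawn with replacement rather than the $\eta = N/2$ subsample of the generic procedure.

```python
import random


def fit_stump(x, y):
    """Least-squares stump on one input: returns (s, left_mean, right_mean)."""
    mean = sum(y) / len(y)
    best = (float("inf"), None, mean, mean)
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        s = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        score = (sum((v - lm) ** 2 for v in left)
                 + sum((v - rm) ** 2 for v in right))
        if score < best[0]:
            best = (score, s, lm, rm)
    return best[1:]


def stump_predict(stump, xi):
    s, lm, rm = stump
    if s is None:                 # degenerate sample: constant prediction
        return lm
    return lm if xi < s else rm


def bagging(x, y, n_trees, rng):
    """Fit each stump to its own bootstrap sample of the training data."""
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(x)) for _ in range(len(x))]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps


def bagged_predict(stumps, xi):
    """c_0 = 0 and c_m = 1/M: a plain average, not fit to the data."""
    return sum(stump_predict(st, xi) for st in stumps) / len(stumps)


x = [i / 20.0 for i in range(20)]
y = list(x)                       # a ramp that no single stump can follow
stumps = bagging(x, y, n_trees=25, rng=random.Random(0))
```

Because every bootstrap sample places its split slightly differently, the averaged prediction takes many more distinct values along `x` than the two levels a single stump can produce.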
  38. 38. Bagging Hands-on Exercise •  Navigate to directory: example.2.EllipticalBoundary •  Set working directory: use setwd() or with GUI •  Load and run –  fitModel_Bagging_by_hand.R –  fitModel_CART.R (optional) •  If curious, also see gen2DdataNonLinear.R •  After class, load and run fitModel_Bagging.R [Figure: scatter plot of the 2D data, $x_1$ from -2 to 2 and $x_2$ from -1.0 to 1.0]
  39. 39. Bagging Why it helps? •  Under $L(y, \hat{y}) = (y - \hat{y})^2$, averaging reduces variance and leaves bias unchanged •  Consider “idealized” bagging (aggregate) estimator: $\bar{f}(\mathbf{x}) = E \hat{f}_Z(\mathbf{x})$ –  $\hat{f}_Z$ fit to bootstrap data set $Z = \{y_i, \mathbf{x}_i\}_1^N$ –  $Z$ is sampled from actual population distribution (not training data) –  We can write: $E[Y - \hat{f}_Z(\mathbf{x})]^2 = E[Y - \bar{f}(\mathbf{x}) + \bar{f}(\mathbf{x}) - \hat{f}_Z(\mathbf{x})]^2 = E[Y - \bar{f}(\mathbf{x})]^2 + E[\hat{f}_Z(\mathbf{x}) - \bar{f}(\mathbf{x})]^2 \ge E[Y - \bar{f}(\mathbf{x})]^2$ ⇒  true population aggregation never increases mean squared error! ⇒  Bagging will often decrease MSE…
  40. 40. Random Forest (Ho, 1995; Breiman, 2001) •  Random Forest = Bagging + algorithm randomizing –  Subset splitting: As each tree is constructed… •  Draw a random sample of predictors before each node is split: $n_s = \lfloor \log_2(n) + 1 \rfloor$ •  Find best split as usual but selecting only from subset of predictors ⇒ Increased diversity among $\{T_m(\mathbf{x})\}_1^M$ - i.e., wider $r(\mathbf{p})$ •  Width (inversely) controlled by $n_s$ –  Speed improvement over Bagging –  R package: randomForest
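The subset-splitting rule on slide 40 is easy to check numerically. The Python sketch below (function names are illustrative assumptions) computes $n_s = \lfloor \log_2(n) + 1 \rfloor$ and draws the predictor subset that a node would be allowed to split on.

```python
import math
import random


def n_split_candidates(n):
    """n_s = floor(log2(n) + 1): predictors considered at each node."""
    return int(math.floor(math.log2(n) + 1))


def draw_predictor_subset(n, rng):
    """Indices of the random predictor subset drawn before a node is split."""
    return sorted(rng.sample(range(n), n_split_candidates(n)))


# With n = 100 predictors, only 7 are examined at each node.
subset = draw_predictor_subset(100, random.Random(0))
```

Searching 7 candidates instead of 100 at every node is where the speed improvement over bagging comes from, and the smaller the subset, the more the individual trees differ from one another.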
  41. 41. Bagging vs. Random Forest vs. ISLE 100 Target Functions Comparison (Popescu, 2005) •  ISLE improvements: –  Different data sampling strategy (not fixed) –  Fit coefficients to data •  xxx_6_5%_P: 6 terminal nodes trees; 5% samples without replacement; Post-processing – i.e., using estimated “optimal” quadrature coefficients ⇒ Significantly faster to build! [Figure: comparative RMS error for Bag, RF, Bag_6_5%_P and RF_6_5%_P]
  42. 42. AdaBoost (Freund & Schapire, 1997)
      AdaBoost:
        observation weights: $w_i^{(1)} = 1/N$
        For $m = 1$ to $M$ {
          a. Fit a classifier $T_m(\mathbf{x})$ to training data with $w_i^{(m)}$
          b. Compute $err_m = \sum_{i=1}^N w_i^{(m)} I(y_i \ne T_m(\mathbf{x}_i)) \,/\, \sum_{i=1}^N w_i^{(m)}$
          c. Compute $\alpha_m = \log((1 - err_m) / err_m)$
          d. Set $w_i^{(m+1)} = w_i^{(m)} \cdot \exp[\alpha_m \cdot I(y_i \ne T_m(\mathbf{x}_i))]$
        }
        Output $\mathrm{sign}\left(\sum_{m=1}^M \alpha_m T_m(\mathbf{x})\right)$
      Forward Stagewise Fitting Procedure:
        $F_0(\mathbf{x}) = 0$
        For $m = 1$ to $M$ {
          $(c_m, \mathbf{p}_m) = \arg\min_{c, \mathbf{p}} \sum_{i \in S_m(\eta)} L(y_i, F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p}))$
          $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
          $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot c_m \cdot T_m(\mathbf{x})$
        }
        write $\{c_m, T_m(\mathbf{x})\}_1^M$
      •  Equivalence to Forward Stagewise Fitting Procedure (see Book) –  We need to show $\mathbf{p}_m = \arg\min_{\mathbf{p}}(\cdot)$ is equivalent to line a. above –  $c_m = \arg\min_c(\cdot)$ is equivalent to line c. •  R package adabag
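Lines a through d of the AdaBoost listing can be traced on a small example. The Python sketch below is an illustration under assumptions, not the content of `fitModel_Adaboost_by_hand.R`: the base classifier is a one-split stump on a single input, labels are coded as +1/-1, and the ten-point data set is made up so that no single stump classifies it perfectly.

```python
import math


def fit_weighted_stump(x, y, w):
    """Line a: the threshold/polarity stump with the smallest weighted error."""
    best = None
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        s = (lo + hi) / 2.0
        for left_label in (1, -1):
            err = sum(wi for xi, yi, wi in zip(x, y, w)
                      if (left_label if xi < s else -left_label) != yi)
            if best is None or err < best[0]:
                best = (err, s, left_label)
    return best[1], best[2]


def stump_predict(stump, xi):
    s, left_label = stump
    return left_label if xi < s else -left_label


def adaboost(x, y, n_rounds):
    n = len(x)
    w = [1.0 / n] * n                                  # initial observation weights
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_weighted_stump(x, y, w)                       # line a
        miss = [stump_predict(stump, xi) != yi for xi, yi in zip(x, y)]
        err = sum(wi for wi, mi in zip(w, miss) if mi) / sum(w)   # line b
        err = min(max(err, 1e-12), 1.0 - 1e-12)   # guard against log(0); not on the slide
        alpha = math.log((1.0 - err) / err)                       # line c
        w = [wi * math.exp(alpha) if mi else wi
             for wi, mi in zip(w, miss)]                          # line d
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, xi):
    """Output sign(sum_m alpha_m * T_m(x))."""
    score = sum(alpha * stump_predict(st, xi) for alpha, st in ensemble)
    return 1 if score >= 0 else -1


x = [float(i) for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(x, y, n_rounds=3)
errors = sum(predict(ensemble, xi) != yi for xi, yi in zip(x, y))
```

The best single stump misclassifies three of the ten points, yet after three rounds of re-weighting the weighted vote of three stumps separates the training set, which is the "committee of weak learners" idea in its simplest form.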
  43. 43. AdaBoost Hands-on Exercise •  Navigate to directory: example.2.EllipticalBoundary •  Set working directory: use setwd() or with GUI •  Load and run –  fitModel_Adaboost_by_hand.R •  After class, load and run fitModel_Adaboost.R and fitModel_RandomForest.R [Figure: scatter plot of the 2D data, $x_1$ from -2 to 2 and $x_2$ from -1.0 to 1.0]
  44. 44. Stochastic Gradient Boosting (Friedman, 2001) •  Boosting with any differentiable loss criterion •  General $L(y, \hat{y})$ •  $\upsilon = 0.1$ ⇒ Sequential sampling •  $\eta = N/2$ •  $T_m(\mathbf{x})$ ⇒ Any “weak” learner •  $c_0 = \arg\min_c \sum_{i=1}^N L(y_i, c)$ •  $\{c_m\}_1^M$ ⇒ “shrunk” sequential partial regression coefficients
        $F_0(\mathbf{x}) = c_0$
        For $m = 1$ to $M$ {
          $(c_m, \mathbf{p}_m) = \arg\min_{c, \mathbf{p}} \sum_{i \in S_m(\eta)} L(y_i, F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p}))$
          $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
          $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot c_m \cdot T_m(\mathbf{x})$
        }
        write $\{(\upsilon \cdot c_m), T_m(\mathbf{x})\}_1^M$
      –  Potential improvements? –  R package: gbm
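For squared-error loss the procedure on slide 44 reduces to repeatedly fitting a base learner to the current residuals. The Python sketch below illustrates that special case and is not the `gbm` package's implementation: stumps serve as the weak learner, the coefficient $c_m$ is absorbed into the stump's least-squares leaf means, and the function names and quadratic toy target are assumptions.

```python
import random


def fit_stump(x, r):
    """Least-squares stump fit to targets r: returns (s, left_mean, right_mean)."""
    mean = sum(r) / len(r)
    best = (float("inf"), None, mean, mean)
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        s = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi < s]
        right = [ri for xi, ri in zip(x, r) if xi >= s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        score = (sum((v - lm) ** 2 for v in left)
                 + sum((v - rm) ** 2 for v in right))
        if score < best[0]:
            best = (score, s, lm, rm)
    return best[1:]


def stump_predict(stump, xi):
    s, lm, rm = stump
    return lm if s is None or xi < s else rm


def gradient_boost(x, y, n_rounds, shrinkage, eta, rng):
    """Each round fits a stump to the residuals on a random subsample of
    size eta (drawn without replacement) and adds a shrunk copy of it."""
    f0 = sum(y) / len(y)                  # c_0 = argmin_c sum_i (y_i - c)^2
    fitted = [f0] * len(x)
    stumps = []
    for _ in range(n_rounds):
        idx = rng.sample(range(len(x)), eta)              # S_m(eta)
        stump = fit_stump([x[i] for i in idx],
                          [y[i] - fitted[i] for i in idx])
        stumps.append(stump)
        fitted = [fi + shrinkage * stump_predict(stump, xi)
                  for fi, xi in zip(fitted, x)]
    return f0, stumps, fitted


def mse(y, fitted):
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)


x = [i / 20.0 for i in range(20)]
y = [xi * xi for xi in x]
f0, stumps, fitted = gradient_boost(x, y, n_rounds=100, shrinkage=0.1,
                                    eta=len(x) // 2, rng=random.Random(0))
```

Setting `eta` to the full sample size turns off the stochastic part, and in that case every round is guaranteed not to increase the training squared error; with subsampling the decrease holds only on average.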
  45. 45. Stochastic Gradient Boosting LAD Regression – $L(y, F) = |y - F|$ •  More robust than $(y - F)^2$ •  Resistant to outliers in $y$ … trees already providing resistance to outliers in $\mathbf{x}$
        $F_0(\mathbf{x}) = \mathrm{median}\{y_i\}_1^N$
        For $m = 1$ to $M$ {
          // Step 1: find $T_m(\mathbf{x})$
          $\tilde{y}_i = \mathrm{sign}(y_i - F_{m-1}(\mathbf{x}_i))$
          $\{R_{jm}\}_1^J$ = $J$-terminal node LS-regression tree$(\{\tilde{y}_i, \mathbf{x}_i\}_1^N)$
          // Step 2: find coefficients
          $\hat{\gamma}_{jm} = \mathrm{median}_{\mathbf{x}_i \in R_{jm}}\{y_i - F_{m-1}(\mathbf{x}_i)\}$, $j = 1 \ldots J$
          // Update expansion
          $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot \sum_{j=1}^J \hat{\gamma}_{jm} I(\mathbf{x} \in R_{jm})$
        }
      •  Note: –  Trees are fitted to pseudo-response ⇒ Can’t interpret individual trees –  Original tree constants are overwritten –  “shrunk” version of tree gets added to ensemble
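The LAD listing on slide 45 can be followed step by step in a short Python sketch. It is an illustration under assumptions: trees are restricted to $J = 2$ terminal nodes on a single input, there is no subsampling, and the data set with one gross outlier is invented to show the robustness claim.

```python
import statistics


def best_cut(x, target):
    """Split point of the least-squares stump fit to target (None if x is constant)."""
    best = (float("inf"), None)
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        s = (lo + hi) / 2.0
        left = [t for xi, t in zip(x, target) if xi < s]
        right = [t for xi, t in zip(x, target) if xi >= s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        score = (sum((t - lm) ** 2 for t in left)
                 + sum((t - rm) ** 2 for t in right))
        if score < best[0]:
            best = (score, s)
    return best[1]


def sign(v):
    return (v > 0) - (v < 0)


def lad_boost(x, y, n_rounds, shrinkage):
    fitted = [statistics.median(y)] * len(y)          # F_0(x) = median{y_i}
    for _ in range(n_rounds):
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        # Step 1: regions from an LS tree fitted to the pseudo-response sign(resid).
        s = best_cut(x, [sign(r) for r in resid])
        if s is None:
            break
        # Step 2: overwrite the leaf constants with medians of the residuals.
        g_left = statistics.median([r for xi, r in zip(x, resid) if xi < s])
        g_right = statistics.median([r for xi, r in zip(x, resid) if xi >= s])
        # Update: add a shrunk version of the tree to the expansion.
        fitted = [fi + shrinkage * (g_left if xi < s else g_right)
                  for fi, xi in zip(fitted, x)]
    return fitted


def mae(y, fitted):
    return sum(abs(yi - fi) for yi, fi in zip(y, fitted)) / len(y)


x = [float(i) for i in range(10)]
y = [0.0, 1.0, 2.0, 3.0, 40.0, 5.0, 6.0, 7.0, 8.0, 9.0]   # one gross outlier in y
fitted = lad_boost(x, y, n_rounds=50, shrinkage=0.1)
initial_mae = mae(y, [statistics.median(y)] * len(y))
```

Because the tree sees only the sign of each residual and the leaf values are medians, the outlier at `x = 4` influences the fit no more than any other point on the same side of the current model.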
  46. 46. Parallel vs. Sequential Ensembles 100 Target Functions Comparison (Popescu, 2005) •  xxx_6_5%_P: 6 terminal nodes trees; 5% samples without replacement; Post-processing – i.e., using estimated “optimal” quadrature coefficients •  Seq_υ_η%_P: “Sequential” ensemble; 6 terminal nodes trees; υ: “memory” factor; η% samples without replacement; Post-processing •  Sequential ISLE tend to perform better than parallel ones –  Consistent with results observed in classical Monte Carlo integration [Figure: comparative RMS error for the “parallel” ensembles (Bag, RF, Bag_6_5%_P, RF_6_5%_P) and the “sequential” ensembles (Boost, Seq_0.01_20%_P, Seq_0.1_50%_P)]
  47. 47. Rule Ensembles (Friedman & Popescu, 2005) •  Trees as collection of conjunctive rules: $T_m(\mathbf{x}) = \sum_{j=1}^J \hat{c}_{jm} I(\mathbf{x} \in \hat{R}_{jm})$ [Figure: a tree predicting $\hat{y}$ and its partition of the ($x_1$, $x_2$) plane into regions $R_1$–$R_5$, with split points 15 and 22 on $x_1$ and 15 and 27 on $x_2$]
        $R_1$ ⇒ $r_1(\mathbf{x}) = I(x_1 > 22) \cdot I(x_2 > 27)$
        $R_2$ ⇒ $r_2(\mathbf{x}) = I(x_1 > 22) \cdot I(0 \le x_2 \le 27)$
        $R_3$ ⇒ $r_3(\mathbf{x}) = I(15 < x_1 \le 22) \cdot I(0 \le x_2)$
        $R_4$ ⇒ $r_4(\mathbf{x}) = I(0 \le x_1 \le 15) \cdot I(x_2 > 15)$
        $R_5$ ⇒ $r_5(\mathbf{x}) = I(0 \le x_1 \le 15) \cdot I(0 \le x_2 \le 15)$
      –  These simple rules, $r_m(\mathbf{x}) \in \{0, 1\}$, can be used as base learners –  Main motivation is interpretability
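The five rules on slide 47 translate directly into indicator functions. In the Python sketch below the rule definitions follow the slide, while the region constants in `CONSTANTS` are hypothetical placeholders, since the slide gives no numeric values.

```python
# Each rule is the conjunction of interval conditions defining one region.
RULES = {
    "r1": lambda x1, x2: x1 > 22 and x2 > 27,
    "r2": lambda x1, x2: x1 > 22 and 0 <= x2 <= 27,
    "r3": lambda x1, x2: 15 < x1 <= 22 and 0 <= x2,
    "r4": lambda x1, x2: 0 <= x1 <= 15 and x2 > 15,
    "r5": lambda x1, x2: 0 <= x1 <= 15 and 0 <= x2 <= 15,
}

# Hypothetical region constants c_j; not taken from the slide.
CONSTANTS = {"r1": 5.0, "r2": 4.0, "r3": 3.0, "r4": 2.0, "r5": 1.0}


def fired_rules(x1, x2):
    """Names of the rules with r(x) = 1 at the point (x1, x2)."""
    return [name for name, rule in RULES.items() if rule(x1, x2)]


def tree_predict(x1, x2):
    """T(x) = sum_j c_j * r_j(x); the regions partition the plane,
    so exactly one term is non-zero."""
    return sum(CONSTANTS[name] for name in fired_rules(x1, x2))


# Check on a grid that every point is covered by exactly one rule.
grid = [(a, b) for a in range(0, 41, 5) for b in range(0, 41, 5)]
coverage = {len(fired_rules(a, b)) for a, b in grid}
```

Within a single tree the terminal-node rules are mutually exclusive; a rule ensemble pools rules from many trees, so several rules can fire on the same point and their coefficients add.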
  48. 48. Rule Ensembles ISLE Procedure •  Rule-based model: $F(\mathbf{x}) = a_0 + \sum_m a_m r_m(\mathbf{x})$ –  Still a piecewise constant model ⇒ complement the non-linear rules with purely linear terms (coefficients $\{b_j\}_1^P$ in Step 2) •  Fitting –  Step 1: derive rules from tree ensemble (shortcut) •  Tree size controls rule “complexity” (interaction order) –  Step 2: fit coefficients using linear regularized procedure: $(\{\hat{a}_k\}, \{\hat{b}_j\}) = \arg\min_{\{a_k\}, \{b_j\}} \sum_{i=1}^N L\left(y_i, F(\mathbf{x}_i; \{a_k\}_0^K, \{b_j\}_1^P)\right) + \lambda \cdot [P(\mathbf{a}) + P(\mathbf{b})]$
  49. 49. Boosting & Rule Ensembles Hands-on Exercise •  Navigate to directory: example.3.Diamonds •  Set working directory: use setwd() or with GUI •  Load and run –  viewDiamondData.R –  fitModel_GBM.R –  fitModel_RE.R •  After class, go to: example.1.LinearBoundary –  Run fitModel_GBM.R [Figure: absolute loss (0 to 2500) vs. iteration (0 to 1000)]
  50. 50. Overview •  Motivation, In a Nutshell & Timeline •  Predictive Learning & Decision Trees •  Ensemble Methods ➢  Summary
  51. 51. Summary •  Ensemble methods have been found to perform extremely well in a variety of problem domains •  Shown to have desirable statistical properties •  Latest ensemble research brings together important foundational strands of statistics •  Emphasis on accuracy but significant progress has been made on interpretability Go build Ensembles and keep in touch!
