How to Create Predictive Models in
R using Ensembles
Giovanni Seni, Ph.D.
Intuit
@IntuitInc

Giovanni_Seni@intuit.com
Santa Clara University
GSeni@scu.edu

Strata - Hadoop World, New York
October 28, 2013
Reference

Overview
•  Motivation, In a Nutshell & Timeline
•  Predictive Learning & Decision Trees
•  Ensemble Methods - Diversity & Importance Sampling
–  Bagging
–  Random Forest
–  Ada Boost
–  Gradient Boosting
–  Rule Ensembles

•  Summary

Motivation

[Magazine cover: Volume 9, Issue 2]

Motivation (2)

“1st Place Algorithm Description: … 4. Classification: Ensemble classification methods are used to combine
multiple classifiers. Two separate Random Forest ensembles are created based on the shadow index (one for the
shadow-covered area and one for the shadow-free area). The random forest “Out of Bag” error is used to
automatically evaluate features according to their impact, resulting in 45 features selected for the shadow-free and
55 for the shadow-covered part.”
Motivation (3)
•  “What are the best of the best techniques at winning
Kaggle competitions?
–  Ensembles of Decision Trees
–  Deep Learning

account for 90% of top 3 winners!”
Jeremy Howard, Chief Scientist of Kaggle
KDD 2013
⇒ Key common characteristics:
–  Resistance to overfitting
–  Universal approximations
Ensemble Methods in a Nutshell
•  “Algorithmic” statistical procedure
•  Based on combining the fitted values from a number of
fitting attempts
•  Loosely related to:
–  Iterative procedures
–  Bootstrap procedures

•  Original idea: a “weak” procedure can be strengthened if
it can operate “by committee”
–  e.g., combining low-bias/high-variance procedures

•  Accompanied by interpretation methodology
Timeline
•  CART (Breiman, Friedman, Olshen, Stone, 1984)
•  Bagging (Breiman, 1996)
–  Random Forest (Ho, 1995; Breiman, 2001)

•  AdaBoost (Freund, Schapire, 1997)
•  Boosting – a statistical view (Friedman et al., 2000)
–  Gradient Boosting (Friedman, 2001)
–  Stochastic Gradient Boosting (Friedman, 1999)

•  Importance Sampling Learning Ensembles (ISLE)
(Friedman, Popescu, 2003)
Timeline (2)
•  Regularization – variance control techniques:
–  Lasso (Tibshirani, 1996)
–  LARS (Efron et al., 2004)
–  Elastic Net (Zou, Hastie, 2005)
–  GLMs via Coordinate Descent (Friedman, Hastie, Tibshirani, 2008)

•  Rule Ensembles (Friedman, Popescu, 2008)

Overview
•  Motivation, In a Nutshell & Timeline
Ø  Predictive Learning & Decision Trees
•  Ensemble Methods
•  Summary

Predictive Learning
Procedure Summary
•  Given "training" data $D = \{y_i, x_{i1}, x_{i2}, \ldots, x_{in}\}_1^N = \{y_i, \mathbf{x}_i\}_1^N$
–  D is a random sample from some unknown (joint) distribution

•  Build a functional model $\hat{y} = \hat{F}(x_1, x_2, \ldots, x_n) = \hat{F}(\mathbf{x})$
–  Offers adequate and interpretable description of how the inputs
affect the outputs
–  Parsimony is an important criterion: simpler models are preferred
for the sake of scientific insight into the x - y relationship

•  Need to specify: < model, score criterion, search strategy >
Predictive Learning
Procedure Summary (2)
•  Model: underlying functional form sought from data
$\hat{F}(\mathbf{x}) = F(\mathbf{x}; \mathbf{a}) \in \mathcal{F}$, a family of functions indexed by parameters $\mathbf{a}$

•  Score criterion: judges (lack of) quality of fitted model
–  Loss function $L(y, F)$: penalizes individual errors in prediction
–  Risk $R(\mathbf{a}) = E_{y,\mathbf{x}}\, L(y, F(\mathbf{x}; \mathbf{a}))$: the expected loss over all predictions

•  Search strategy: minimization procedure for the score criterion
$\mathbf{a}^* = \arg\min_{\mathbf{a}} R(\mathbf{a})$

Predictive Learning
Procedure Summary (3)
•  “Surrogate” score criterion:
–  Training data: $\{y_i, \mathbf{x}_i\}_1^N \sim p(\mathbf{x}, y)$
–  $p(\mathbf{x}, y)$ unknown ⇒ $\mathbf{a}^*$ unknown
⇒ Use an approximation, the Empirical Risk:

•  $\hat{R}(\mathbf{a}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, F(\mathbf{x}_i; \mathbf{a}))$  ⇒  $\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \hat{R}(\mathbf{a})$

•  If not $N \gg n$, then $R(\hat{\mathbf{a}}) \gg R(\mathbf{a}^*)$
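As a concrete aside, the empirical risk is a one-liner in R; a minimal sketch under squared-error loss (the function name and simulated data are illustrative, not from the tutorial's scripts):

```r
# Empirical risk: average loss of a fitted model over the training sample
empirical_risk <- function(y, y_hat, loss = function(y, f) (y - f)^2) {
  mean(loss(y, y_hat))
}

# Example: empirical risk of a simple linear fit on simulated data
set.seed(1)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.3)
fit <- lm(y ~ x)
empirical_risk(y, predict(fit))
```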
Predictive Learning
Example
•  A simple data set

    Attribute-1 (x1)    Attribute-2 (x2)    Class (y)
    1.0                 2.0                 blue
    2.0                 1.0                 green
    …                   …                   …
    4.5                 3.5                 ?

[Scatter plot of the two classes in the (x1, x2) plane]

•  What is the class of the new point?
•  Many approaches… no method is universally better; try several / use a committee
Predictive Learning
Example (2)
•  Ordinary Linear Regression (OLR)

[Figure: scatter plot with a linear decision boundary in the (x1, x2) plane]

–  Model: $F(\mathbf{x}) = a_0 + \sum_{j=1}^{n} a_j x_j$; predict one class if $F(\mathbf{x}) \ge 0$, the other otherwise

⇒ Not flexible enough
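For reference, a minimal sketch of OLR used as a classifier in R (illustrative only: the simulated data stands in for the example's, and this is not the tutorial's fitModel script):

```r
# OLR as a classifier: fit a linear model to a 0/1 response and
# threshold the fitted surface
set.seed(42)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- as.numeric(x2 > x1 + rnorm(n, sd = 0.1))   # linear boundary + noise

fit  <- lm(y ~ x1 + x2)
pred <- as.numeric(predict(fit) >= 0.5)          # linear decision rule
mean(pred != y)                                  # training error rate
```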
Decision Trees
Overview
[Diagram: recursive partition of the (x1, x2) plane into regions R1–R4 by the splits x1 ≥ 5, x2 ≥ 3, x1 ≥ 2, and the corresponding binary tree]

•  Model: $\hat{y} = \hat{T}(\mathbf{x}) = \sum_{m=1}^{M} \hat{c}_m I_{R_m}(\mathbf{x})$
–  $\{R_m\}_{m=1}^{M}$: sub-regions of the input variable space
–  where $I_R(\mathbf{x}) = 1$ if $\mathbf{x} \in R$, 0 otherwise
Decision Trees
Overview (2)
•  Score criterion:
–  Classification – "0-1 loss" ⇒ misclassification error (or a surrogate)

$\{\hat{c}_m, \hat{R}_m\}_1^M = \arg\min_{T_M = \{c_m, R_m\}_1^M} \sum_{i=1}^{N} I(y_i \ne T_M(\mathbf{x}_i))$

–  Regression – least squares – i.e., $L(y, \hat{y}) = (y - \hat{y})^2$

$\{\hat{c}_m, \hat{R}_m\}_1^M = \arg\min_{T_M = \{c_m, R_m\}_1^M} \underbrace{\textstyle\sum_{i=1}^{N} (y_i - T_M(\mathbf{x}_i))^2}_{\hat{R}(T_M)}$

•  Search: find $\hat{T} = \arg\min_T \hat{R}(T)$
–  i.e., find the best regions $R_m$ and constants $c_m$
Decision Trees
Overview (3)
•  Joint optimization with respect to $R_m$ and $c_m$ simultaneously is very difficult
⇒ use a greedy iterative procedure

[Diagram: a sequence of greedy splits $(j_1, s_1), (j_2, s_2), \ldots$ — each step picks a splitting variable $j$ and split point $s$, recursively partitioning the root region R0 into child regions R1, R2, …, R8, with the corresponding tree growing one node at a time]
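Each greedy step scans every candidate variable j and split point s for the pair that most reduces squared error; a minimal sketch of that inner search (illustrative only — real CART implementations such as rpart do this far more efficiently):

```r
# Best single split (j, s) for a regression node: minimize the SSE of
# fitting one constant on each side of the split
best_split <- function(X, y) {
  best <- list(sse = Inf)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[, j])) {
      left <- X[, j] <= s
      if (all(left) || !any(left)) next           # skip degenerate splits
      sse <- sum((y[left]  - mean(y[left]))^2) +
             sum((y[!left] - mean(y[!left]))^2)
      if (sse < best$sse) best <- list(j = j, s = s, sse = sse)
    }
  }
  best
}

set.seed(7)
X <- matrix(runif(200), ncol = 2)
y <- ifelse(X[, 1] > 0.5, 2, -1) + rnorm(100, sd = 0.2)
best_split(X, y)   # should recover a split near x1 = 0.5
```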
Decision Trees
What is the “right” size of a model?
[Figure: the same noisy y-vs-x data fit two ways — a 2-region tree (constants c1, c2) vs. a 3-region tree (constants c1, c2, c3)]

•  Dilemma
–  If the model (# of splits) is too small, then the approximation is too crude (bias) ⇒ increased errors
–  If the model is too large, then it fits the training data too closely (overfitting, increased variance) ⇒ increased errors
Decision Trees
What is the “right” size of a model? (2)
[Figure: prediction error vs. model complexity — training-sample error decreases monotonically while test-sample error is U-shaped; low complexity gives high bias / low variance, high complexity gives low bias / high variance]

–  Right-sized tree M*: where the test error is at a minimum
–  Error on the training data is not a useful estimator!
•  If a test set is not available, an alternative method is needed
Decision Trees
Pruning to obtain “right” size
•  Two strategies
–  Prepruning - stop growing a branch when information becomes
unreliable
•  #(Rm) – i.e., number of data points, too small
⇒ same bound everywhere in the tree
•  Next split not worthwhile
⇒ Not sufficient condition

–  Postpruning - take a fully-grown tree and discard unreliable parts
(i.e., not supported by test data)
•  C4.5: pessimistic pruning
•  CART: cost-complexity pruning (more statistically grounded)

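In R, CART-style cost-complexity pruning is available through the rpart package; a minimal sketch (the kyphosis data ships with rpart; the control settings are illustrative):

```r
library(rpart)

# Grow a deliberately large tree, then prune it back
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(cp = 0.001, minsplit = 5))
printcp(fit)   # cross-validated error at each complexity-parameter value

# Prune at the cp value minimizing cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```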
Decision Trees
Hands-on Exercise

[Figure: simulated 2-D data with a linear class boundary in the (x1, x2) unit square]

•  Start RStudio
•  Navigate to directory: example.1.LinearBoundary
•  Set working directory: use setwd() or the GUI
•  Load and run “fitModel_CART.R”
•  If curious, also see “gen2DdataLinear.R”
•  After the boosting discussion, load and run “fitModel_GBM.R”
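A sketch of what such a CART fit might look like (an assumption about the flavor of the exercise, not the actual fitModel_CART.R; simulated data stands in for gen2DdataLinear.R's output):

```r
library(rpart)

# Simulated stand-in for the linear-boundary data
set.seed(123)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(ifelse(d$x2 > d$x1 + rnorm(n, sd = 0.05), "blue", "green"))

# Fit and inspect a classification tree
fit <- rpart(y ~ x1 + x2, data = d, method = "class")
plot(fit); text(fit)   # the binary tree graphic
table(predicted = predict(fit, type = "class"), actual = d$y)
```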

Decision Trees
Key Features
•  Ability to deal with irrelevant inputs
–  i.e., automatic variable subset selection
–  Measure anything you can measure
–  Score provided for selected variables ("importance")

•  No data preprocessing needed
–  Naturally handle all types of variables
•  numeric, binary, categorical
–  Invariant under monotone transformations: $x_j \leftarrow g_j(x_j)$
•  Variable scales are irrelevant
•  Immune to bad $x_j$ distributions (e.g., outliers)
Decision Trees
Key Features (2)
•  Computational scalability
–  Relatively fast: O(nN log N )

•  Missing value tolerant
-  Moderate loss of accuracy due to missing values
-  Handling via "surrogate" splits

•  "Off-the-shelf" procedure
-  Few tunable parameters

•  Interpretable model representation
-  Binary tree graphic
Decision Trees
Limitations
•  Discontinuous piecewise constant model
[Figure: a piecewise-constant $F(\mathbf{x})$ stepping along a smooth underlying target]

–  In order to have many splits you need to have a lot of data
•  In high-dimensions, you often run out of data after a few splits

–  Also note error is bigger near region boundaries
Decision Trees
Limitations (2)
•  Not good for low-interaction $F^*(\mathbf{x})$
–  e.g., $F^*(\mathbf{x}) = a_0 + \sum_{j=1}^{n} a_j x_j = \sum_{j=1}^{n} f_j^*(x_j)$ (no interaction, additive) is the worst function for trees
–  In order for $x_l$ to enter the model, it must be split on
•  Path from root to node is a product of indicators

•  Not good for $F^*(\mathbf{x})$ that depends on many variables
–  Each split reduces the training data available for subsequent splits (data fragmentation)

Decision Trees
Limitations (3)
•  High variance caused by the greedy search strategy (local optima)
–  Errors in upper splits are propagated down to affect all splits below them
⇒ Small changes in the data (sampling fluctuations) can cause big changes in the tree
–  Very deep trees might be questionable
–  Pruning is important

•  What to do next?
–  Live with problems
–  Use other methods (when possible)
–  Fix-up trees: use ensembles
Overview
•  In a Nutshell & Timeline
•  Predictive Learning & Decision Trees
Ø  Ensemble Methods
–  In a Nutshell, Diversity & Importance Sampling
–  Generic Ensemble Generation
–  Bagging, RF, AdaBoost, Boosting, Rule Ensembles

•  Summary

Ensemble Methods
In a Nutshell
•  Model: $F(\mathbf{x}) = c_0 + \sum_{m=1}^{M} c_m T_m(\mathbf{x})$
–  $\{T_m(\mathbf{x})\}_1^M$: “basis” functions (or “base learners”)
–  i.e., a linear model in a (very) high-dimensional space of derived variables

•  Learner characterization: $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
–  $\mathbf{p}_m$: a specific set of joint parameter values – e.g., split definitions at internal nodes and predictions at terminal nodes
–  $\{T(\mathbf{x}; \mathbf{p})\}_{\mathbf{p} \in P}$: function class – i.e., the set of all base learners of the specified family
Ensemble Methods
In a Nutshell (2)
•  Learning: two-step process; approximate solution to

$\{c_m, \mathbf{p}_m\}_0^M = \arg\min_{\{c_m, \mathbf{p}_m\}_0^M} \sum_{i=1}^{N} L\left(y_i,\; c_0 + \sum_{m=1}^{M} c_m T(\mathbf{x}_i; \mathbf{p}_m)\right)$

–  Step 1: Choose points $\{\mathbf{p}_m\}_1^M$
•  i.e., select $\{T_m(\mathbf{x})\}_1^M \subset \{T(\mathbf{x}; \mathbf{p})\}_{\mathbf{p} \in P}$
–  Step 2: Determine weights $\{c_m\}_0^M$
•  e.g., via regularized linear regression

Ensemble Methods
Importance Sampling (Friedman, 2003)
•  How to judiciously choose the “basis” functions (i.e., $\{\mathbf{p}_m\}_1^M$)?
•  Goal: find “good” $\{\mathbf{p}_m\}_1^M$ so that $F(\mathbf{x}; \{\mathbf{p}_m\}_1^M, \{c_m\}_1^M) \cong F^*(\mathbf{x})$
•  Connection with numerical integration:
–  $\int_P I(\mathbf{p})\, d\mathbf{p} \approx \sum_{m=1}^{M} w_m I(\mathbf{p}_m)$

[Figure: an integrand $I(\mathbf{p})$ — accuracy improves when we choose more points from the region where it is large]
Importance Sampling
Numerical Integration via Monte Carlo Methods
•  $r(\mathbf{p})$: sampling pdf of $\mathbf{p} \in P$ – i.e., $\{\mathbf{p}_m \sim r(\mathbf{p})\}_1^M$
–  Simple approach: $\mathbf{p}_m$ i.i.d. – i.e., uniform
–  In our problem: inversely related to $\mathbf{p}_m$’s “risk”
•  i.e., $T(\mathbf{x}; \mathbf{p}_m)$ has high error ⇒ lack of relevance of $\mathbf{p}_m$ ⇒ low $r(\mathbf{p}_m)$

•  “Quasi” Monte Carlo:
–  with/without knowledge of the other points that will be used
•  i.e., single-point vs. group importance
–  Sequential approximation: $\mathbf{p}$’s relevance judged in the context of the (fixed) previously selected points
Ensemble Methods
Importance Sampling – Characterization of $r(\mathbf{p})$
•  Let $\mathbf{p}^* = \arg\min_{\mathbf{p}} \mathrm{Risk}(\mathbf{p})$

Narrow $r(\mathbf{p})$:
•  Ensemble $\{T(\mathbf{x}; \mathbf{p}_m)\}_1^M$ of “strong” base learners – i.e., all with $\mathrm{Risk}(\mathbf{p}_m) \approx \mathrm{Risk}(\mathbf{p}^*)$
•  The $T(\mathbf{x}; \mathbf{p}_m)$’s yield similar, highly correlated predictions ⇒ unexceptional performance

Broad $r(\mathbf{p})$:
•  Diverse ensemble – i.e., predictions are not highly correlated with each other
•  However, many “weak” base learners – i.e., $\mathrm{Risk}(\mathbf{p}_m) \gg \mathrm{Risk}(\mathbf{p}^*)$ ⇒ poor performance

Ensemble Methods
Approximate Process of Drawing from $r(\mathbf{p})$
•  Heuristic sampling strategy: sample around $\mathbf{p}^*$ by iteratively applying small perturbations to the existing problem structure
–  Generating ensemble members $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$:

For m = 1 to M {
    $\mathbf{p}_m = \mathrm{PERTURB}_m \{\arg\min_{\mathbf{p}} E_{\mathbf{x}y}\, L(y, T(\mathbf{x}; \mathbf{p}))\}$
}

–  PERTURB$\{\cdot\}$ is a (random) modification of any of:
•  the data distribution – e.g., by re-weighting the observations
•  the loss function – e.g., by modifying its argument
•  the search algorithm (used to find $\min_{\mathbf{p}}$)
–  The width of $r(\mathbf{p})$ is controlled by the degree of perturbation
Generic Ensemble Generation
Step 1: Choose Base Learners $\{\mathbf{p}_m\}_1^M$

•  Forward Stagewise Fitting Procedure:

$F_0(\mathbf{x}) = 0$
For m = 1 to M {
    // Fit a single base learner: modification of the data distribution via $S_m(\eta)$,
    // and of the loss function via $F_{m-1}$ (the “sequential” approximation)
    $\mathbf{p}_m = \arg\min_{\mathbf{p}} \sum_{i \in S_m(\eta)} L(y_i, F_{m-1}(\mathbf{x}_i) + T(\mathbf{x}_i; \mathbf{p}))$
    // Update additive expansion
    $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
    $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot T_m(\mathbf{x})$
}
write $\{T_m(\mathbf{x})\}_1^M$

–  Algorithm control: $L$, $\eta$, $\upsilon$
•  $S_m(\eta)$: random sub-sample of size $\eta \le N$ ⇒ impacts ensemble "diversity"
•  $F_{m-1}(\mathbf{x}) = \upsilon \cdot \sum_{k=1}^{m-1} T_k(\mathbf{x})$: “memory” function ($0 \le \upsilon \le 1$)
Generic Ensemble Generation
Step 2: Choose Coefficients $\{c_m\}_0^M$

•  Given $\{T_m(\mathbf{x})\}_{m=1}^M = \{T(\mathbf{x}; \mathbf{p}_m)\}_{m=1}^M$, the coefficients can be obtained by a regularized linear regression:

$\{\hat{c}_m\} = \arg\min_{\{c_m\}} \sum_{i=1}^{N} L\left(y_i,\; c_0 + \sum_{m=1}^{M} c_m T_m(\mathbf{x}_i)\right) + \lambda \cdot P(\mathbf{c})$

–  Regularization here helps reduce bias (in addition to variance) of the model
–  New fast iterative algorithms exist for various loss/penalty combinations
•  “GLMs via Coordinate Descent” (2008)
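In R, this post-processing step maps naturally onto the glmnet package (the coordinate-descent implementation from the paper just cited); a minimal sketch, assuming a matrix of base-learner predictions has already been built:

```r
library(glmnet)

# Suppose preds is an N x M matrix whose m-th column holds T_m(x_i) for the
# training points (from any ensemble); here a random placeholder is used
set.seed(1)
preds <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
y     <- as.numeric(preds %*% runif(20) + rnorm(100))

# Lasso-penalized squared-error fit of the coefficients c_m;
# cross-validation picks the penalty lambda
cvfit <- cv.glmnet(preds, y, alpha = 1)
coef(cvfit, s = "lambda.min")   # c_0 and the (sparse) c_m
```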

Bagging (Breiman, 1996)
•  Bagging = Bootstrap Aggregation

$F_0(\mathbf{x}) = 0$
For m = 1 to M {
    $\mathbf{p}_m = \arg\min_{\mathbf{p}} \sum_{i \in S_m(\eta)} L(y_i, F_{m-1}(\mathbf{x}_i) + T(\mathbf{x}_i; \mathbf{p}))$
    $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
    $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot T_m(\mathbf{x})$
}
write $\{T_m(\mathbf{x})\}_1^M$

•  $L(y, \hat{y})$: as available for a single tree
•  $\upsilon = 0$ ⇒ no memory
•  $\eta = N/2$
•  $T_m(\mathbf{x})$: large un-pruned trees
•  $c_0 = 0$, $\{c_m = 1/M\}_1^M$ – i.e., not fit to the data (a simple average)

–  i.e., perturbation of the data distribution only
–  Potential improvements?
–  R package: ipred
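A by-hand sketch of this recipe (in the spirit of the fitModel_Bagging_by_hand.R exercise, though not its actual contents), bagging un-pruned rpart trees and averaging their votes:

```r
library(rpart)

# Simulated two-class data (a stand-in for the elliptical-boundary example)
set.seed(99)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1^2 + d$x2^2 < 1.5, 1, 0))

M <- 100
votes <- matrix(0, nrow = n, ncol = M)
for (m in 1:M) {
  idx  <- sample(n, size = n / 2)      # eta = N/2, drawn without replacement
  tree <- rpart(y ~ x1 + x2, data = d[idx, ],
                control = rpart.control(cp = 0, minsplit = 2))  # large, un-pruned
  votes[, m] <- as.numeric(predict(tree, d, type = "class") == "1")
}
pred <- factor(as.numeric(rowMeans(votes) >= 0.5))  # c_m = 1/M: a simple average
mean(pred != d$y)                                   # training error of the bagged model
```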
Bagging
Hands-on Exercise

[Figure: simulated 2-D data with an elliptical class boundary in the (x1, x2) plane]

•  Navigate to directory: example.2.EllipticalBoundary
•  Set working directory: use setwd() or the GUI
•  Load and run
–  fitModel_Bagging_by_hand.R
–  fitModel_CART.R (optional)
•  If curious, also see gen2DdataNonLinear.R
•  After class, load and run fitModel_Bagging.R

Bagging
Why it helps?

•  Under $L(y, \hat{y}) = (y - \hat{y})^2$, averaging reduces variance and leaves bias unchanged

•  Consider the “idealized” bagging (aggregate) estimator: $\bar{f}(\mathbf{x}) = \mathrm{E}\, \hat{f}_Z(\mathbf{x})$
–  $\hat{f}_Z$ is fit to a bootstrap data set $Z = \{y_i, \mathbf{x}_i\}_1^N$
–  $Z$ is sampled from the actual population distribution (not the training data)
–  We can write:

$\mathrm{E}[Y - \hat{f}_Z(\mathbf{x})]^2 = \mathrm{E}[Y - \bar{f}(\mathbf{x}) + \bar{f}(\mathbf{x}) - \hat{f}_Z(\mathbf{x})]^2 = \mathrm{E}[Y - \bar{f}(\mathbf{x})]^2 + \mathrm{E}[\hat{f}_Z(\mathbf{x}) - \bar{f}(\mathbf{x})]^2 \ge \mathrm{E}[Y - \bar{f}(\mathbf{x})]^2$

⇒  true population aggregation never increases mean squared error!
⇒  Bagging will often decrease MSE…
Random Forest (Ho, 1995; Breiman, 2001)
•  Random Forest = Bagging + algorithm randomization
–  Subset splitting: as each tree is constructed…
•  Draw a random sample of $n_s$ predictors before each node is split, e.g. $n_s = \lfloor \log_2(n) + 1 \rfloor$
•  Find the best split as usual, but selecting only from the subset of predictors
⇒ Increased diversity among $\{T_m(\mathbf{x})\}_1^M$ – i.e., wider $r(\mathbf{p})$
•  Width (inversely) controlled by $n_s$
–  Speed improvement over Bagging
–  R package: randomForest
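A minimal usage sketch with the randomForest package (mtry plays the role of n_s; the simulated data is the same stand-in used in the bagging sketch above):

```r
library(randomForest)

set.seed(99)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1^2 + d$x2^2 < 1.5, 1, 0))

# mtry = number of predictors sampled at each split (n_s above)
rf <- randomForest(y ~ x1 + x2, data = d, ntree = 500, mtry = 1)
rf$err.rate[500, "OOB"]   # the "Out of Bag" error estimate
importance(rf)            # variable importance scores
```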
Bagging vs. Random Forest vs. ISLE
100 Target Functions Comparison (Popescu, 2005)
•  ISLE improvements:
–  Different data sampling strategy (not fixed)
–  Fit coefficients to the data

•  xxx_6_5%_P: 6-terminal-node trees; 5% samples without replacement; post-processing – i.e., using estimated “optimal” quadrature coefficients
⇒ Significantly faster to build!

[Figure: comparative RMS error boxplots for Bag, RF, Bag_6_5%_P and RF_6_5%_P]
AdaBoost (Freund & Schapire, 1997)

Original algorithm:

observation weights: $w_i^{(0)} = 1/N$
For m = 1 to M {
    a. Fit a classifier $T_m(\mathbf{x})$ to the training data with weights $w_i^{(m)}$
    b. Compute $err_m = \sum_{i=1}^{N} w_i^{(m)} I(y_i \ne T_m(\mathbf{x}_i)) \big/ \sum_{i=1}^{N} w_i^{(m)}$
    c. Compute $\alpha_m = \log((1 - err_m)/err_m)$
    d. Set $w_i^{(m+1)} = w_i^{(m)} \cdot \exp[\alpha_m \cdot I(y_i \ne T_m(\mathbf{x}_i))]$
}
Output $\mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m T_m(\mathbf{x})\right)$

As a Forward Stagewise Fitting Procedure:

$F_0(\mathbf{x}) = 0$
For m = 1 to M {
    $(c_m, \mathbf{p}_m) = \arg\min_{c, \mathbf{p}} \sum_{i \in S_m(\eta)} L(y_i, F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p}))$
    $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
    $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot c_m \cdot T_m(\mathbf{x})$
}
write $\{c_m, T_m(\mathbf{x})\}_1^M$

•  Equivalence to the Forward Stagewise Fitting Procedure (see book):
–  We need to show $\mathbf{p}_m = \arg\min_{\mathbf{p}}(\cdot)$ is equivalent to line a. above
–  $c_m = \arg\min_c(\cdot)$ is equivalent to line c.
•  R package: adabag
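A by-hand sketch of lines a–d (in the spirit of the fitModel_Adaboost_by_hand.R exercise, not its actual code), using depth-1 rpart stumps as the weak classifier:

```r
library(rpart)

set.seed(99)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1^2 + d$x2^2 < 1.5, 1, -1))

M  <- 50
w  <- rep(1 / n, n)   # observation weights
Fx <- rep(0, n)       # additive score
for (m in 1:M) {
  stump <- rpart(y ~ x1 + x2, data = d, weights = w,
                 control = rpart.control(maxdepth = 1))         # a. weighted fit
  pred  <- as.numeric(as.character(predict(stump, d, type = "class")))
  miss  <- pred != as.numeric(as.character(d$y))
  err   <- sum(w * miss) / sum(w)                               # b. weighted error
  alpha <- log((1 - err) / err)                                 # c. learner weight
  w     <- w * exp(alpha * miss)                                # d. re-weight
  Fx    <- Fx + alpha * pred
}
mean(sign(Fx) != as.numeric(as.character(d$y)))   # error of sign(sum alpha_m T_m)
```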
AdaBoost
Hands-on Exercise

[Figure: simulated 2-D data with an elliptical class boundary in the (x1, x2) plane]

•  Navigate to directory: example.2.EllipticalBoundary
•  Set working directory: use setwd() or the GUI
•  Load and run
–  fitModel_Adaboost_by_hand.R
•  After class, load and run fitModel_Adaboost.R and fitModel_RandomForest.R

Stochastic Gradient Boosting (Friedman, 2001)
•  Boosting with any differentiable loss criterion

$F_0(\mathbf{x}) = c_0$
For m = 1 to M {
    $(c_m, \mathbf{p}_m) = \arg\min_{c, \mathbf{p}} \sum_{i \in S_m(\eta)} L(y_i, F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p}))$
    $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
    $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot c_m \cdot T_m(\mathbf{x})$
}
write $\{(\upsilon \cdot c_m), T_m(\mathbf{x})\}_1^M$

•  General $L(y, \hat{y})$
•  $c_0 = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$
•  $\upsilon = 0.1$ ⇒ sequential sampling
•  $\eta = N/2$
•  $T_m(\mathbf{x})$: any “weak” learner
•  $\{c_m\}_1^M$: “shrunk” sequential partial regression coefficients
–  Potential improvements?
–  R package: gbm
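A minimal usage sketch with the gbm package (the parameter names are gbm's own; the values mirror the slide — shrinkage is υ and bag.fraction is η/N):

```r
library(gbm)

set.seed(99)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- as.numeric(d$x1^2 + d$x2^2 < 1.5)   # gbm expects 0/1 for "bernoulli"

fit <- gbm(y ~ x1 + x2, data = d,
           distribution = "bernoulli",
           n.trees = 1000,             # M
           shrinkage = 0.1,            # upsilon
           bag.fraction = 0.5,         # eta = N/2, sampled without replacement
           interaction.depth = 2)      # size of the "weak" trees
best_M <- gbm.perf(fit, method = "OOB")  # choose M from the out-of-bag estimate
```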
Stochastic Gradient Boosting
LAD Regression – $L(y, \hat{y}) = |y - \hat{y}|$

$F_0(\mathbf{x}) = \mathrm{median}\{y_i\}_1^N$
For m = 1 to M {
    // Step 1: find $T_m(\mathbf{x})$
    $\tilde{y}_i = \mathrm{sign}(y_i - F_{m-1}(\mathbf{x}_i))$
    $\{R_{jm}\}_1^J$ = J-terminal-node LS regression tree fit to $\{\tilde{y}_i, \mathbf{x}_i\}_1^N$
    // Step 2: find coefficients
    $\hat{\gamma}_{jm} = \mathrm{median}_{\mathbf{x}_i \in R_{jm}}\{y_i - F_{m-1}(\mathbf{x}_i)\}$,  $j = 1 \ldots J$
    // Update expansion
    $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot \sum_{j=1}^{J} \hat{\gamma}_{jm} I(\mathbf{x} \in R_{jm})$
}

•  More robust than $(y - F)^2$; resistant to outliers in y
–  …trees already provide resistance to outliers in x
•  Note:
–  Trees are fitted to the pseudo-response ⇒ can’t interpret individual trees
–  A “shrunk” version of the tree gets added to the ensemble
–  Original tree constants are overwritten
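gbm exposes this loss via distribution = "laplace"; a minimal sketch on ggplot2's diamonds data (an assumption: the tutorial's example.3.Diamonds scripts may use different variables and settings):

```r
library(gbm)
library(ggplot2)   # for the diamonds data set

set.seed(1)
dsub <- as.data.frame(diamonds[sample(nrow(diamonds), 5000), ])  # subsample for speed
fit  <- gbm(price ~ carat + cut + color + clarity, data = dsub,
            distribution = "laplace",   # absolute (LAD) loss
            n.trees = 1000, shrinkage = 0.1,
            bag.fraction = 0.5, interaction.depth = 4)
gbm.perf(fit, method = "OOB")   # absolute-loss curve vs. iteration
summary(fit)                    # relative variable importance
```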
Parallel vs. Sequential Ensembles
100 Target Functions Comparison (Popescu, 2005)

[Figure: comparative RMS error boxplots over 100 target functions — “parallel” ensembles (Bag, RF, Bag_6_5%_P, RF_6_5%_P) vs. “sequential” ones (Boost, Seq_0.01_20%_P, Seq_0.1_50%_P)]

•  xxx_6_5%_P: 6-terminal-node trees; 5% samples without replacement; post-processing – i.e., using estimated “optimal” quadrature coefficients
•  Seq_υ_η%_P: “sequential” ensemble; 6-terminal-node trees; υ: “memory” factor; η% samples without replacement; post-processing

•  Sequential ISLEs tend to perform better than parallel ones
–  Consistent with results observed in classical Monte Carlo integration
Rule Ensembles (Friedman & Popescu, 2005)
•  Trees as collections of conjunctive rules: $\hat{T}_m(\mathbf{x}) = \sum_{j=1}^{J} \hat{c}_{jm} I(\mathbf{x} \in R_{jm})$

[Diagram: a tree partitioning the (x1, x2) plane into regions R1–R5, with split points at x1 = 15, 22 and x2 = 15, 27]

R1 ⇒ $r_1(\mathbf{x}) = I(x_1 > 22) \cdot I(x_2 > 27)$
R2 ⇒ $r_2(\mathbf{x}) = I(x_1 > 22) \cdot I(0 \le x_2 \le 27)$
R3 ⇒ $r_3(\mathbf{x}) = I(15 < x_1 \le 22) \cdot I(0 \le x_2)$
R4 ⇒ $r_4(\mathbf{x}) = I(0 \le x_1 \le 15) \cdot I(x_2 > 15)$
R5 ⇒ $r_5(\mathbf{x}) = I(0 \le x_1 \le 15) \cdot I(0 \le x_2 \le 15)$

–  These simple rules, $r_m(\mathbf{x}) \in \{0, 1\}$, can be used as base learners
–  Main motivation is interpretability
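Once written out, these indicators are ordinary 0/1 features; a minimal sketch materializing the five rules above as columns of a design matrix (ready for the regularized fit on the next slide):

```r
# Materialize the rules r1(x)..r5(x) as 0/1 features
make_rules <- function(x1, x2) cbind(
  r1 = (x1 > 22) * (x2 > 27),
  r2 = (x1 > 22) * (x2 >= 0 & x2 <= 27),
  r3 = (x1 > 15 & x1 <= 22) * (x2 >= 0),
  r4 = (x1 >= 0 & x1 <= 15) * (x2 > 15),
  r5 = (x1 >= 0 & x1 <= 15) * (x2 >= 0 & x2 <= 15)
)

set.seed(1)
x1 <- runif(50, 0, 30); x2 <- runif(50, 0, 30)
R  <- make_rules(x1, x2)
rowSums(R)   # each point lies in exactly one terminal region: all 1s
```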
Rule Ensembles
ISLE Procedure
•  Rule-based model: $F(\mathbf{x}) = a_0 + \sum_m a_m r_m(\mathbf{x})$
–  Still a piecewise-constant model ⇒ complement the non-linear rules with purely linear terms: $F(\mathbf{x}) = a_0 + \sum_k a_k r_k(\mathbf{x}) + \sum_j b_j x_j$

•  Fitting
–  Step 1: derive rules from a tree ensemble (shortcut)
•  Tree size controls rule “complexity” (interaction order)
–  Step 2: fit coefficients using a regularized linear procedure:

$(\{\hat{a}_k\}, \{\hat{b}_j\}) = \arg\min_{\{a_k\},\{b_j\}} \sum_{i=1}^{N} L\left(y_i, F(\mathbf{x}_i; \{a_k\}_0^K, \{b_j\}_1^P)\right) + \lambda \cdot [P(\mathbf{a}) + P(\mathbf{b})]$
Boosting & Rule Ensembles
Hands-on Exercise

[Figure: absolute loss vs. iteration (0–1000) for the boosted model on the diamonds data]

•  Navigate to directory: example.3.Diamonds
•  Set working directory: use setwd() or the GUI
•  Load and run
–  viewDiamondData.R
–  fitModel_GBM.R
–  fitModel_RE.R
•  After class, go to example.1.LinearBoundary and run fitModel_GBM.R
Overview
•  Motivation, In a Nutshell & Timeline
•  Predictive Learning & Decision Trees
•  Ensemble Methods
Ø  Summary

Summary
•  Ensemble methods have been found to perform extremely
well in a variety of problem domains
•  Shown to have desirable statistical properties
•  Latest ensemble research brings together important
foundational strands of statistics
•  The emphasis has been on accuracy, but significant progress has also been made on interpretability
Go build Ensembles and keep in touch!

