Deepak George
Senior Data Scientist – Machine Learning
Decision Tree Ensembles
Bagging, Random Forest & Gradient Boosting Machines
December 2015
 Education
 Computer Science Engineering – College Of Engineering Trivandrum
 Business Analytics & Intelligence – Indian Institute Of Management Bangalore
 Career
 Mu Sigma
 Accenture Analytics
 Data Science
 1st Prize Best Data Science Project (BAI 5) – IIM Bangalore
 Top 10% finish (out of ~1,100 teams) in the Kaggle Coupon Purchase Prediction competition (recommender system)
 SAS Certified Statistical Business Analyst: Regression and Modeling Credentials
 Statistical Learning – Stanford University
 Passion
 Photography, Football, Data Science, Machine Learning
 Contact
 Deepak.george14@iimb.ernet.in
 linkedin.com/in/deepakgeorge7
About Me
Bias-Variance Tradeoff
Expected test MSE (its standard decomposition is sketched below)
 Bias
 Error introduced by approximating a complicated relationship with a much simpler model.
 The difference between the truth and what you expect to learn.
 Underfitting
 Variance
 The amount by which the model would change if we estimated it using a different training set.
 If a model has high variance, then small changes in the training data can result in large changes in the model.
 Overfitting
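The expected test MSE referenced above decomposes into these two components plus irreducible noise. A standard way to write it (usual textbook notation; the slide itself shows only the heading):

E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)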
Bias-Variance Tradeoff
[Figure: three fits illustrating Underfitting | Ideal Learner | Overfitting]
 Problem: Decision trees have low bias but suffer from high variance
 Goal: Reduce the variance of decision trees
 Hint: Given a set of n independent observations Z1, . . . , Zn, each with variance σ², the variance of the mean of the observations is σ²/n.
 In other words, averaging a set of observations reduces variance.
 Theoretically: Take multiple independent samples S1, S2, ..., Sn from the population
 Fit “bushy”/deep decision trees on each S1, S2, ..., Sn
 Trees are grown deep and are not pruned
 Variance reduces linearly & bias remains unchanged
 Practically: We only have one sample/training set, not the population.
 So take bootstrap samples, i.e. multiple samples drawn with replacement from the single sample
 Variance reduces sub-linearly & bias often increases slightly, because bootstrap samples are correlated.
 Final classifier: Average of predictions for regression, or majority vote for classification.
 The high variance introduced by deep decision trees is mitigated by averaging the predictions of the individual trees (see the R sketch after the illustration below).
Bagging
[Illustration: toy data tables (name, age, two binary attributes) showing, on one side, independent samples S1, S2, ..., Sn drawn from the Population and, on the other, bootstrap samples S1, S2, ..., Sn drawn with replacement from a single Sample; each panel is annotated with the expected loss L(h) = E_(x,y)~P(x,y)[ f(h(x), y) ].]
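A minimal sketch of this bagging procedure in R, using rpart trees on the Boston data (the data set and tree package are assumptions for illustration; the slide itself only shows the toy tables above). For classification the final step would be a majority vote rather than an average.

library(rpart)
library(MASS)   # Boston data frame

set.seed(42)
n_trees <- 100
n <- nrow(Boston)

# Fit one deep (unpruned) regression tree per bootstrap sample
trees <- lapply(seq_len(n_trees), function(b) {
  boot_idx <- sample(n, size = n, replace = TRUE)          # bootstrap sample
  rpart(medv ~ ., data = Boston[boot_idx, ],
        control = rpart.control(cp = 0, minsplit = 2))     # grow deep, do not prune
})

# Final regressor: average the predictions of all trees
bagged_pred <- rowMeans(sapply(trees, predict, newdata = Boston))
head(bagged_pred)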
Bootstrap sampling
A bootstrap sample should have the same sample size as the original sample.
Sampling with replacement results in repetition of values.
A bootstrap sample on average uses only about 2/3 of the data in the original sample (see the simulation sketch below).
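The 2/3 figure follows from the expected fraction of distinct observations in a bootstrap sample, 1 − (1 − 1/n)ⁿ ≈ 1 − e⁻¹ ≈ 0.632. A quick simulation sketch:

set.seed(1)
n <- 10000
frac_unique <- replicate(100, length(unique(sample(n, size = n, replace = TRUE))) / n)
mean(frac_unique)   # ≈ 0.632, i.e. roughly 2/3 of the original observations appear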
Random Forest
 Problem: Bagging still has relatively high variance
 Goal: Reduce the variance of bagging
 Solution: Along with sampling the data as in bagging, sample the features as well!
 In other words, when building a random forest, at each split in a tree we use only a random subset of features instead of all the features.
 This de-correlates the trees.
 It can be shown that √(number of predictors) is a good approximate value for the predictor subset size (mtry/max_features).
 Evaluation: A bootstrap sample uses only approximately 2/3 of the observations of the original sample.
 The remaining training data (out-of-bag, OOB) are used to estimate the error and variable importance (see the sketch below).
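A minimal sketch of the OOB evaluation and the √p predictor-subset heuristic, again assuming the Boston data used later in the deck:

library(randomForest)
library(MASS)   # Boston data frame

set.seed(7)
p <- ncol(Boston) - 1                          # number of predictors
rf_fit <- randomForest(medv ~ ., data = Boston,
                       mtry = floor(sqrt(p)),  # predictor subset size per split (√p heuristic)
                       ntree = 500,
                       importance = TRUE)

rf_fit                # printed summary includes the OOB estimate of the error
importance(rf_fit)    # OOB-based variable importance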
 Hyperparameters are knobs to control the bias–variance tradeoff of any machine learning algorithm.
 Key hyperparameters
 Max features – de-correlates the trees
 Number of trees in the forest – a higher number reduces variance further
Random Forest - Key Hyperparameters
Random Forest – R Implementation
library(randomForest)
library(MASS)   # contains the Boston data frame
library(caret)

View(Boston)

# Cross-validation setup
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 2, number = 5, allowParallel = TRUE)

# Grid search over mtry
rf.grid <- expand.grid(mtry = 2:13)

set.seed(1861)  # make reproducible here, but not if generating many random samples

# Hyperparameter tuning
rf_tune <- train(medv ~ .,
                 data = Boston,
                 method = "rf",
                 trControl = cv.ctrl,
                 tuneGrid = rf.grid,
                 ntree = 1000,
                 importance = TRUE)

# Cross-validation results
rf_tune
plot(rf_tune)

# Variable importance
varImp(rf_tune)
plot(varImp(rf_tune), top = 10)
Boosting
 Intuition: Ensemble many “weak” classifiers (typically decision trees) to produce a final “strong” classifier
 Weak classifier → error rate is only slightly better than random guessing.
 Boosting is a forward stagewise additive model
 Boosting sequentially applies the weak classifiers, one by one, to repeatedly reweighted versions of the data.
 Each new weak learner in the sequence tries to correct the misclassifications/errors made by the previous weak learners.
 Initially all of the weights are set to wi = 1/N
 At each successive step the observation weights are individually modified and a new weak learner is fitted to the reweighted observations.
 At step m, the observations that were misclassified by the classifier Gm−1(x) induced at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly.
 The final “strong” classifier is based on a weighted vote of the weak classifiers (the standard update equations are sketched below).
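The reweighting just described corresponds to the standard AdaBoost.M1 updates (standard textbook notation, not reproduced from the slide):

err_m = \frac{\sum_{i=1}^{N} w_i \, I\big(y_i \ne G_m(x_i)\big)}{\sum_{i=1}^{N} w_i}, \qquad
\alpha_m = \log\!\left(\frac{1 - err_m}{err_m}\right)

w_i \leftarrow w_i \cdot \exp\big(\alpha_m \, I\big(y_i \ne G_m(x_i)\big)\big), \qquad
G(x) = \operatorname{sign}\!\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)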
AdaBoost – Illustration
[Figure: two-dimensional toy dataset plotted on features X1 and X2]
Step 1
Input data
Initially all observations are assigned equal weight (1/N).
Observations that are misclassified in the i-th iteration are given higher weights in the (i+1)-th iteration.
Observations that are correctly classified in the i-th iteration are given lower weights in the (i+1)-th iteration.
Step 2
Step 3
AdaBoost – Illustration
Final Ensemble/Model
AdaBoost – Illustration
AdaBoost - Algorithm
 Generalizing AdaBoost to work with arbitrary loss functions resulted in GBM.
Gradient Boosting = Gradient Descent + Boosting
 GBM uses the gradient descent algorithm, which can optimize any differentiable loss function.
 In AdaBoost, “shortcomings” are identified by high-weight data points.
 In Gradient Boosting, “shortcomings” are identified by negative gradients (also called pseudo-residuals).
 In GBM, instead of the reweighting used in AdaBoost, each new tree is fit to the negative gradients of the model built so far.
 Each tree in GBM is a successive gradient descent step (a minimal sketch for squared-error loss follows below).
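For squared-error loss the negative gradient is simply the residual y − F(x), so “fitting trees to negative gradients” reduces to fitting each new tree to the current residuals. A minimal sketch (rpart weak learners and the Boston data are assumptions for illustration):

library(rpart)
library(MASS)   # Boston data frame

set.seed(3)
M  <- 200    # number of boosting rounds (trees)
nu <- 0.1    # learning rate / shrinkage

boost_df <- Boston
F_hat <- rep(mean(boost_df$medv), nrow(boost_df))    # start from a constant model

for (m in seq_len(M)) {
  boost_df$resid <- boost_df$medv - F_hat            # negative gradient of squared loss
  tree <- rpart(resid ~ . - medv, data = boost_df,   # small tree = weak learner
                control = rpart.control(maxdepth = 2))
  F_hat <- F_hat + nu * predict(tree, boost_df)      # one gradient descent step
}

mean((boost_df$medv - F_hat)^2)   # training MSE after M rounds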
Gradient Boosting Machines
 AdaBoost is equivalent to forward stagewise additive modeling using the exponential loss function (written out below).
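For labels y ∈ {−1, +1}, the exponential loss referred to here is:

L\big(y, f(x)\big) = \exp\big(-y\, f(x)\big)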
Gradient Boosting - Algorithm
 GBM has three types of hyperparameters (a gbm-based sketch follows this list)
 Tree structure
 Max depth of the trees – controls the degree of feature interactions
 Min samples per leaf – minimum number of samples in a leaf node
 Number of trees
 Shrinkage
 Learning rate – slows learning by shrinking the tree predictions.
 Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly.
 Stochastic Gradient Boosting
 Subsample: select a random subset of the training set for fitting each tree, rather than using the complete training data.
 Max features: select a random subset of features for each tree.
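A minimal sketch mapping these knobs onto the gbm package's arguments (an illustrative choice of package and values; the deck's own example on the next slide uses xgboost via caret, where colsample_bytree plays the “max features” role that gbm lacks):

library(gbm)
library(MASS)   # Boston data frame

set.seed(11)
gbm_fit <- gbm(medv ~ ., data = Boston,
               distribution = "gaussian",
               n.trees = 2000,            # number of trees
               interaction.depth = 4,     # max depth -> degree of feature interactions
               n.minobsinnode = 10,       # min samples per leaf
               shrinkage = 0.01,          # learning rate
               bag.fraction = 0.8,        # subsample fraction (stochastic gradient boosting)
               cv.folds = 5)

best_iter <- gbm.perf(gbm_fit, method = "cv")   # choose the number of trees by CV
summary(gbm_fit, n.trees = best_iter)           # relative variable influence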
GBM – Key Hyperparameters
Tree Ensembles – Interpretation
library(xgboost)
library(MASS)   # contains the Boston data frame
library(caret)

# Cross-validation setup
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 2, number = 5, allowParallel = TRUE)

# Grid search over learning rate and tree depth
xgb.grid <- expand.grid(nrounds = 1000,
                        eta = c(0.005, 0.01, 0.05, 0.1),
                        max_depth = c(4, 5, 6, 7, 8))

set.seed(1860)

# Model training
xgb_tune <- train(medv ~ .,
                  data = Boston,
                  method = "xgbTree",
                  trControl = cv.ctrl,
                  tuneGrid = xgb.grid,
                  importance = TRUE,
                  subsample = 0.8)

# Cross-validation results
xgb_tune
plot(xgb_tune)

# Variable importance
plot(varImp(xgb_tune), top = 10)
GBM – R Implementation
End
Questions?
