Deepak George
Senior Data Scientist – Machine Learning
Decision Tree Ensembles
Bagging, Random Forest & Gradient Boosting Machines
December 2015
 Education
 Computer Science Engineering – College Of Engineering Trivandrum
 Business Analytics & Intelligence – Indian Institute Of Management Bangalore
 Career
 Mu Sigma
 Accenture Analytics
 Data Science
 1st Prize Best Data Science Project (BAI 5) – IIM Bangalore
 Top 10% finish (out of ~1,100 teams) in the Kaggle Coupon Purchase Prediction competition (recommender system)
 SAS Certified Statistical Business Analyst: Regression and Modeling Credentials
 Statistical Learning – Stanford University
 Passion
 Photography, Football, Data Science, Machine Learning
 Contact
 Deepak.george14@iimb.ernet.in
 linkedin.com/in/deepakgeorge7
About Me
Bias-Variance Tradeoff
Expected test MSE (its standard decomposition is sketched below)
 Bias
 Error introduced by approximating a complicated relationship with a much simpler model.
 The difference between the truth and what you expect to learn.
 Underfitting
 Variance
 The amount by which the model would change if we estimated it using a different training set.
 If a model has high variance, then small changes in the training data can result in large changes in the model.
 Overfitting
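The expected test MSE referenced above decomposes into these two components plus irreducible noise. A standard way to write it (usual textbook notation; the slide itself shows only the heading):

E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)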
Bias-Variance Tradeoff
[Figure: three fits illustrating Underfitting | Ideal Learner | Overfitting]
 Problem: Decision trees have low bias but suffer from high variance
 Goal: Reduce the variance of decision trees
 Hint: Given a set of n independent observations Z1, . . . , Zn, each with variance σ², the variance of the mean of the observations is σ²/n.
 In other words, averaging a set of observations reduces variance.
 Theoretically: Take multiple independent samples S1, S2, ..., Sn from the population
 Fit “bushy”/deep decision trees on each S1, S2, ..., Sn
 Trees are grown deep and are not pruned
 Variance reduces linearly & bias remains unchanged
 Practically: We only have one sample/training set, not the population.
 So take bootstrap samples, i.e. multiple samples drawn with replacement from the single sample
 Variance reduces sub-linearly & bias often increases slightly, because bootstrap samples are correlated.
 Final classifier: Average of predictions for regression, or majority vote for classification.
 The high variance introduced by deep decision trees is mitigated by averaging the predictions of the individual trees (see the R sketch after the illustration below).
Bagging
[Illustration: toy data tables (name, age, two binary attributes) showing, on one side, independent samples S1, S2, ..., Sn drawn from the Population and, on the other, bootstrap samples S1, S2, ..., Sn drawn with replacement from a single Sample; each panel is annotated with the expected loss L(h) = E_(x,y)~P(x,y)[ f(h(x), y) ].]
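A minimal sketch of this bagging procedure in R, using rpart trees on the Boston data (the data set and tree package are assumptions for illustration; the slide itself only shows the toy tables above). For classification the final step would be a majority vote rather than an average.

library(rpart)
library(MASS)   # Boston data frame

set.seed(42)
n_trees <- 100
n <- nrow(Boston)

# Fit one deep (unpruned) regression tree per bootstrap sample
trees <- lapply(seq_len(n_trees), function(b) {
  boot_idx <- sample(n, size = n, replace = TRUE)          # bootstrap sample
  rpart(medv ~ ., data = Boston[boot_idx, ],
        control = rpart.control(cp = 0, minsplit = 2))     # grow deep, do not prune
})

# Final regressor: average the predictions of all trees
bagged_pred <- rowMeans(sapply(trees, predict, newdata = Boston))
head(bagged_pred)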
Bootstrap sampling
A bootstrap sample should have the same sample size as the original sample.
Sampling with replacement results in repetition of values.
A bootstrap sample on average uses only about 2/3 of the data in the original sample (see the simulation sketch below).
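The 2/3 figure follows from the expected fraction of distinct observations in a bootstrap sample, 1 − (1 − 1/n)ⁿ ≈ 1 − e⁻¹ ≈ 0.632. A quick simulation sketch:

set.seed(1)
n <- 10000
frac_unique <- replicate(100, length(unique(sample(n, size = n, replace = TRUE))) / n)
mean(frac_unique)   # ≈ 0.632, i.e. roughly 2/3 of the original observations appear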
Random Forest
 Problem: Bagging still has relatively high variance
 Goal: Reduce the variance of bagging
 Solution: Along with sampling the data as in bagging, sample the features as well!
 In other words, when building a random forest, at each split in a tree we use only a random subset of features instead of all the features.
 This de-correlates the trees.
 It can be shown that √(number of predictors) is a good approximate value for the predictor subset size (mtry/max_features).
 Evaluation: A bootstrap sample uses only approximately 2/3 of the observations of the original sample.
 The remaining training data (out-of-bag, OOB) are used to estimate the error and variable importance (see the sketch below).
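A minimal sketch of the OOB evaluation and the √p predictor-subset heuristic, again assuming the Boston data used later in the deck:

library(randomForest)
library(MASS)   # Boston data frame

set.seed(7)
p <- ncol(Boston) - 1                          # number of predictors
rf_fit <- randomForest(medv ~ ., data = Boston,
                       mtry = floor(sqrt(p)),  # predictor subset size per split (√p heuristic)
                       ntree = 500,
                       importance = TRUE)

rf_fit                # printed summary includes the OOB estimate of the error
importance(rf_fit)    # OOB-based variable importance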
 Hyperparameters are knobs to control the bias–variance tradeoff of any machine learning algorithm.
 Key hyperparameters
 Max features – de-correlates the trees
 Number of trees in the forest – a higher number reduces variance further
Random Forest - Key Hyperparameters
Random Forest – R Implementation
library(randomForest)
library(MASS)   # contains the Boston data frame
library(caret)

View(Boston)

# Cross-validation setup
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 2, number = 5, allowParallel = TRUE)

# Grid search over mtry
rf.grid <- expand.grid(mtry = 2:13)

set.seed(1861)  # make reproducible here, but not if generating many random samples

# Hyperparameter tuning
rf_tune <- train(medv ~ .,
                 data = Boston,
                 method = "rf",
                 trControl = cv.ctrl,
                 tuneGrid = rf.grid,
                 ntree = 1000,
                 importance = TRUE)

# Cross-validation results
rf_tune
plot(rf_tune)

# Variable importance
varImp(rf_tune)
plot(varImp(rf_tune), top = 10)
Boosting
 Intuition: Ensemble many “weak” classifiers (typically decision trees) to produce a final “strong” classifier
 Weak classifier → error rate is only slightly better than random guessing.
 Boosting is a forward stagewise additive model
 Boosting sequentially applies the weak classifiers, one by one, to repeatedly reweighted versions of the data.
 Each new weak learner in the sequence tries to correct the misclassifications/errors made by the previous weak learners.
 Initially all of the weights are set to wi = 1/N
 At each successive step the observation weights are individually modified and a new weak learner is fitted to the reweighted observations.
 At step m, the observations that were misclassified by the classifier Gm−1(x) induced at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly.
 The final “strong” classifier is based on a weighted vote of the weak classifiers (the standard update equations are sketched below).
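The reweighting just described corresponds to the standard AdaBoost.M1 updates (standard textbook notation, not reproduced from the slide):

err_m = \frac{\sum_{i=1}^{N} w_i \, I\big(y_i \ne G_m(x_i)\big)}{\sum_{i=1}^{N} w_i}, \qquad
\alpha_m = \log\!\left(\frac{1 - err_m}{err_m}\right)

w_i \leftarrow w_i \cdot \exp\big(\alpha_m \, I\big(y_i \ne G_m(x_i)\big)\big), \qquad
G(x) = \operatorname{sign}\!\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)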
AdaBoost – Illustration
[Figure: two-dimensional toy dataset plotted on features X1 and X2]
Step 1
Input data
Initially all observations are assigned equal weight (1/N).
Observations that are misclassified in the i-th iteration are given higher weights in the (i+1)-th iteration.
Observations that are correctly classified in the i-th iteration are given lower weights in the (i+1)-th iteration.
Step 2
Step 3
AdaBoost – Illustration
Final Ensemble/Model
AdaBoost – Illustration
AdaBoost - Algorithm
 Generalizing AdaBoost to work with arbitrary loss functions resulted in GBM.
Gradient Boosting = Gradient Descent + Boosting
 GBM uses the gradient descent algorithm, which can optimize any differentiable loss function.
 In AdaBoost, “shortcomings” are identified by high-weight data points.
 In Gradient Boosting, “shortcomings” are identified by negative gradients (also called pseudo-residuals).
 In GBM, instead of the reweighting used in AdaBoost, each new tree is fit to the negative gradients of the model built so far.
 Each tree in GBM is a successive gradient descent step (a minimal sketch for squared-error loss follows below).
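For squared-error loss the negative gradient is simply the residual y − F(x), so “fitting trees to negative gradients” reduces to fitting each new tree to the current residuals. A minimal sketch (rpart weak learners and the Boston data are assumptions for illustration):

library(rpart)
library(MASS)   # Boston data frame

set.seed(3)
M  <- 200    # number of boosting rounds (trees)
nu <- 0.1    # learning rate / shrinkage

boost_df <- Boston
F_hat <- rep(mean(boost_df$medv), nrow(boost_df))    # start from a constant model

for (m in seq_len(M)) {
  boost_df$resid <- boost_df$medv - F_hat            # negative gradient of squared loss
  tree <- rpart(resid ~ . - medv, data = boost_df,   # small tree = weak learner
                control = rpart.control(maxdepth = 2))
  F_hat <- F_hat + nu * predict(tree, boost_df)      # one gradient descent step
}

mean((boost_df$medv - F_hat)^2)   # training MSE after M rounds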
Gradient Boosting Machines
 AdaBoost is equivalent to forward stagewise additive modeling using the exponential loss function (written out below).
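For labels y ∈ {−1, +1}, the exponential loss referred to here is:

L\big(y, f(x)\big) = \exp\big(-y\, f(x)\big)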
Gradient Boosting - Algorithm
 GBM has three types of hyperparameters (a gbm-based sketch follows this list)
 Tree structure
 Max depth of the trees – controls the degree of feature interactions
 Min samples per leaf – minimum number of samples in a leaf node
 Number of trees
 Shrinkage
 Learning rate – slows learning by shrinking the tree predictions.
 Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly.
 Stochastic Gradient Boosting
 Subsample: select a random subset of the training set for fitting each tree, rather than using the complete training data.
 Max features: select a random subset of features for each tree.
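A minimal sketch mapping these knobs onto the gbm package's arguments (an illustrative choice of package and values; the deck's own example on the next slide uses xgboost via caret, where colsample_bytree plays the “max features” role that gbm lacks):

library(gbm)
library(MASS)   # Boston data frame

set.seed(11)
gbm_fit <- gbm(medv ~ ., data = Boston,
               distribution = "gaussian",
               n.trees = 2000,            # number of trees
               interaction.depth = 4,     # max depth -> degree of feature interactions
               n.minobsinnode = 10,       # min samples per leaf
               shrinkage = 0.01,          # learning rate
               bag.fraction = 0.8,        # subsample fraction (stochastic gradient boosting)
               cv.folds = 5)

best_iter <- gbm.perf(gbm_fit, method = "cv")   # choose the number of trees by CV
summary(gbm_fit, n.trees = best_iter)           # relative variable influence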
GBM – Key Hyperparameters
Tree Ensembles – Interpretation
library(xgboost)
library(MASS)   # contains the Boston data frame
library(caret)

# Cross-validation setup
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 2, number = 5, allowParallel = TRUE)

# Grid search over learning rate and tree depth
xgb.grid <- expand.grid(nrounds = 1000,
                        eta = c(0.005, 0.01, 0.05, 0.1),
                        max_depth = c(4, 5, 6, 7, 8))

set.seed(1860)

# Model training
xgb_tune <- train(medv ~ .,
                  data = Boston,
                  method = "xgbTree",
                  trControl = cv.ctrl,
                  tuneGrid = xgb.grid,
                  importance = TRUE,
                  subsample = 0.8)

# Cross-validation results
xgb_tune
plot(xgb_tune)

# Variable importance
plot(varImp(xgb_tune), top = 10)
GBM – R Implementation
End
Questions?
