Machine Learning V
Random Forest - II
Random Forest
• Random forest is one of the most powerful and popular machine
learning algorithms because of its ensemble approach. It is built on
Bootstrap Aggregation, also called bagging.
• Before we get to bagging, let's take a quick look at what
bootstrapping is.
• Bootstrapping is a statistical technique for estimating a quantity from
a data sample. Here "quantity" means a descriptive statistic such as the
mean or the standard deviation.
For example, if we draw 3 bootstrap samples and their means are 4.5, 3 and 2.3,
then averaging these values gives a bootstrap estimate of the mean of about 3.27.
The same technique can also be used to estimate other
quantities like the standard deviation, quantiles, coefficients etc.
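A minimal sketch in R (base functions only) of bootstrapping the mean; the vector x and the 1,000 resamples are illustrative assumptions, not values from this deck:
> x <- c(12, 15, 9, 14, 11, 10, 13, 16)                     #a small illustrative sample
> boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
> mean(boot_means)      #bootstrap estimate of the mean
> sd(boot_means)        #bootstrap estimate of its standard error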
Bootstrap Aggregation (Bagging)
Bootstrap Aggregation, or Bagging for short, is an ensemble method
that combines the predictions from multiple models to make
more accurate predictions than any individual model.
In general it is used to reduce the variance of algorithms that are
prone to high variance, such as classification and regression trees.
When bagging with decision trees (DT) we are less concerned about individual trees
overfitting the model, so the individual trees are grown without
pruning. Each fully grown tree therefore has high variance and low bias.
These are exactly the properties we want in the sub-models, because averaging
many such trees produces a combined prediction with high accuracy.
The only parameter when bagging decision trees is the number of
bootstrap samples, i.e. the number of trees to include. The number of trees can be
increased until the accuracy stops improving.
Just as bagging is an improvement over a single decision tree, Random Forest
is an improvement over bagged decision trees.
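A hedged sketch of bagged decision trees in R using the randomForest package: setting mtry equal to the total number of predictors makes every split consider all variables, which is exactly bagging. The data frame train and its factor target target are assumed names, not from the deck:
> library(randomForest)
> p <- ncol(train) - 1                                   #number of predictor variables
> bag_model <- randomForest(target ~ ., data = train, ntree = 100, mtry = p)
> print(bag_model)                                       #includes the out-of-bag error estimate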
Random Forest
The problem with bagged decision trees is that every tree is free to choose the same
strong variables to split on, which in turn makes the trees' predictions highly correlated.
In a decision tree the learning algorithm selects the optimal split point by looking
through all the variables.
The random forest algorithm changes this setup so that at each split the learning
algorithm is limited to searching a random sample of the features.
The number of randomly selected features at each split (m) can be tuned using
=> m = sqrt(p) (for classification)
=> m = p/3 (for regression)
where m is the number of randomly selected features for each split and p is the total number of features.
For each bootstrap sample drawn from the training dataset, there will be observations left
behind that were not included in that sample; as we learned in the previous chapter, this
robustness to outliers and missing values comes at the cost of leaving some data out of each tree.
These left-out observations are called Out of Bag (OOB) samples.
The performance of each tree on its OOB samples, when averaged, provides an
additional, important estimate of the accuracy of the model. This is
called the OOB estimate of performance.
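A quick illustrative check in base R of why OOB samples exist: a bootstrap sample of n rows drawn with replacement leaves out roughly one third of the rows on average (the numbers here are illustrative):
> n <- 1000
> in_bag <- unique(sample(1:n, n, replace = TRUE))        #rows that made it into one bootstrap sample
> 1 - length(in_bag)/n                                    #fraction left out (OOB), roughly 0.37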
Bias-Variance Trade-off
Low Bias: the model accurately captures the patterns in its training data.
Low Variance: the model generalizes well to unseen data.
Example: a self-driving car (which needs both).
Practically it is difficult to have both Low Bias and Low Variance
simultaneously
Therefore
Low Bias (train dataset) → High Variance (test dataset)
High Bias (train dataset) → Low Variance (test dataset)
However, bagging can help reduce the variance.
The averaged prediction from N trees is the sum of the individual trees' predictions
divided by the total number of trees, and this averaging reduces the
variance without increasing the bias.
Averaging reduces variance without increasing bias: Var(x̄) = Var(x)/N
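A small simulated sketch in R of this variance reduction; the normal distribution, its standard deviation of 2 and N = 100 are illustrative assumptions:
> N <- 100
> single_pred   <- rnorm(10000, mean = 0, sd = 2)                       #predictions of one "tree"
> averaged_pred <- replicate(10000, mean(rnorm(N, mean = 0, sd = 2)))   #average of N independent "trees"
> var(single_pred)        #about 4
> var(averaged_pred)      #about 4/N = 0.04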
More on Random Forest Algorithm
As we know:
• When bagging with DT we are less concerned about individual trees
overfitting the model, so the individual trees are grown without pruning.
• Bagging utilizes the same full set of predictors to determine each split, and
in the end each tree has both high variance and low bias.
• Random Forest changes this setup so that at each split the learning
algorithm is limited to a random sample of the features/predictors.
The number of features/predictors to try at each split in Random Forest
is known as mtry.
More on Random Forest Algorithm
• Reducing the number of features/predictors considered at each node can
lower the accuracy, especially if only a few good predictors exist among
many non-informative ones.
• Splits are still chosen with a purity measure, for example
squared error / reduction in variance for regression, and
Gini or information gain for classification.
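A tiny illustrative R sketch of the Gini impurity measure for a classification node (the class proportions are made-up examples):
> gini <- function(p) 1 - sum(p^2)     #p = vector of class proportions in a node
> gini(c(0.5, 0.5))                    #0.50, a maximally impure binary node
> gini(c(0.9, 0.1))                    #0.18, a much purer node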
How to select N trees?
The number of trees can be increased until the accuracy stops
improving, in other words until the error no longer decreases.
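A hedged sketch of checking this in R, assuming rf_model is a fitted randomForest classification model (a hypothetical name; a full fit is sketched a few slides below):
> plot(rf_model)                            #error rate versus number of trees
> which.min(rf_model$err.rate[, "OOB"])     #number of trees where the OOB error bottoms out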
More on Random Forest Algorithm
More benefits of random forest over decision trees:
> RF does not require any pruning, and as we have already seen it uses its
OOB samples to obtain additional performance estimates.
> RF, with its default of 500 trees combined, gives higher accuracy than a single
tree, so it can be used across a variety of domains.
Parameter: mtry
Description: the number of randomly selected predictors/features used to make each split
Ideal value: sqrt(p) for classification, p/3 for regression

Parameter: ntree
Description: the number of trees to be built
Ideal value: 500 trees by default
More on Random Forest Algorithm
Steps on how to use and build a random forest model (a hedged end-to-end sketch follows the list):
1. Select the number of trees to be built, i.e. ntree = N (the default N is 500).
2. Draw a bagging (bootstrap) sample from the training dataset.
3. Define mtry, the number of randomly selected
predictors/features used to make each split.
4. Grow the trees until the model stops improving, in other words until the error no longer
decreases.
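A hedged end-to-end sketch of these steps with the randomForest package; the data frame train and its factor target target are assumed names, not from the deck:
> library(randomForest)
> set.seed(123)
> p <- ncol(train) - 1
> rf_model <- randomForest(target ~ ., data = train,
                           ntree = 500,               #step 1: number of trees
                           mtry = floor(sqrt(p)))     #step 3: features tried at each split
#step 2, the bagging/bootstrap sample for each tree, happens internally
> print(rf_model)          #step 4: check that the OOB error has stopped improving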
Model Validation
• In RF we don't need to create a separate test dataset for cross-validation:
each tree is trained on roughly two-thirds of the observations, and the remaining
third or so is used to assess that tree's performance.
• The OOB (Out Of Bag) samples therefore act as a built-in cross-validation of the
accuracy of a random forest model, as shown below.
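For example, the fitted model sketched above already carries an OOB-based confusion matrix (rf_model is the assumed name from that sketch):
> rf_model$confusion       #rows = actual class, with the OOB class error in the last column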
Model Validation Metrics
1) Confusion Matrix: one way of assessing model fit is to check how
often the model has predicted correctly. This is applicable to a binary
target variable.
In the confusion matrix we have:
True Positive (TP): the model predicts positive and the actual value is positive.
True Negative (TN): the model predicts negative and the actual value is negative.
False Positive (FP): the model predicts positive but the actual value is negative.
False Negative (FN): the model predicts negative but the actual value is positive.
The number of correct and incorrect events depends on the
cut-off probability, so the matrix will look different for different
cut-off values.
Example: the proportion of good customers in all the data is 0.399, which we use as the cut-off.
> table(dataset$target)/nrow(dataset)            #class proportions
> cutoff<-ifelse(pred >= 0.399, 1, 0)            #pred holds the predicted probabilities
To apply the confusion matrix (caret expects factors):
> library(caret)
> confusionMatrix(factor(cutoff), factor(testdata$target), positive = "1")
The resulting layout (Reference = actual class):
             Predicted 0   Predicted 1
Actual 0         TN            FP
Actual 1         FN            TP
Model Validation Metrics
2) Precision & Recall:
Precision indicates, when the model predicts yes, how often it is correct. In other
words, what proportion of the positive identifications was actually correct?
Precision = TP / Predicted Yes = TP / (TP + FP)
Recall answers what proportion of the actual positives was identified
correctly:
Recall = TP / (TP + FN)
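A minimal R sketch of these two formulas; the TP/FP/FN counts are made-up illustrative numbers, not results from the deck:
> TP <- 90; FP <- 10; FN <- 30
> precision <- TP/(TP + FP)      #0.90: of everything predicted positive, how much was right
> recall    <- TP/(TP + FN)      #0.75: of all actual positives, how much was found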
3) ROC Curve (Receiver Operating Characteristic Curve):
In statistics a ROC curve is a plot that
illustrates the performance of a binary classifier by plotting the
true positive rate against the false positive rate at various threshold
settings.
Model Validation Metrics
The true-positive rate is also known as sensitivity, recall or probability of
detection.
The false-positive rate is also known as the fall-out or probability of false
alarm and can be calculated as (1 − specificity).
>library(ROCR)
#score the training data with class probabilities; for a binary target keep the
#probability of the positive class (assumed here to be the second column)
>train_data$predicted<-predict(model2, type = "prob", newdata = train_data)[, 2]
>View(train_data)
>P<-prediction(train_data$predicted, train_data$target)
>class(P)
Interpreting the ROC curve
#"tpr" and "fpr" are arguments of ROCR's "performance" function, indicating that
#the plot is of the true positive rate against the false positive rate
>pref<-performance(P, "tpr", "fpr")
>class(pref)
#this object holds the TPR/FPR pair for every possible cut-off value
>plot(pref, col = "red")
#plot the diagonal reference line
>abline(0, 1, lty = 8, col = "grey")
If the TPR is consistently greater than the FPR (the curve lies above the diagonal),
the model is considered to be good.
#now convert the cut-off, FPR and TPR values into a data frame
>cutoff<-data.frame(cut = pref@alpha.values[[1]], fpr = pref@x.values[[1]], tpr = pref@y.values[[1]])
Interpreting the ROC curve
>cutoff<-cutoff[order(cutoff$tpr, decreasing = TRUE), ]
>View(cutoff)
#alternative way to summarize the curve: compute the AUC directly
>P<-prediction(bfn_train$predicted,bfn_train$target)
>class(P)
>auc1<-performance(P,"auc")
>auc1<-unlist(slot(auc1,"y.values"))
>auc1 #Now choose the model with highest AUC values.
Class Imbalance
For a classification target variable there is a common problem of class
imbalance.
For extremely imbalanced data, random forest will of course tend to
be biased towards the majority class. For example, the positive cases of
fraudulent transactions are far fewer than the non-fraudulent
(negative) cases.
Usually we have 2 methods to deal with imbalanced data (a sketch of both follows):
1) Assign different weights to the classes, so that the minority class gets
a higher weight to counter the bias towards the majority class.
2) Re-sample, taking relatively more samples of the positive (1) class than of the
negative (0) class. Note, however, that there is a trade-off: re-sampling can harm
the integrity of the data; in simple words, it can change the original
meaning of the population.
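A hedged sketch of both approaches with the randomForest package; train, its factor target target and the weight/sample numbers are assumptions for illustration:
> library(randomForest)
#1) class weights: give the minority class a higher weight (order follows levels(train$target))
> rf_w <- randomForest(target ~ ., data = train, ntree = 500, classwt = c(1, 10))
#2) stratified re-sampling: draw a balanced bootstrap sample for every tree
> n_min <- min(table(train$target))
> rf_s <- randomForest(target ~ ., data = train, ntree = 500,
                       strata = train$target, sampsize = c(n_min, n_min))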
Over fitting
Overfitting is a common problem for all machine learning models:
the learning algorithm continues to develop hypotheses
that reduce the training error at the cost of increasing the error on the test
dataset (unseen data).
Overfitting refers to a model that models the training data so well
that it cannot recognize the pattern in new, unseen
data, and hence it negatively impacts the performance of the model on new
data.
There are two common ways to avoid overfitting:
• Pre-pruning, which stops growing the tree early, before the splitting
criteria have fully fit the training dataset.
• Post-pruning, which grows the tree fully and then removes the branches
that add unnecessary growth without improving performance.
Post-pruning is usually preferred because estimating in advance
(i.e. pre-pruning) when to stop growing the tree is rarely
accurate. The good thing is that RF is resistant to overfitting.
More on Random Forest Algorithm
Getting the best number of trees for Random Forest is very important for
the final output.
The simple way to find the best number of trees is to build a validation
dataset (or use the OOB error) and evaluate the accuracy/error scores
as trees are added.
By default the number of trees in Random Forest is 500.
Random Forest
vs the Boosting machine learning algorithm
• With the same accuracy, RF outperforms boosting in speed when processing
big datasets, since RF searches only a subset of the features/predictors at each
split compared to boosting. However, for a continuous target variable
boosting outperforms Random Forest.
• Since Random Forest grows its trees in parallel, independently of each
other, RF can use parallel hardware such as multiple cores to
handle and process big datasets.
Random Forest
vs other machine learning algorithms
> Random Forest, with the combined predictive power of more than 500 trees,
gives accuracy comparable to high-end machine learning algorithms such as
neural networks.
> Random Forest requires less data pre-processing as it can handle
outliers and missing values.
> Its built-in method of cross-validating the accuracy (the OOB estimate) reduces the
workload of manual cross-validation.
> The boosting algorithm grows trees in series, where each tree depends on
the scores of the previous trees, while RF grows its trees in parallel,
independently of one another.
Next
Random Forest offers an important capability: feature selection and an explicit/implicit
ranking of the predictor variables, which helps us quickly understand which
features are affecting the final result and which are not.
Rupak Roy