Bagging, Boosting, Random Forest, AdaBoost
In supervised learning, our goal is to learn a predictor h(x) with high accuracy (low
error) from training data {(x1,y1),…,(xn,yn)}.
Decision Tree
No single classifier is perfect: examples that are misclassified by one classifier may
be correctly classified by other classifiers. We can exploit this by building an Ensemble
classifier. General Idea of the Ensemble Classifier:
The primary principle behind the ensemble model is that a group of weak learners comes
together to form a strong learner. An ensemble of classifiers combines several classifiers to
improve performance: it combines the classification results from the different
classifiers into a final output using unweighted (majority) voting or weighted voting.
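As a small illustration of the voting step, here is a minimal Python sketch; the class labels and classifier weights below are made up for the example:

```python
# Combining the (hypothetical) outputs of three classifiers by voting.
from collections import Counter

def unweighted_vote(predictions):
    """Majority vote: the class predicted by most classifiers wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Each classifier's vote counts in proportion to its weight."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

preds = ["spam", "ham", "spam"]                  # outputs of three different classifiers
print(unweighted_vote(preds))                    # -> "spam"
print(weighted_vote(preds, [0.2, 0.9, 0.3]))     # -> "ham" (the well-performing classifier dominates)
```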
Bias/Variance Tradeoff:
Ensemble methods that minimize variance
– Bagging
– Random Forests
Ensemble methods that minimize bias
– Functional Gradient Descent
– Boosting
– Ensemble Selection
o In Bagging, our goal is to reduce variance.
o Bagging combines many unstable predictors to produce a stable ensemble
predictor.
o We independently draw many training sets S’.
o We train a model on each S’ and finally average their predictions.
Bagging Algorithm
• Training
o Given a dataset S, at each iteration i a training set Si is sampled with replacement from S
(i.e. bootstrapping)
o A classifier Ci is learned from each Si
• Classification: given an unseen sample X,
o Each classifier Ci returns its class prediction
o The bagged classifier H counts the votes and assigns to X the class with the most votes
• Regression: bagging can also be applied to the prediction of continuous values by averaging
the predictions of the individual models (a code sketch of the classification case follows).
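A minimal sketch of this procedure, assuming scikit-learn decision trees as the base classifiers (the dataset and the number of classifiers are arbitrary illustrative choices):

```python
# Bagging: train one classifier per bootstrap sample, then take a majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

classifiers = []
for i in range(25):                               # one classifier Ci per bootstrap sample Si
    idx = rng.integers(0, len(X), size=len(X))    # sample with replacement (bootstrapping)
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Classification: each Ci votes, and the bagged classifier H takes the majority vote.
votes = np.stack([c.predict(X) for c in classifiers])   # shape: (25, n_samples)
H = (votes.mean(axis=0) > 0.5).astype(int)              # majority vote for binary labels 0/1
print("training accuracy of the bagged ensemble:", (H == y).mean())
```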
The Bagging Model
Bagging is more powerful than a single decision tree.
Bagging is used when our objective is to reduce the variance of a decision tree.
We create several subsets of the training data, each chosen randomly with
replacement. Each subset is then used to train its own decision tree, so
we end up with an ensemble of models. The average of all the predictions
from the numerous trees is used, which is more powerful than a single decision tree.
Random Forest is an extension of bagging. It takes one additional step: besides training
each tree on a random subset of the data, it also makes a random selection of features at
each split rather than using all features to grow the trees. A collection of such
randomized trees is called a Random Forest.
The following steps are taken to implement a Random Forest (a library-based sketch
follows these steps):
o Consider a training data set with X observations and Y features. First, a sample
is drawn at random from the training data set with replacement (a bootstrap sample).
o Each tree is grown to its largest possible size.
o These steps are repeated, and the prediction is based on the collection of
predictions from the n trees.
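A library-based sketch of these steps, assuming scikit-learn's RandomForestClassifier (the dataset and hyper-parameters are illustrative):

```python
# Random Forest: bagged trees plus a random subset of features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree sees a sample drawn with replacement
    random_state=0,
).fit(X_train, y_train)

# Prediction aggregates the votes of all the trees.
print("test accuracy:", forest.score(X_test, y_test))
```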
Advantages of the Random Forest technique:
o It handles high-dimensional data sets very well.
o It handles missing values and maintains accuracy when data are missing.
Disadvantages of the Random Forest technique:
Since the final prediction is the mean of the predictions from the individual trees, it does
not give precise continuous values for regression; in particular, it cannot predict values
outside the range seen in the training data.
Boosting is another ensemble procedure for building a collection of predictors. We fit
consecutive trees, usually on reweighted or resampled versions of the data, and at each
step the objective is to reduce the net error left by the prior trees.
If a given input is misclassified by the current hypothesis, its weight is increased so that
the next hypothesis is more likely to classify it correctly; combining the entire set of
hypotheses at the end converts weak learners into a better-performing model.
Gradient Boosting is an extension of the boosting procedure.
1. Gradient Boosting = Gradient Descent + Boosting
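A minimal sketch of that equation for regression with squared loss, assuming scikit-learn regression trees (the data, learning rate, and number of trees are illustrative): each new tree is fit to the residuals, i.e. the negative gradient of the loss, and added with a small step size.

```python
# Gradient boosting with squared loss: each tree is fit to the residuals left by
# the ensemble built so far, which is a gradient-descent step in function space.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())              # start from a constant model
trees = []
for _ in range(100):
    residuals = y - prediction                      # negative gradient of 0.5 * (y - f)^2 w.r.t. f
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)   # take a small step along the fitted gradient

print("training MSE:", np.mean((y - prediction) ** 2))
```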
Bagging vs. Boosting
– In Bagging, various training data subsets are randomly drawn with replacement from the whole training dataset. In Boosting, each new subset contains the examples that were misclassified by previous models.
– Bagging attempts to tackle the over-fitting issue. Boosting tries to reduce bias.
– If the classifier is unstable (high variance), apply Bagging. If the classifier is stable and simple (high bias), apply Boosting.
– In Bagging, every model receives an equal weight. In Boosting, models are weighted by their performance.
– The objective of Bagging is to decrease variance, not bias. The objective of Boosting is to decrease bias, not variance.
– Bagging is the simplest way of combining predictions that belong to the same type. Boosting is a way of combining predictions that belong to different types.
– In Bagging, every model is constructed independently. In Boosting, new models are affected by the performance of the previously built models.
Comparison between Bagging and Boosting
Bagging and Boosting are two types of ensemble learning. Both combine several estimates
from different models and thereby decrease the variance of a single estimate, so the
result may be a model with higher stability.
• If the problem with a single model is over-fitting, then Bagging is the best option.
• If the problem is that a single model gives very low performance, Boosting can generate
a combined model with lower error, as it optimizes the advantages and reduces the
pitfalls of the single model.
Similarities between Bagging and Boosting –
1. Both are ensemble methods that build N learners from a single base learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or by taking a majority vote).
4. Both are good at reducing variance and provide higher stability.
Differences Between Bagging and Boosting –
1. Bagging is the simplest way of combining predictions that belong to the same type. Boosting is a way of combining predictions that belong to different types.
2. Bagging aims to decrease variance, not bias. Boosting aims to decrease bias, not variance.
3. In Bagging, each model receives equal weight. In Boosting, models are weighted according to their performance.
4. In Bagging, each model is built independently. In Boosting, new models are influenced by the performance of previously built models.
5. In Bagging, different training data subsets are randomly drawn with replacement from the entire training dataset. In Boosting, every new subset contains the elements that were misclassified by previous models.
6. Bagging tries to solve the over-fitting problem. Boosting tries to reduce bias.
7. If the classifier is unstable (high variance), apply Bagging. If the classifier is stable and simple (high bias), apply Boosting.
8. A typical example of Bagging is Random Forest; a typical example of Boosting is Gradient Boosting.
Boosting is a general ensemble method that creates a strong classifier from a number of
weak classifiers. This is done by building a model from the training data, then creating a
second model that attempts to correct the errors from the first model. Models are added
until the training set is predicted perfectly or a maximum number of models has been added.
AdaBoost was the first really successful boosting algorithm developed for binary
classification. It is the best starting point for understanding boosting.
AdaBoost Ensemble
Weak models are added sequentially, trained using the weighted training data.
The process continues until a pre-set number of weak learners have been created (a user
parameter) or no further improvement can be made on the training dataset.
Once completed, you are left with a pool of weak learners each with a stage value.
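A minimal sketch of this sequential training loop with decision stumps, assuming scikit-learn is available and labels coded as -1/+1 (the dataset, number of rounds, and stump depth are illustrative choices):

```python
# AdaBoost training sketch: reweight the data after each weak learner and record stage values.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y01 - 1                                       # recode labels to -1 / +1

n_rounds = 50
weights = np.full(len(y), 1.0 / len(y))               # start with uniform sample weights
learners, stages = [], []
for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = weights[pred != y].sum()                    # weighted training error
    stage = 0.5 * np.log((1 - err) / (err + 1e-10))   # stage value: large when the stump is accurate
    weights *= np.exp(-stage * y * pred)              # increase the weight of misclassified examples
    weights /= weights.sum()
    learners.append(stump)
    stages.append(stage)
```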
Data Preparation for AdaBoost
This section lists some heuristics for best preparing your data for AdaBoost.
• Quality Data: Because the ensemble method continues to attempt to correct
misclassifications in the training data, you need to be careful that the training data is of high
quality.
• Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct for
cases that are unrealistic. These could be removed from the training dataset.
• Noisy Data: Noisy data, specifically noise in the output variable, can be problematic. If
possible, attempt to isolate and clean these examples from your training dataset.
Making Predictions with AdaBoost
Predictions are made by taking a weighted vote of the weak classifiers, i.e. a weighted
sum of their predictions.
For a new input instance, each weak learner calculates a predicted value of either +1.0 or
-1.0. The predicted values are weighted by each weak learner's stage value, and the prediction
of the ensemble model is taken as the sign of the sum of the weighted predictions: if the sum is
positive, the first class is predicted; if negative, the second class is predicted.
For example, five weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a
majority vote, it looks like the model will predict 1.0, the first class. But these same
five weak classifiers may have stage values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively.
Calculating the weighted sum of these predictions gives -0.8, so the ensemble prediction
is -1.0, the second class.
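The same worked example as a quick Python check:

```python
# Weighted vote for the five weak learners in the example above.
predictions = [1.0, 1.0, -1.0, 1.0, -1.0]
stage_values = [0.2, 0.5, 0.8, 0.2, 0.9]

weighted_sum = sum(s * p for s, p in zip(stage_values, predictions))
print(round(weighted_sum, 6))               # -0.8
print(1.0 if weighted_sum >= 0 else -1.0)   # ensemble predicts the second class (-1.0)
```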
