A brief presentation given on the basics of Ensemble Methods. Given as a 'Lightning Talk' during the 7th Cohort of General Assembly's Data Science Immersive Course
4. Ensemble Methods
● In contrast to traditional modeling methods,
which train a single model on a set of data,
ensemble methods train multiple models and
aggregate their predictions to achieve
superior results
● In most cases, a single base learning algorithm
is used throughout the ensemble; this is called
a homogeneous ensemble
● In some cases it is useful to combine multiple
algorithms in a heterogeneous ensemble
5. The Statistical Issue
● It is often the case that the hypothesis space is too
large to explore for limited training data, and that
there may be several different hypotheses giving the
same accuracy on the training data. By "averaging"
predictions from multiple models, we can often
cancel out errors and get closer to the true function
we are seeking to learn.
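The effect of averaging can be shown with a toy sketch; the numbers below are made up for illustration. Three hypothetical models each miss the true value in different directions, and their average lands closer than any of them individually:

```python
# Three made-up predictions of the same true value, each off in a
# different direction.
true_value = 10.0
predictions = [9.2, 10.5, 10.6]

# Error of each individual model.
individual_errors = [abs(p - true_value) for p in predictions]

# Error of the "ensemble": the average of the three predictions.
ensemble_prediction = sum(predictions) / len(predictions)
ensemble_error = abs(ensemble_prediction - true_value)

# The averaged prediction is closer to the truth than any single model,
# because the individual errors partially cancel out.
print(individual_errors, ensemble_error)
```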
6. The Computational Issue
● It might be impossible to develop one model
that globally optimizes our objective function.
For instance, classification and regression
trees reach locally-optimal solutions, and
generalized linear models iterate toward a
solution that isn't guaranteed to be
globally optimal. Starting "local searches" at
different points and aggregating our
predictions might produce a better result.
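A minimal sketch of that idea, with an invented 1-D objective that has two local minima: a single greedy search can get stuck in the wrong basin, while restarting from several random points and keeping the best result (one simple way to combine the runs) tends to find the better one. All names and numbers here are illustrative.

```python
import random

def objective(x):
    # A made-up objective with local minima near x = -2 and x = +2;
    # the +x term makes the left basin the better one.
    return (x ** 2 - 4) ** 2 + x

def local_search(x, step=0.01, iters=2000):
    # Crude greedy descent: take a small step only if it lowers the objective.
    for _ in range(iters):
        for dx in (step, -step):
            if objective(x + dx) < objective(x):
                x += dx
                break
    return x

random.seed(42)
starts = [random.uniform(-3, 3) for _ in range(5)]
solutions = [local_search(s) for s in starts]
best = min(solutions, key=objective)  # keep the best of the restarted searches
```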
7. The Representational Issue
● Even with vast data and computing power,
it might still be impossible for one model
to exactly capture our objective function.
For example, a linear regression model can
never capture a relationship where a one-unit
change in X produces a different change in y
depending on the value of X. All models have
their shortcomings, but by creating multiple
models and aggregating their predictions it is
often possible to get closer to the objective
function.
8. Two Ensemble Paradigms - Parallel
● The base models are generated in
parallel and the results are aggregated
○ Bagging
○ Random Forests
9. Bootstrap Aggregated Decision Trees
(Bagging)
● From the original data of size N, bootstrap K samples each of size N (with replacement!).
● Build a classification or regression decision tree on each bootstrapped sample.
● Make predictions by passing a test observation through all K trees and developing one
aggregate prediction for that observation.
○ Discrete: In ensemble methods, we will most commonly predict a discrete y by
"plurality vote," where the most common class is the predicted value for a given
observation.
○ Continuous: In ensemble methods, we will most commonly predict a continuous y
by averaging the predicted values into one final prediction.
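The three steps above can be sketched in a few lines. To keep the sketch self-contained, a deliberately trivial base learner (predict the mean of the bootstrap sample) stands in for a decision tree; the data and names are made up.

```python
import random

random.seed(0)

def bootstrap(data, n):
    # Step 1: sample n observations WITH replacement from the original data.
    return [random.choice(data) for _ in range(n)]

def fit_base_model(sample):
    # Step 2 stand-in for a regression tree: always predict the sample's
    # mean target (a real tree would use the features).
    mean = sum(target for _, target in sample) / len(sample)
    return lambda x: mean

data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 4.1), (5, 4.8)]  # (x, y) pairs
K = 10
models = [fit_base_model(bootstrap(data, len(data))) for _ in range(K)]

# Step 3, continuous case: pass the test observation through all K models
# and average the K predictions into one final prediction.
prediction = sum(m(3) for m in models) / K
```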
10. Random Forests
● Bagging reduces the variance of individual decision trees, but the trees remain highly
correlated with one another.
● By "decorrelating" our trees from one another, we can drastically reduce the variance of
our model.
● Random forests differ from bagging decision trees in only one way: they use a modified
tree learning algorithm that selects, at each split in the learning process, a random
subset of the features. This process is sometimes called the random subspace method.
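The "random subset of the features at each split" step can be sketched as follows, using the common sqrt(n_features) heuristic for the subset size; the function name is illustrative, not from a particular library.

```python
import math
import random

random.seed(1)

def candidate_features(n_features):
    # At each split, only this random subset of feature indices may be
    # considered; sqrt(n_features) is a common default size.
    k = max(1, int(math.sqrt(n_features)))
    return random.sample(range(n_features), k)

# Two consecutive "splits" over 9 features will usually see different
# candidate sets, which is what decorrelates the trees.
split_1 = candidate_features(9)
split_2 = candidate_features(9)
```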
11. Two Ensemble Paradigms - Sequential
● The base models are generated
sequentially and the results are
updated based on the previous results
○ AdaBoost
○ Gradient Boosting
12. AdaBoost
● Instead of using deep/full decision trees as in bagging, boosting uses shallow/high-bias
base estimators.
● Iterative fitting is used to explain error/misclassification unexplained by the previous
base models and reduces bias without increasing variance.
● The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that
are only slightly better than random guessing, such as a single-split tree) on repeatedly
modified versions of the data. After each fit, the importance weights on each
observation need to be updated.
● The predictions are then combined through a weighted majority vote (or sum) to
produce the final prediction. AdaBoost, like all boosting ensemble methods, focuses the
next model's fit on the misclassifications/weaknesses of the prior models.
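The loop described above — fit a weak learner, weight it by its accuracy, then up-weight the observations it got wrong — can be sketched for 1-D data with threshold "stumps" as the weak learners. The dataset and all names here are invented for illustration.

```python
import math

X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [-1, -1, 1, -1, -1, 1, 1, 1, 1, 1]  # one "noisy" label at x = 2

def best_stump(X, y, w):
    # Weak learner: the threshold/sign pair with the lowest weighted error.
    best = None
    for t in range(len(X) + 1):
        for sign in (1, -1):
            pred = [sign if x < t else -sign for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

n = len(X)
w = [1.0 / n] * n          # start with uniform importance weights
ensemble = []              # list of (alpha, threshold, sign)

for _ in range(5):
    err, t, sign = best_stump(X, y, w)
    err = max(err, 1e-10)                    # guard against division by zero
    alpha = 0.5 * math.log((1 - err) / err)  # importance of this weak learner
    ensemble.append((alpha, t, sign))
    # Up-weight misclassified observations, down-weight correct ones.
    w = [wi * math.exp(-alpha * yi * (sign if x < t else -sign))
         for wi, x, yi in zip(w, X, y)]
    total = sum(w)
    w = [wi / total for wi in w]

def predict(x):
    # Weighted majority vote (here: weighted sum) over the weak learners.
    score = sum(a * (s if x < t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

accuracy = sum(predict(x) == yi for x, yi in zip(X, y)) / n
```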
13. Gradient Boosting
● Gradient boosting is similar to AdaBoost but
works by fitting the next model to the
residuals of the previous model. For instance:
● Suppose you start with a model F that gives you a
less than satisfactory result.
○ F(x1) = 0.8, while y1 = 0.9
● Each successive model corrects for the
distance between the true value and the
predicted value
● This process can be repeated until you have fit an
effective model.
● In other words, the residuals are interpreted
as negative gradients of the squared-error loss.
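The residual-fitting loop can be sketched as follows: start from a deliberately weak constant model F_0, then repeatedly fit a simple regression "stump" to the current residuals (the negative gradients of the squared-error loss) and add it to the ensemble. The data and names are made up for illustration.

```python
X = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

def fit_stump(X, r):
    # Regression stump: split at a threshold, predict the residual mean on
    # each side, and keep the split with the lowest squared error.
    best = None
    for t in X[1:]:
        left = [ri for xi, ri in zip(X, r) if xi < t]
        right = [ri for xi, ri in zip(X, r) if xi >= t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

# F_0: an unsatisfactory starting model (just the overall mean of y).
f0 = sum(y) / len(y)
stumps = []
pred = [f0] * len(y)

for _ in range(3):
    # The residuals are the negative gradients of the squared-error loss.
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(X, residuals)
    stumps.append(stump)
    # Add the new model's correction to the running prediction.
    pred = [pi + stump(xi) for pi, xi in zip(pred, X)]

mse0 = sum((yi - f0) ** 2 for yi in y) / len(y)   # error of F_0 alone
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
```

Each pass shrinks the gap between prediction and truth, which is exactly the F(x1) = 0.8 versus y1 = 0.9 correction described above.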
Editor's Notes
Ensemble methods can be useful in almost all areas where learning techniques are used. For example, computer vision has benefited much from ensemble methods in almost all branches such as object detection, recognition and tracking.
On September 21, 2009, Netflix awarded the $1M grand prize to the team BellKor’s Pragmatic Chaos, whose solution was based on combining various classifiers including asymmetric factor models, regression models, restricted Boltzmann machines, matrix factorization, k-nearest neighbor and more.
Dietterich (2000) identified three main issues with conventional modeling techniques: statistical, computational, and representational.
The true function might be outside the set of all possible decision trees: the actual process generating the observations may simply not follow the form of a decision tree.
As we have seen, decision trees are very powerful machine learning models. However, decision trees have some limitations. In particular, trees that are grown very deep tend to learn highly irregular patterns (a.k.a. they overfit their training sets).
Bagging (bootstrap aggregating) helps to mitigate this problem by exposing different trees to different sub-samples of the whole training set.
The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be used in many/all of the bagged decision trees, causing them to become correlated. By selecting a random subset of the features at each split, we counter this correlation between base trees, strengthening the overall model.
ExtraTrees (extremely randomized trees) take random forests one step further: when building each individual decision tree, the split threshold for each candidate feature at each node is also chosen at random rather than optimized.
Pros:
Achieves higher performance than bagging when the hyperparameters are properly tuned.
Works equally well for classification and regression.
Can use "robust" loss functions that make the model resistant to outliers.
Cons:
Difficult and time consuming to properly tune hyperparameters.
Cannot be parallelized like bagging (bad scalability when there are huge amounts of data).
Higher risk of overfitting compared to bagging.