A brief presentation given on the basics of Ensemble Methods. Given as a 'Lightning Talk' during the 7th Cohort of General Assembly's Data Science Immersive Course
4. Ensemble Methods
● In contrast to traditional modeling methods,
which train a single model on a set of data,
ensemble methods train multiple models and
aggregate their predictions to achieve
superior results
● In most cases, a single base learning algorithm
is used throughout the ensemble; this is called
a homogeneous ensemble
● In some cases it is useful to combine multiple
algorithms in a heterogeneous ensemble
5. The Statistical Issue
● It is often the case that the hypothesis space is too
large to explore for limited training data, and that
there may be several different hypotheses giving the
same accuracy on the training data. By "averaging"
predictions from multiple models, we can often
cancel out errors and get closer to the true function
we are seeking to learn.
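The effect of averaging can be shown with a toy sketch; the numbers below are made up for illustration. Three hypothetical models each miss the true value in different directions, and their average lands closer than any of them individually:

```python
# Three made-up predictions of the same true value, each off in a
# different direction.
true_value = 10.0
predictions = [9.2, 10.5, 10.6]

# Error of each individual model.
individual_errors = [abs(p - true_value) for p in predictions]

# Error of the "ensemble": the average of the three predictions.
ensemble_prediction = sum(predictions) / len(predictions)
ensemble_error = abs(ensemble_prediction - true_value)

# The averaged prediction is closer to the truth than any single model,
# because the individual errors partially cancel out.
print(individual_errors, ensemble_error)
```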
6. The Computational Issue
● It might be impossible to develop one model
that globally optimizes our objective function.
For instance, classification and regression
trees reach locally-optimal solutions, and
generalized linear models iterate toward a
solution that isn't guaranteed to be
globally optimal. Starting "local searches" at
different points and aggregating our
predictions might produce a better result.
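A minimal sketch of that idea, with an invented 1-D objective that has two local minima: a single greedy search can get stuck in the wrong basin, while restarting from several random points and keeping the best result (one simple way to combine the runs) tends to find the better one. All names and numbers here are illustrative.

```python
import random

def objective(x):
    # A made-up objective with local minima near x = -2 and x = +2;
    # the +x term makes the left basin the better one.
    return (x ** 2 - 4) ** 2 + x

def local_search(x, step=0.01, iters=2000):
    # Crude greedy descent: take a small step only if it lowers the objective.
    for _ in range(iters):
        for dx in (step, -step):
            if objective(x + dx) < objective(x):
                x += dx
                break
    return x

random.seed(42)
starts = [random.uniform(-3, 3) for _ in range(5)]
solutions = [local_search(s) for s in starts]
best = min(solutions, key=objective)  # keep the best of the restarted searches
```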
7. The Representational Issue
● Even with vast data and computing power,
it might still be impossible for one model
to exactly capture our objective function.
For example, a linear regression model can
never capture a relationship where a one-unit
change in X produces a different change in y
depending on the value of X. All models have
their shortcomings, but by creating multiple
models and aggregating their predictions it is
often possible to get closer to the objective
function.
8. Two Ensemble Paradigms - Parallel
● The base models are generated in
parallel and the results are aggregated
○ Bagging
○ Random Forests
9. Bootstrap Aggregated Decision Trees
(Bagging)
● From the original data of size N, bootstrap K samples each of size N (with replacement!).
● Build a classification or regression decision tree on each bootstrapped sample.
● Make predictions by passing a test observation through all K trees and developing one
aggregate prediction for that observation.
○ Discrete: In ensemble methods, we will most commonly predict a discrete y by
"plurality vote," where the most common class is the predicted value for a given
observation.
○ Continuous: In ensemble methods, we will most commonly predict a continuous y
by averaging the predicted values into one final prediction.
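The three steps above can be sketched in a few lines. To keep the sketch self-contained, a deliberately trivial base learner (predict the mean of the bootstrap sample) stands in for a decision tree; the data and names are made up.

```python
import random

random.seed(0)

def bootstrap(data, n):
    # Step 1: sample n observations WITH replacement from the original data.
    return [random.choice(data) for _ in range(n)]

def fit_base_model(sample):
    # Step 2 stand-in for a regression tree: always predict the sample's
    # mean target (a real tree would use the features).
    mean = sum(target for _, target in sample) / len(sample)
    return lambda x: mean

data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 4.1), (5, 4.8)]  # (x, y) pairs
K = 10
models = [fit_base_model(bootstrap(data, len(data))) for _ in range(K)]

# Step 3, continuous case: pass the test observation through all K models
# and average the K predictions into one final prediction.
prediction = sum(m(3) for m in models) / K
```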
10. Random Forests
● Bagging reduces the variance of individual decision trees, but the trees remain highly
correlated with one another.
● By "decorrelating" our trees from one another, we can drastically reduce the variance of
our model.
● Random forests differ from bagging decision trees in only one way: they use a modified
tree learning algorithm that selects, at each split in the learning process, a random
subset of the features. This process is sometimes called the random subspace method.
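The "random subset of the features at each split" step can be sketched as follows, using the common sqrt(n_features) heuristic for the subset size; the function name is illustrative, not from a particular library.

```python
import math
import random

random.seed(1)

def candidate_features(n_features):
    # At each split, only this random subset of feature indices may be
    # considered; sqrt(n_features) is a common default size.
    k = max(1, int(math.sqrt(n_features)))
    return random.sample(range(n_features), k)

# Two consecutive "splits" over 9 features will usually see different
# candidate sets, which is what decorrelates the trees.
split_1 = candidate_features(9)
split_2 = candidate_features(9)
```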
11. Two Ensemble Paradigms - Sequential
● The base models are generated
sequentially and the results are
updated based on the previous results
○ AdaBoost
○ Gradient Boosting
12. AdaBoost
● Instead of using deep/full decision trees as in bagging, boosting uses shallow/high-bias
base estimators.
● Iterative fitting is used to explain error/misclassification unexplained by the previous
base models and reduces bias without increasing variance.
● The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that
are only slightly better than random guessing, such as a single-split tree) on repeatedly
modified versions of the data. After each fit, the importance weights on each
observation need to be updated.
● The predictions are then combined through a weighted majority vote (or sum) to
produce the final prediction. AdaBoost, like all boosting ensemble methods, focuses the
next model's fit on the misclassifications/weaknesses of the prior models.
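The loop described above — fit a weak learner, weight it by its accuracy, then up-weight the observations it got wrong — can be sketched for 1-D data with threshold "stumps" as the weak learners. The dataset and all names here are invented for illustration.

```python
import math

X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [-1, -1, 1, -1, -1, 1, 1, 1, 1, 1]  # one "noisy" label at x = 2

def best_stump(X, y, w):
    # Weak learner: the threshold/sign pair with the lowest weighted error.
    best = None
    for t in range(len(X) + 1):
        for sign in (1, -1):
            pred = [sign if x < t else -sign for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

n = len(X)
w = [1.0 / n] * n          # start with uniform importance weights
ensemble = []              # list of (alpha, threshold, sign)

for _ in range(5):
    err, t, sign = best_stump(X, y, w)
    err = max(err, 1e-10)                    # guard against division by zero
    alpha = 0.5 * math.log((1 - err) / err)  # importance of this weak learner
    ensemble.append((alpha, t, sign))
    # Up-weight misclassified observations, down-weight correct ones.
    w = [wi * math.exp(-alpha * yi * (sign if x < t else -sign))
         for wi, x, yi in zip(w, X, y)]
    total = sum(w)
    w = [wi / total for wi in w]

def predict(x):
    # Weighted majority vote (here: weighted sum) over the weak learners.
    score = sum(a * (s if x < t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

accuracy = sum(predict(x) == yi for x, yi in zip(X, y)) / n
```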
13. Gradient Boosting
● Gradient boosting is similar to AdaBoost but
works by fitting the next model to the
residuals of the previous model. For instance:
● Suppose you start with a model F that gives you a
less than satisfactory result.
○ F(x1) = 0.8, while y1 = 0.9
● Each successive model corrects for the
distance between the true value and the
predicted value
● This process can be repeated until you have fit an
effective model.
● In other words, the residuals are interpreted
as negative gradients of the squared-error loss.
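The residual-fitting loop can be sketched as follows: start from a deliberately weak constant model F_0, then repeatedly fit a simple regression "stump" to the current residuals (the negative gradients of the squared-error loss) and add it to the ensemble. The data and names are made up for illustration.

```python
X = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

def fit_stump(X, r):
    # Regression stump: split at a threshold, predict the residual mean on
    # each side, and keep the split with the lowest squared error.
    best = None
    for t in X[1:]:
        left = [ri for xi, ri in zip(X, r) if xi < t]
        right = [ri for xi, ri in zip(X, r) if xi >= t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

# F_0: an unsatisfactory starting model (just the overall mean of y).
f0 = sum(y) / len(y)
stumps = []
pred = [f0] * len(y)

for _ in range(3):
    # The residuals are the negative gradients of the squared-error loss.
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(X, residuals)
    stumps.append(stump)
    # Add the new model's correction to the running prediction.
    pred = [pi + stump(xi) for pi, xi in zip(pred, X)]

mse0 = sum((yi - f0) ** 2 for yi in y) / len(y)   # error of F_0 alone
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
```

Each pass shrinks the gap between prediction and truth, which is exactly the F(x1) = 0.8 versus y1 = 0.9 correction described above.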
Editor's Notes
Ensemble methods can be useful in almost all areas where learning techniques are used. For example, computer vision has benefited much from ensemble methods in almost all branches such as object detection, recognition and tracking.
On September 21, 2009, Netflix awarded the $1M grand prize to the team BellKor’s Pragmatic Chaos, whose solution was based on combining various classifiers including asymmetric factor models, regression models, restricted Boltzmann machines, matrix factorization, k-nearest neighbor and more.
Dietterich (2000) identified three main issues with conventional modeling techniques: statistical, computational, and representational.
The true function might be outside the set of all possible decision trees: the actual process generating the observations may simply not follow the form of a decision tree.
As we have seen, decision trees are very powerful machine learning models. However, decision trees have some limitations. In particular, trees that are grown very deep tend to learn highly irregular patterns (a.k.a. they overfit their training sets).
Bagging (bootstrap aggregating) helps to mitigate this problem by exposing different trees to different sub-samples of the whole training set.
The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be used in many/all of the bagged decision trees, causing them to become correlated. By selecting a random subset of the features at each split, we counter this correlation between base trees, strengthening the overall model.
ExtraTrees (extremely randomized trees) take random forests one step further: when building each individual decision tree, the split threshold for each candidate feature at each node is also chosen at random rather than optimized.
Pros:
Achieves higher performance than bagging when the hyperparameters are properly tuned.
Works equally well for classification and regression.
Can use "robust" loss functions that make the model resistant to outliers.
Cons:
Difficult and time consuming to properly tune hyperparameters.
Cannot be parallelized like bagging (bad scalability when there are huge amounts of data).
Higher risk of overfitting compared to bagging.