Ensemble Learning: The Wisdom of Crowds (of Machines)

Ensemble Learning:
The Wisdom of Crowds
(of Machines)
Lior Rokach
Department of Information Systems Engineering
Ben-Gurion University of the Negev

About Me
Prof. Lior Rokach
Department of Information Systems Engineering
Faculty of Engineering Sciences
Head of the Machine Learning Lab
Ben-Gurion University of the Negev

Email: liorrk@bgu.ac.il
http://www.ise.bgu.ac.il/faculty/liorr/

PhD (2004) from Tel Aviv University

The Condorcet Jury Theorem
• If each voter has a probability p of being correct
and the probability of a majority of voters being
correct is M,
• then p > 0.5 implies M > p.
• Also M approaches 1, for all p > 0.5 as the number
of voters approaches infinity.
• This theorem was proposed by the Marquis of
Condorcet in 1784

Francis Galton
• Galton promoted statistics and invented the
concept of correlation.
• In 1906 Galton visited a livestock fair and
stumbled upon an intriguing contest.
• An ox was on display, and the villagers were
invited to guess the animal's weight.
• Nearly 800 gave it a go and, not surprisingly,
not one hit the exact mark: 1,198 pounds.
• Astonishingly, however, the average of those
800 guesses came close - very close indeed. It
was 1,197 pounds.

The Wisdom of Crowds
Why the Many Are Smarter Than the Few and How Collective Wisdom

Shapes Business, Economies, Societies and Nations

• Under certain controlled conditions, the
aggregation of information in groups,
resulting in decisions that are often
superior to those that can been made by
any single - even experts.
• Imitates our second nature to seek several
opinions before making any crucial
decision. We weigh the individual
opinions, and combine them to reach a
final decision

Committees of Experts
– ― … a medical school that has the objective that all students,
given a problem, come up with an identical solution‖

• There is not much point in setting up a committee of
experts from such a group - such a committee will not
improve on the judgment of an individual.

• Consider:
– There needs to be disagreement for the committee to have
the potential to be better than an individual.

Does it always work?
• Not all crowds (groups) are wise.
– Example: crazed investors in a stock market
bubble.

Key Criteria
• Diversity of opinion
– Each person should have private information even if it's just an
eccentric interpretation of the known facts.
• Independence
– People's opinions aren't determined by the opinions of those
around them.
• Decentralization
– People are able to specialize and draw on local knowledge.
• Aggregation
– Some mechanism exists for turning private judgments into a
collective decision.

Teaser: How good are ensemble methods?

Let’s look at the Netflix Prize Competition…

Began October 2006

• Supervised learning task
– Training data is a set of users and ratings (1,2,3,4,5
stars) those users have given to movies.
– Construct a classifier that given a user and an unrated
movie, correctly classifies that movie as either 1, 2, 3,
4, or 5 stars

• $1 million prize for a 10% improvement over
Netflix’s current movie recommender/classifier
(MSE = 0.9514)

Learning biases
• Occam’s razor
―among the theories that are consistent with the data,
select the simplest one‖.

• Epicurus’ principle
―keep all theories that are consistent with the data,‖
[not necessarily with equal weights]
E.g. Bayesian learning
Ensemble learning
12

Strong and Weak Learners
• Strong (PAC) Learner
– Take labeled data for training
– Produce a classifier which can be arbitrarily
accurate
– Objective of machine learning
• Weak (PAC) Learner
– Take labeled data for training
– Produce a classifier which is more accurate
than random guessing

Ensembles of classifiers
• Given some training data
Dtrain x n , yn ; n 1,, N train
• Inductive learning
L: Dtrain h( ), where h( ): X Y
• Ensemble learning
L1: Dtrain h1( )
L2: Dtrain h2( )
... Ensemble:
LT: Dtrain hT ( ) {h1( ), h2( ), ... , hT ( )}
14

Classification by majority voting
New Instance: x

T=7 classifiers

1 1 1 2 1 2 1

Accumulated votes: t 5 2
0
1
2
3 0
4 1 Final class: 1
t1 t2

Alberto Suárez (2012)
15

Boosting
• Learners
– Strong learners are very difficult to construct
– Constructing weaker Learners is relatively easy
• Strategy
– Derive strong learner from weak learner
– Boost weak classifiers to a strong learner

Construct Weak Classifiers
• Using Different Data Distribution
– Start with uniform weighting
– During each step of learning
• Increase weights of the examples which are not
correctly learned by the weak learner
• Decrease weights of the examples which are
correctly learned by the weak learner
• Idea
– Focus on difficult examples which are not
correctly classified in the previous steps

Combine Weak Classifiers
• Weighted Voting
– Construct strong classifier by weighted voting
of the weak classifiers
• Idea
– Better weak classifier gets a larger weight
– Iteratively add weak classifiers
• Increase accuracy of the combined classifier through
minimization of a cost function

AdaBoost (Adaptive Boosting)
(Freund and Schapire, 1997)

Generate a
sequence of
base-learners
each focusing
on previous
one’s errors
(Freund and
Schapire,
1996)

Example

Training Combined classifier

Example of a Good Classifier
+
+ +
+

+

The initial distribution
Train data
x1 x2 y D1
1 5 + 0.10
2 3 + 0.10
3 2 0.10
4 6 0.10
4 7 + 0.10
5 9 + 0.10
6 5 0.10
6 7 + 0.10
8 5 0.10
8 8 0.10
1.00

Initialization

Round 1 of 3
+O +
+O +O + +
+ +

+ +

h1 1 = 0.30 D2
1=0.42

How the distribution has changed?
Train data Round 1
x1 x2 y D1 h1e D2
1 5 + 0.10 0 0.00 0.07
2 3 + 0.10 0 0.00 0.07
3 2 0.10 0 0.00 0.07
4 6 0.10 0 0.00 0.07
4 7 + 0.10 1 0.10 0.17
5 9 + 0.10 1 0.10 0.17
6 5 0.10 0 0.00 0.07
6 7 + 0.10 1 0.10 0.17
8 5 0.10 0 0.00 0.07
8 8 0.10 0 0.00 0.07
1.00 0.30 1.00
0.42 
Zt 0.92

Round 2 of 3
+ +
+O + + +
+ O +

+ O
+

2 = 0.21 h2 D2
2=0.65

Train data Round 1 Round 2
x1 x2 y D1 h1e D2 h2e D3
1 5 + 0.10 0 0.00 0.07 0 0.00 0.05
2 3 + 0.10 0 0.00 0.07 0 0.00 0.05
3 2 0.10 0 0.00 0.07 1 0.07 0.17
4 6 0.10 0 0.00 0.07 1 0.07 0.17
4 7 + 0.10 1 0.10 0.17 0 0.00 0.11
5 9 + 0.10 1 0.10 0.17 0 0.00 0.11
6 5 0.10 0 0.00 0.07 1 0.07 0.17
6 7 + 0.10 1 0.10 0.17 0 0.00 0.11
8 5 0.10 0 0.00 0.07 0 0.00 0.05
8 8 0.10 0 0.00 0.07 0 0.00 0.05
1.00 0.30 1.00 0.21 1.00
0.42  0.65 
Zt 0.92 Zt 0.82

Round 3 of 3
+ O
+ + h3
+ O STOP

+ O

3 = 0.14

3=0.92

Train data Round 1 Round 2 Round 3
x1 x2 y D1 h1e D2 h2e D3 h3e
1 5 + 0.10 0 0.00 0.07 0 0.00 0.05 1 0.05
2 3 + 0.10 0 0.00 0.07 0 0.00 0.05 1 0.05
3 2 0.10 0 0.00 0.07 1 0.07 0.17 0 0.00
4 6 0.10 0 0.00 0.07 1 0.07 0.17 0 0.00
4 7 + 0.10 1 0.10 0.17 0 0.00 0.11 0 0.00
5 9 + 0.10 1 0.10 0.17 0 0.00 0.11 0 0.00
6 5 0.10 0 0.00 0.07 1 0.07 0.17 0 0.00
6 7 + 0.10 1 0.10 0.17 0 0.00 0.11 0 0.00
8 5 0.10 0 0.00 0.07 0 0.00 0.05 0 0.00
8 8 0.10 0 0.00 0.07 0 0.00 0.05 1 0.05
1.00 0.30 1.00 0.21 1.00 0.14
0.42  0.65  0.92
Zt 0.92 Zt 0.82
Initialization
Importance of each learner

Final Hypothesis

0.42 + 0.65 + 0.92

Hfinal = sign[ 0.42(h1? 1|-1) + 0.65(h2? 1|-1) + 0.92(h3? 1|-1) ]

+
+ +
+
+

Training Errors vs Test Errors
Performance on ‘letter’ dataset
(Schapire et al. 1997)

Test
error
Training
error

Training error drops to 0 on round 5

Test error continues to drop after round 5
(from 8.4% to 3.1%)

Adaboost Variants Proposed By
Friedman
• LogitBoost

Adaboost Variants Proposed By
Friedman
• GentleBoost

BrownBoost
• Reduce the weight given to misclassified
example

• Good (only) for very noisy data.

Bagging
Bootstrap AGGregatING
• Employs simplest way of combining predictions
that belong to the same type.
• Combining can be realized with voting or
averaging
• Each model receives equal weight
• ―Idealized‖ version of bagging:
– Sample several training sets of size n (instead of just
having one training set of size n)
– Build a classifier for each training set
– Combine the classifier’s predictions
• This improves performance in almost all cases if
learning scheme is unstable.

Wagging
Weighted AGGregatING
• A variant of bagging in which each
classifier is trained on the entire training set,
but each instance is stochastically assigned
a weight.

Random Forests
1. Choose T—number of trees to grow.
2. Choose m—number of variables used to split each node. m
≪ M, where M is the number of input variables. m is hold
constant while growing the forest.
3. Grow T trees. When growing each tree do the following.
(a) Construct a bootstrap sample of size n sampled from Sn with
replacement and grow a tree from this bootstrap sample.
(b) When growing a tree at each node select m variables at random
and use them to find the best split.
(c) Grow the tree to a maximal extent. There is no pruning.
4. To classify point X collect votes from every tree in the
forest and
then use majority voting to decide on the class label.

Variation of Random Forests
• Random Split Selection (Dietterich, 2000)
– Grow multiple trees
– When splitting, choose split uniformly at random from
– K best splits
– Can be used with or without pruning
• Random Subspace (Ho, 1998)
– Grow multiple trees
– Each tree is grown using a fixed subset of variables
– Do a majority vote or averaging to combine votes from
– different trees

DECORATE
(Melville & Mooney, 2003)
• Change training data by adding new
artificial training examples that encourage
diversity in the resulting ensemble.
• Improves accuracy when the training set is
small, and therefore resampling and
reweighting the training set has limited
ability to generate diverse alternative
hypotheses.

Overview of DECORATE

Current Ensemble
Training Examples
+
- C1
-
+
+
Base Learner
+
+
-
+
-
Artificial Examples


Current Ensemble
Training Examples
+
- C1
-
+
+
Base Learner C2
+
-
-
+
-
-
+
Artificial Examples


Current Ensemble
Training Examples
+
- C1
-
+
+
Base Learner C2
-
+
+
+ C3
-
Artificial Examples

Ensemble Taxonomy
(Rokach, 2009)

Diversity
generator

Members
Combiner
Dependency

Ensemble

Cross- Ensemble
Inducer size

Combiner
• Weighting methods
– Majority Voting
– Performance Weighting
– Distribution Summation
– Gating Network
• Meta-Learning
– Stacking
– Arbiter Trees
– Grading

Stacking
• Combiner f () is
another learner
(Wolpert, 1992)

Members Dependency
• Dependent Methods: There is an interaction
between the learning runs (AdaBoost)
– Model-guided Instance Selection: the classifiers that
were constructed in previous iterations are used for
selecting the training set in the subsequent iteration.
– Incremental Batch Learning: In this method the
classification produced in one iteration is given as prior
knowledge (a new feature) to the learning algorithm in
the subsequent iteration.
• Independent Methods (Bagging)

Cascading
Use dj only if
preceding ones
are not confident

Cascade learners
in order of
complexity

Diversity
• Manipulating the Inducer
• Manipulating the Training Sample
• Changing the target attribute representation
• Partitioning the search space - Each member is
trained on a different search subspace.
• Hybridization - Diversity is obtained by using
various base inducers or ensemble strategies.

Measuring the Diversity
• Pairwise measures calculate the average of a
particular distance metric between all possible
pairings of members in the ensemble, such as Q-
statistic or kappa-statistic.
• The non-pairwise measures either use the idea of
entropy or calculate a correlation of each ensemble
member with the averaged output.

Kappa-Statistic

i, j i, j
i, j
1 i, j

where i, j is the proportion of instances on which the classifiers i and j agree with each

other on the training set, and i, j is the probability that the two classifiers agree by

chance.

How crowded should the crowd be?
Ensemble Selection
• Why bother?
– Desired accuracy
– Computational cost
• Predetermine the ensemble size
• Use a certain criterion to stops training
• Pruning

Cross Inducer
• Inducer-dependent (like RandomForest).
• Inducer-independent (like bagging)

Multi-strategy Ensemble Learning
• Combines several ensemble strategies.
• MultiBoosting, an extension to AdaBoost
expressed by adding wagging-like features
can harness both AdaBoost's high bias and
variance reduction with wagging's superior
variance reduction.
• produces decision committees with lower
error than either AdaBoost or wagging.

Why using Ensembles?
• Statistical Reasons: Out of many classifier models with similar training / test
errors, which one shall we pick? If we just pick one at random, we risk the
possibility of choosing a really poor one
– Combining / averaging them may prevent us from making one such unfortunate
decision
• Computational Reasons: Every time we run a classification algorithm, we
may find different local optima
– Combining their outputs may allow us to find a solution that is closer to the global
minimum.
• Too little data / too much data:
– Generating multiple classifiers with the resampling of the available data / mutually
exclusive subsets of the available data
• Representational Reasons: The classifier space may not contain the solution
to a given particular problem. However, an ensemble of such classifiers may
– For example, linear classifiers cannot solve non-linearly separable problems,
however, their combination can.

There’s no real Paradox…
• Ideally, all committee members would be
right about everything!
• If not, they should be wrong about different
things.

No Free Lunch Theorem in Machine
Learning (Wolpert, 2001)
• “Or to put it another way, for any two
learning algorithms, there are just as many
situations (appropriately weighted) in
which algorithm one is superior to
algorithm two as vice versa, according to
any of the measures of "superiority"

So why developing new algorithms?
• The science of pattern recognition is mostly concerned with choosing
the most appropriate algorithm for the problem at hand
• This requires some a priori knowledge – data distribution, prior
probabilities, complexity of the problem, the physics of the underlying
phenomenon, etc.
• The No Free Lunch theorem tells us that – unless we have some a
priori knowledge – simple classifiers (or complex ones for that matter)
are not necessarily better than others. However, given some a priori
information, certain classifiers may better MATCH the characteristics
of certain type of problems.
• The main challenge of the patter recognition professional is then, to
identify the correct match between the problem and the classifier!
…which is yet another reason to arm yourself with a diverse set of PR
arsenal !

Ensemble and the
No Free Lunch Theorem
• Ensemble combine the strengths of each
classifier to make a super-learner.
• But … Ensemble only improves classification
if the component classifiers perform better
than chance
– Can not be guaranteed a priori
• Proven effective in many real-world
applications

Ensemble and Optimal Bayes Rule
• Given a finite amount of data, many hypothesis are
typically equally good. How can the learning algorithm
select among them?
• Optimal Bayes classifier recipe: take a weighted majority
• vote of all hypotheses weighted by their posterior
probability.
• That is, put most weight on hypotheses consistent with the
data.
• Hence, ensemble learning may be viewed as an
approximation of the Optimal Bayes rule (which is
provably the best possible classifier).

Bias and Variance Decomposition

Bias
– The hypothesis space made available by a
particular classification method does not
include sufficient hypotheses

Variance
– The hypothesis space made available is too
large for the training data, and the selected
hypothesis may not be accurate on unseen data

Bias and Variance
Decision Trees
• Small trees have high bias.
• Large trees have high
variance. Why?

from Elder, John. From Trees
to Forests and Rule Sets - A
Unified Overview of Ensemble
Methods. 2007.

For Any Model
(Not only decision trees)
• Given a target function
• Model has many parameters
– Generally low bias
– Fits data well
– Yields high variance
• Model has few parameters
– Generally high bias
– May not fit data well
– The fit does not change much for different data sets
(low variance)

Bias-Variance and Ensemble Learning
• Bagging: There exists empirical and theoretical
evidence that Bagging acts as variance reduction
machine (i.e., it reduces the variance part of the
error).
• AdaBoost: Empirical evidence suggests that
AdaBoost reduces both the bias and the variance
part of the error. In particular, it seems that bias is
mostly reduced in early iterations, while variance
in later ones.

Illustration on Bagging

y

x

Occam's razor
• The explanation of any phenomenon should
make as few assumptions as possible,
eliminating those that make no difference in
the observable predictions of the
explanatory hypothesis or theory

Contradiction with Occam’s Razor
• Ensemble Contradicts with Occam’s Razor
– More rounds -> more classifiers for voting ->
more complicated
– With the 0 training error, a more complicated
classifier may perform worse

Two Razors (Domingos, 1999)
• First razor: Given two models with the same
generalization error, the simpler one should be
preferred because simplicity is desirable in itself.

• On the other hand, within KDD Occam's razor is often
used in a quite different sense, that can be stated as:

• Second razor: Given two models with the same
training-set error, the simpler one should be preferred
because it is likely to have lower generalization error.
• Domingos: The first one is largely uncontroversial, while
the second one, taken literally, is false.

Summary
• “Two heads are better than none. One
hundred heads are so much better than
one”
– Dearg Doom, The Tain, Horslips, 1973
• “Great minds think alike, clever minds
think together‖ L. Zoref, 2011.

• But they must be different, specialised
• And it might be an idea to select only the
best of them for the problem at hand

Ensemble Learning: The Wisdom of Crowds (of Machines)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Ensemble Learning: The Wisdom of Crowds (of Machines)

Similar to Ensemble Learning: The Wisdom of Crowds (of Machines) (20)

Recently uploaded

Recently uploaded (20)

Ensemble Learning: The Wisdom of Crowds (of Machines)