Ensembles
Gonzalo Martínez Muñoz
Universidad Autónoma de Madrid
Outline
•  What is an ensemble? How to build them?
•  Bagging, Boosting, Random forests, class-switching
•  Combiners
•  Stacking
•  Other techniques
•  Why they work? Success stories
Condorcet Jury Theorem
•  The combination of opinions is rooted in human culture.
•  Formalized by the Condorcet Jury Theorem:
Given a jury of voters, and assuming independent errors: if the probability that each single member of the jury is correct is above 50%, then the probability that the jury as a whole is correct tends to 100% as the number of jurors increases.
Nicolas de Condorcet (1743-1794), French mathematician
What is an ensemble?
•  An ensemble is a combination of classifiers that outputs a final classification.
[Figure: a new instance x is passed to T=7 classifiers, which predict 1, 1, 2, 1, 2, 1, 1; the majority vote yields class 1.]
General idea
•  Generate many classifiers and combine them to get a final classification.
•  They perform very well, in general better than any of the single learners they are composed of.
•  The classifiers should be different from one another.
•  It is important to generate diverse classifiers from the available data.
How to build them?
•  There are several techniques to build diverse base learners in an ensemble:
•  Use modified versions of the training set to train the base learners
•  Introduce changes in the learning algorithms
•  These strategies can also be used in combination.
•  Generally, the greater the randomization, the better the results.
How to build them?
•  Modifications of the training set can be generated by:
•  Resampling the dataset: bootstrap sampling (e.g. bagging) or weighted sampling (e.g. boosting)
•  Altering the attributes: the base learners are trained using different feature subsets (e.g. Random Subspaces)
•  Altering the class labels: grouping classes into two new class values at random (e.g. ECOC) or modifying the class labels at random (e.g. class-switching)
How to build them?
•  Randomizing the learning algorithms:
•  Introducing some randomness into the learning algorithm, so that two consecutive executions of the algorithm output different classifiers
•  Running the base learner with different architectures, parameters, etc.
Bagging (Bootstrap Aggregation)
Input: dataset L, ensemble size T
for t = 1 to T:
    sample = BootstrapSample(L)
    h_t = TrainClassifier(sample)
Output: $H(\mathbf{x}) = \arg\max_j \sum_{t=1}^{T} I\big(h_t(\mathbf{x}) = j\big)$
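As a minimal sketch of the algorithm above (assuming scikit-learn decision trees as base learners; function names and the toy dataset are illustrative, not part of the original slides):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=100, rng=np.random.default_rng(0)):
    """Train T trees, each on a bootstrap sample of (X, y)."""
    n, ensemble = len(X), []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)  # bootstrap: n indices drawn with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """H(x) = argmax_j sum_t I(h_t(x) = j): plain majority vote."""
    votes = np.stack([h.predict(X) for h in ensemble])  # shape (T, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = make_classification(n_samples=300, random_state=0)
ensemble = bagging_fit(X, y, T=25)
print(bagging_predict(ensemble, X[:5]))
```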
Bagging
[Figure: the original dataset is resampled into T bootstrap samples; each bootstrap sample contains repeated examples and omits others.]
Considerations about bagging
•  Each classifier is built using, on average, 63.2% of the distinct training examples (see the check below).
•  It is very robust against label noise.
•  In general, it improves the error of the single learner.
•  Easily parallelizable.
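The 63.2% figure follows from the bootstrap: the probability that a given example is never drawn in N samples with replacement is (1 - 1/N)^N ≈ e^{-1} ≈ 0.368. A quick numerical check:

```python
n = 100_000
# Expected fraction of distinct training examples in one bootstrap sample.
print(1 - (1 - 1/n) ** n)  # ≈ 0.6321
```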
Boosting
Input: dataset L with N examples, ensemble size T
Initialize all example weights to 1/N
for t = 1 to T:
    h_t = BuildClassifier(L, weights)
    e_t = WeightedError(L, weights)
    if e_t == 0 or e_t ≥ 0.5: break
    Multiply the weights of the instances misclassified by h_t by (1-e_t)/e_t
    Normalize the weights
Output: $H(\mathbf{x}) = \arg\max_j \sum_{t=1}^{T} \log\frac{1-e_t}{e_t}\, I\big(h_t(\mathbf{x}) = j\big)$
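A compact sketch of this loop in the style of AdaBoost.M1 (assuming scikit-learn decision stumps as weak learners; names are illustrative, and scikit-learn's own AdaBoostClassifier is the production alternative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_fit(X, y, T=100):
    n = len(X)
    w = np.full(n, 1.0 / n)                       # uniform initial weights
    ensemble = []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = h.predict(X) != y
        e = np.sum(w[miss])                       # weighted training error
        if e == 0 or e >= 0.5:
            break
        w[miss] *= (1 - e) / e                    # up-weight misclassified instances
        w /= w.sum()                              # normalize
        ensemble.append((np.log((1 - e) / e), h))  # vote weight log((1-e_t)/e_t)
    return ensemble

def boosting_predict(ensemble, X, classes):
    scores = np.zeros((len(X), len(classes)))
    for alpha, h in ensemble:
        pred = h.predict(X)
        for j, c in enumerate(classes):
            scores[:, j] += alpha * (pred == c)   # weighted vote per class
    return np.asarray(classes)[scores.argmax(axis=1)]
```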
Boosting
[Figure: the original dataset is reweighted at each iteration; in iterations 1, 2, … the misclassified examples receive larger weights.]
Considerations about boosting
•  Obtains very good generalization error on average.
•  It is not robust against class label noise.
•  It can increase the error of the base classifier.
•  Cannot be easily parallelized.
Random forest
•  Breiman defined a random forest as an ensemble that:
•  Has decision trees as its base learners
•  Introduces some randomness into the learning process
•  Under this definition, bagging of decision trees qualifies as a random forest, and in fact it is one. However…
Random forest
•  In practice, it is usually taken to be an ensemble in which:
•  Each tree is generated, as in bagging, using a bootstrap sample
•  Each split of each tree is computed using:
  •  A random subset of the features
  •  The best split within this subset is then selected
•  Unpruned trees are used
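With scikit-learn, this recipe corresponds roughly to the following configuration (the parameter values are illustrative choices, not prescribed by the slides):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,     # as many trees as the budget allows
    max_features="sqrt",  # random subset of features evaluated at each split
    bootstrap=True,       # each tree sees a bootstrap sample, as in bagging
    max_depth=None,       # unpruned trees
)
```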
Considerations about random forests
•  Its performance is better than boosting in most cases.
•  It is robust to noise (does not overfit).
•  Random forest introduces an additional randomization mechanism with respect to bagging.
•  Easily parallelizable.
•  Random trees are very fast to train.
Class switching
•  Class switching is an ensemble method in which
diversity is obtained by using different versions of
the training data polluted with class label noise.
•  Specifically, to train each base learner, the class
label of each training point is changed to a different
class label with probability p.
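A minimal sketch of this switching step (assuming scikit-learn decision trees; the helper names, p, and the random generator are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def switch_labels(y, p, rng):
    """Flip each label to a different class with probability p."""
    y = y.copy()
    classes = np.unique(y)
    flip = rng.random(len(y)) < p
    for i in np.where(flip)[0]:
        y[i] = rng.choice(classes[classes != y[i]])  # pick any other class
    return y

def class_switching_fit(X, y, T=1000, p=0.3, rng=np.random.default_rng(0)):
    """Each tree is trained on a differently perturbed copy of the labels."""
    return [DecisionTreeClassifier().fit(X, switch_labels(y, p, rng))
            for _ in range(T)]
```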
Class switching
[Figure: the original dataset and T randomly perturbed versions; in each version, the class labels of p = 30% of the instances have been switched.]
Example
•  2D example
•  Boundary is x1 = x2
•  x1 ~ U[0, 1], x2 ~ U[0, 1]
•  Not an easy task for a normal decision tree
•  Let's try bagging, boosting and class-switching with p = 0.2 and p = 0.4 (see the sketch below)
[Figure: the unit square with classes 1 and 2 separated by the diagonal x1 = x2.]
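A sketch of this synthetic problem (assuming scikit-learn; the ensemble in the last lines is one illustrative choice, swap in the method under study):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 2))            # x1, x2 ~ U[0, 1]
y = (X[:, 0] > X[:, 1]).astype(int)  # boundary x1 = x2

clf = BaggingClassifier(n_estimators=101).fit(X, y)
print(clf.score(X, y))
```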
Results
[Figure: decision boundaries of bagging, boosting, class-switching (p = 0.2) and class-switching (p = 0.4) for ensembles of 1, 11, 101 and 1001 classifiers.]
Parametrization: generally used parameters
•  Bagging: unpruned decision trees; ensemble size T as large as possible; smaller samples as an option.
•  Boosting: pruned decision trees (weak learners); T in the hundreds.
•  Random forest: unpruned random decision trees; T as large as possible; number of random features per split = log(#features) or sqrt(#features).
•  Class-switching: unpruned decision trees; T above a thousand; fraction of instances to modify p ≈ 30%.
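These defaults map onto scikit-learn roughly as follows (a sketch with illustrative values; class-switching has no built-in scikit-learn implementation, see the sketch earlier in this section):

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier

bagging = BaggingClassifier(n_estimators=500)    # unpruned trees; T as large as affordable
boosting = AdaBoostClassifier(n_estimators=200)  # weak learners (stumps); hundreds
forest = RandomForestClassifier(n_estimators=500,
                                max_features="sqrt")  # sqrt(#features) per split
```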
Combiners
•  The combination techniques can be divided into two groups:
•  Voting strategies: the ensemble prediction is the class label that is predicted most often by the base learners. The votes can be weighted.
•  Non-voting strategies: operations such as maximum, minimum, product, median and mean are applied to the confidence levels output by the individual base learners.
•  There is no single winning strategy among the different combination techniques; it depends on many factors. A sketch of both families follows.
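As a minimal sketch, both families operate on a matrix of per-classifier confidence levels for one instance (shape T × n_classes; the numbers are made up for illustration):

```python
import numpy as np

probs = np.array([[0.6, 0.4],   # confidence outputs of T=3 base learners
                  [0.3, 0.7],
                  [0.8, 0.2]])

vote = np.bincount(probs.argmax(axis=1)).argmax()  # unweighted majority vote
mean = probs.mean(axis=0).argmax()                 # non-voting: mean of confidences
print(vote, mean)                                  # both pick class 0 here
```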
Stacking
•  In stacking, the combination phase is included in the learning process.
•  First, the base learners are trained on some version of the original training set.
•  After that, the predictions of the base learners are used as new feature vectors to train a second-level learner (meta-learner).
•  The key point of this strategy is to improve the guesses made by the base learners by generalizing over those guesses with the meta-learner.
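A minimal sketch of the two-level scheme (assuming scikit-learn; the choice of base learners and meta-learner is illustrative; sklearn.ensemble.StackingClassifier implements the same idea in production form):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y):
    base = [DecisionTreeClassifier(), KNeighborsClassifier()]
    # Level 0: out-of-fold predictions become the meta-features,
    # so the meta-learner never sees predictions on training data.
    Z = np.column_stack([cross_val_predict(h, X, y, cv=5) for h in base])
    meta = LogisticRegression().fit(Z, y)  # level 1: meta-learner
    base = [h.fit(X, y) for h in base]     # refit base learners on full data
    return base, meta

def stacking_predict(base, meta, X):
    Z = np.column_stack([h.predict(X) for h in base])
    return meta.predict(Z)
```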
Stacking example
•  Extract descriptors from the input.
•  A random forest is trained on the descriptors; each leaf node stores its class histogram.
•  In a second phase, stacking is applied:
  •  The histograms of the leaf nodes reached are accumulated over all trees
  •  The accumulated histograms are concatenated
  •  Boosting is applied to the concatenated histograms
[Figure: a random forest h1, h2, …, hn produces evidence histograms that form the stacking dataset; the stacked classifier produces the final output.]
Ensemble pruning
1. Random ordering produced by bagging: h1, h2, h3, …, hT
2. New ordering: hs1, hs2, hs3, …, hsT
3. Pruning: keep hs1, …, hsM (a percentage of the original ensemble)
Result: size reduction and classification error reduction.
[Figure: error vs. number of classifiers (20 to 200) for Bagging, Reduce-error ordering and CART.]
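A minimal sketch of the reduce-error reordering step (greedy selection on a validation set, assuming integer class labels 0..n_classes-1; names are illustrative):

```python
import numpy as np

def reduce_error_order(ensemble, X_val, y_val, n_classes):
    """Greedily order classifiers so each prefix minimizes validation error."""
    onehot = [np.eye(n_classes)[h.predict(X_val)] for h in ensemble]  # vote matrices
    remaining = list(range(len(ensemble)))
    order, votes = [], np.zeros((len(X_val), n_classes))
    while remaining:
        # Pick the classifier whose addition yields the lowest ensemble error.
        best = min(remaining, key=lambda t: np.mean(
            (votes + onehot[t]).argmax(axis=1) != y_val))
        votes += onehot[best]
        order.append(best)
        remaining.remove(best)
    return order  # pruning: keep only the first M classifiers, order[:M]
```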
Dynamic ensemble pruning
•  Do we really need to query all classifiers in the ensemble? NO.
[Figure: a new instance x is queried classifier by classifier (t = 1, …, 7); the accumulated votes for the winning class grow until the final class (1) can no longer change, so the remaining classifiers need not be queried.]
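A minimal sketch of the idea, stopping only when the outcome is fully decided (assuming integer labels 0..n_classes-1; published dynamic pruning methods can stop earlier by accepting a small probability of disagreement with the full ensemble):

```python
import numpy as np

def dynamic_vote(ensemble, x, n_classes):
    """Query classifiers sequentially; stop when the majority cannot change."""
    votes = np.zeros(n_classes, dtype=int)
    for t, h in enumerate(ensemble):
        votes[h.predict(x.reshape(1, -1))[0]] += 1
        remaining = len(ensemble) - t - 1
        top, second = np.sort(votes)[-1], np.sort(votes)[-2]
        if top - second > remaining:  # leader unreachable: outcome decided
            break
    return votes.argmax()
```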
Why they work?
•  Reasons for their good results:
•  Statistical reasons: there is not enough data for the classification algorithm to single out the optimal hypothesis.
•  Computational reasons: the single algorithm is not capable of reaching the optimal solution.
•  Representational (expressive) reasons: the optimal solution is outside the hypothesis space.
Why they work?
[Figure: Thomas Dietterich's illustration of the statistical, computational and representational reasons.]
Why they work?
A set of suboptimal solutions can be created that compensate for each other's limitations when combined in the ensemble.
Success story 1: Netflix prize challenge
•  Dataset: ratings of 17,770 movies by 480,189 users.
•  The winning solution combines hundreds of models from three teams.
•  It uses a variant of stacking.
Success story 2: KDD cup
•  KDD cup 2013: predict papers written by a given author.
•  The winning team used Random Forest and Boosting, among other models, combined with regularized linear regression.
•  KDD cup 2014: predict funding requests that deserve an A+ on donorschoose.org.
•  Multistage ensemble.
•  KDD cup 2015: predict dropouts in MOOCs.
•  Multistage ensemble.
Success story 3: Kinect
•  Computer vision.
•  Classify pixels into body parts (leg, head, etc.).
•  Uses random forests.
Good things about ensembles
•  A family of machine learning algorithms with one of the best overall performances, comparable to or better than SVMs.
•  Almost parameter-free learning algorithms.
•  If decision trees are the base learners, they are cheap (fast) to train and to test.
Bad things about ensembles
•  None! Well, maybe something…
•  Slower than a single classifier, since we create hundreds or thousands of classifiers.
•  This can be mitigated using ensemble pruning.