Prof. Pier Luca Lanzi
Classification: Ensemble Methods
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Readings
•  “The Elements of Statistical Learning: Data Mining, Inference, and
Prediction,” Second Edition, February 2009, Trevor Hastie,
Robert Tibshirani, Jerome Friedman
(http://statweb.stanford.edu/~tibs/ElemStatLearn/)
§ Section 8.7
§ Chapter 10
§ Chapter 15
§ Chapter 16
What are Ensemble Methods?
What is the idea?
(Figure: several doctors produce independent diagnoses, which are combined into a final diagnosis.)
Ensemble Methods
Generate a set of classifiers from the training data
Predict class label of previously unseen cases by
aggregating predictions made by multiple classifiers
What is the General Idea?
(Figure: Step 1: from the original training data D, create multiple data sets D1, D2, …, Dt-1, Dt. Step 2: build one classifier Ci from each data set Di. Step 3: combine the classifiers C1, …, Ct into a single classifier C*.)
Building Model Ensembles
•  Basic idea
§ Build different “experts”, let them vote
•  Advantage
§ Often improves predictive performance
•  Disadvantage
§ Usually produces output that is very hard to analyze
However, there are approaches that aim to
produce a single comprehensible structure
Why does it work?
•  Suppose there are 25 base classifiers
•  Each classifier has an error rate ε = 0.35
•  Assume classifiers are independent
•  The probability that the ensemble makes a wrong prediction, i.e., that a
majority (13 or more) of the 25 classifiers errs, is
P(error) = Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06
which is far lower than the error rate of any individual classifier
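A quick numerical check of this claim (my sketch, assuming the majority vote of 25 independent classifiers errs exactly when at least 13 of them err):

from math import comb

# Probability that a majority vote of 25 independent classifiers errs,
# given each has individual error rate eps = 0.35.
n, eps = 25, 0.35
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
              for i in range(13, n + 1))
print(f"P(ensemble errs) = {p_wrong:.4f}")  # about 0.06, far below 0.35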
how can we generate several models using
the same data and the same approach (e.g., trees)?
we might …
randomize the selection of the training data?
randomize the approach?
…
Bagging or Bootstrap Aggregation
What is Bagging? (Bootstrap Aggregation)
•  Analogy
§ Diagnosis based on multiple doctors’ majority vote
•  Training
§ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
§ A classifier model Mi is learned for each training set Di
•  Classification (classify an unknown sample X)
§ Each classifier Mi returns its class prediction
§ The bagged classifier M* counts the votes and 
assigns the class with the most votes to X
•  Prediction
§ It can be applied to the prediction of continuous values by taking
the average of the predictions made for a given test tuple
(a short code sketch follows)
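As an illustration (not part of the original slides), this is how the scheme looks in scikit-learn; the dataset is a synthetic stand-in, and the parameter is named estimator in recent versions (base_estimator before scikit-learn 1.2):

# Bagging decision trees: bootstrap samples + majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=25, random_state=0)
bag.fit(X_tr, y_tr)
print("bagged accuracy:", bag.score(X_te, y_te))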
What is Bagging?
•  Combining predictions by voting/averaging
§ The simplest way is to give equal weight to each model
•  “Idealized” version
§ Sample several training sets of size n
(instead of just having one training set of size n)
§ Build a classifier for each training set
§ Combine the classifiers’ predictions
•  Problem
§ We only have one dataset!
•  Solution
§ Generate new ones of size n using Bootstrap, 
i.e., sampling from it with replacement
Bagging Classifiers

Model generation:
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances from the training set (with replacement).
    Apply the learning algorithm to the sample.
    Store the resulting model.

Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that is predicted most often.
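The pseudocode above translates almost line by line into Python. The following is my minimal sketch (numpy arrays, integer class labels, and any scikit-learn-style learner with fit/predict), not the course's reference code:

import numpy as np
from sklearn.base import clone

def bagging_fit(learner, X, y, t, rng):
    # Model generation: t models, each fit on a bootstrap sample of size n.
    n = len(X)
    return [clone(learner).fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(t))]

def bagging_predict(models, X):
    # Classification: return the class predicted most often by the t models.
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)

# e.g.: models = bagging_fit(DecisionTreeClassifier(), X, y, t=25,
#                            rng=np.random.default_rng(0))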
Bagging works because it reduces
variance by voting/averaging
usually, the more classifiers the better
it can help a lot if data are noisy
however, in some pathological hypothetical
situations the overall error might increase
When Does Bagging Work?
•  A learning algorithm is unstable if small changes to the training
set cause large changes in the learned classifier
•  If the learning algorithm is unstable, then 
Bagging almost always improves performance
•  Bagging stable classifiers is not a good idea
•  Examples of unstable classifiers? Decision trees, regression trees,
linear regression, neural networks
•  Examples of stable classifiers? K-nearest neighbors
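A hedged sanity check of the stability claim (the synthetic dataset, noise level, and ensemble size are arbitrary choices of mine): bagging should lift the unstable tree noticeably more than the stable k-NN:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, flip_y=0.1, random_state=0)
for base in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier()):
    single = cross_val_score(base, X, y).mean()
    bagged = cross_val_score(
        BaggingClassifier(estimator=base, n_estimators=25, random_state=0),
        X, y).mean()
    print(f"{type(base).__name__}: single={single:.3f}, bagged={bagged:.3f}")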
Randomize the Algorithm, not the Data
•  We could randomize the learning algorithm instead of the data
•  Some algorithms already have a random component, 
e.g. initial weights in neural net
•  Most algorithms can be randomized (e.g., attribute selection in
decision trees)
•  More generally applicable than bagging, e.g. random subsets in
nearest-neighbor scheme
•  It can be combined with bagging
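One concrete instance of this idea (my example, not from the slides) is scikit-learn's ExtraTreesClassifier, which keeps the full training set but randomizes attribute and threshold selection at each split:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, random_state=0)
# bootstrap=False is the default: every tree sees the whole training set,
# so all the diversity comes from the randomized split selection.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)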
Boosting
What is Boosting?
•  Analogy
§ Consult several doctors, based on a combination of weighted diagnoses
§ Weight assigned based on the previous diagnosis accuracy
•  How does Boosting work?
§ Weights are assigned to each training example
§ A series of k classifiers is iteratively learned
§ After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier Mi+1 to pay more attention to the training tuples that
were misclassified by Mi
§ The final M* combines the votes of each individual classifier, where the
weight of each classifier's vote is a function of its accuracy
Example of Boosting
•  Suppose there are just 5 training examples {1,2,3,4,5}
•  Initially each example has a 0.2 (1/5) probability of being sampled
•  1st round of boosting samples (with replacement) 5 examples:
{2, 4, 4, 3, 2} and builds a classifier from them
•  Suppose examples 2, 3, 5 are correctly predicted by this classifier,
and examples 1, 4 are wrongly predicted
§ Weight of examples 1 and 4 is increased,
§ Weight of examples 2, 3, 5 is decreased
•  2nd round of boosting samples again 5 examples, but now
examples 1 and 4 are more likely to be sampled; and so on until
convergence is achieved
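A toy sketch of one re-weighting round (my illustration, using the AdaBoost.M1-style update shown on the next slide; the example above describes the effect only qualitatively):

import numpy as np

w = np.full(5, 0.2)                                   # initial weights
correct = np.array([False, True, True, False, True])  # 1 and 4 misclassified
e = w[~correct].sum()          # weighted error of this round's classifier
w[correct] *= e / (1 - e)      # down-weight correctly classified examples
w /= w.sum()                   # normalize
print(w)  # approx [0.25 0.167 0.167 0.25 0.167]: examples 1 and 4 go up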
AdaBoost.M1

Model generation:
  Assign equal weight to each training instance.
  For t iterations:
    Apply the learning algorithm to the weighted dataset,
    store the resulting model.
    Compute the model’s error e on the weighted dataset.
    If e = 0 or e ≥ 0.5: terminate model generation.
    For each instance in the dataset:
      If classified correctly by the model:
        Multiply the instance’s weight by e/(1−e).
    Normalize the weights of all instances.

Classification:
  Assign weight 0 to all classes.
  For each of the t (or fewer) models:
    Add −log(e/(1−e)) to the weight of the class this model predicts.
  Return the class with the highest weight.
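The following is my minimal translation of AdaBoost.M1 into Python (assuming a scikit-learn-style learner that accepts sample_weight, and a known list of class labels); a sketch, not a reference implementation:

import numpy as np
from sklearn.base import clone

def adaboost_m1_fit(learner, X, y, t):
    w = np.full(len(X), 1.0 / len(X))         # equal initial weights
    models, weights = [], []
    for _ in range(t):
        m = clone(learner).fit(X, y, sample_weight=w)
        wrong = m.predict(X) != y
        e = w[wrong].sum()                    # model's error on weighted data
        if e == 0 or e >= 0.5:                # terminate model generation
            break
        models.append(m)
        weights.append(-np.log(e / (1 - e)))  # this model's voting weight
        w[~wrong] *= e / (1 - e)              # shrink correct instances' weight
        w /= w.sum()                          # normalize
    return models, weights

def adaboost_m1_predict(models, weights, X, classes):
    score = np.zeros((len(X), len(classes)))  # accumulated weight per class
    for m, a in zip(models, weights):
        pred = m.predict(X)
        for j, c in enumerate(classes):
            score[:, j] += a * (pred == c)    # model adds a to its class
    return np.asarray(classes)[score.argmax(axis=1)]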
More on Boosting
•  Boosting needs weights ... but
•  Can adapt learning algorithm ... or
•  Can apply boosting without weights
§ Resample with probability determined by weights
§ Disadvantage: not all instances are used
§ Advantage: if the error is ≥ 0.5, we can resample and try again
•  Stems from computational learning theory
•  Theoretical result: training error decreases exponentially
•  It works if base classifiers are not too complex, and their error doesn’t become
too large too quickly
•  Boosting works with weak learners; the only condition is that the error doesn’t exceed 0.5
•  In practice, boosting sometimes overfits (in contrast to bagging)
AdaBoost
•  Given the base classifiers C1, C2, …, CT
•  The error rate of classifier Ci is computed as
ε_i = (1/N) Σ_j w_j δ(C_i(x_j) ≠ y_j)
•  The importance of a classifier is computed as
α_i = (1/2) ln( (1 − ε_i) / ε_i )
•  Classification is computed as
C*(x) = argmax_y Σ_i α_i δ(C_i(x) = y)
Boosting
(Figure: a toy training set of ten points labelled + and −, each with initial weight 0.1. Round 1 samples the data, builds classifier B1 with α = 1.9459, and re-weights the points, e.g. to 0.0094, 0.0094, and 0.4623.)
Boosting
(Figure: three rounds of boosting on the toy data. The point weights change each round, e.g. 0.0094 / 0.0094 / 0.4623 after B1, 0.3037 / 0.0009 / 0.0422 after B2, and 0.0276 / 0.1819 / 0.0038 after B3; the classifier weights are α = 1.9459, 2.9323, and 3.8744, and the overall classifier combines B1, B2, B3.)
(Figure: the boosting cascade. A first classifier C(0)(x) is trained on the original training sample; each later classifier C(1)(x), C(2)(x), …, C(m)(x) is trained on a sample re-weighted after the previous round.)
The final classifier is the weighted sum of the individual classifiers:
y(x) = Σ_i w_i C^(i)(x)
(Figure: the same boosting cascade, with each classifier annotated by its error rate f_err.)
AdaBoost re-weights events misclassified by the previous classifier by:
(1 − f_err) / f_err,   with   f_err = (misclassified events) / (all events)
AdaBoost also weights the classifiers using the error rate of the individual classifier, according to:
y(x) = Σ_i log( (1 − f_err^(i)) / f_err^(i) ) · C^(i)(x)
AdaBoost in Pictures
(Figure: a sequence of decision boundaries. Start with equal event weights; after each round, misclassified events get larger weights; and so on.)
Boosting
•  Also uses voting/averaging
•  Weights models according to performance
•  Iterative: new models are influenced 
by the performance of previously built ones
§ Encourage new model to become an “expert” for instances
misclassified by earlier models
§ Intuitive justification: models should be experts that
complement each other
•  Several variants
§ Boosting by sampling, 
the weights are used to sample the data for training
§ Boosting by weighting,
the weights are used by the learning algorithm
Random Forests
a learning ensemble consisting of bagged,
unpruned decision trees with a randomized
selection of features at each split
What is a Random Forest?
•  Random forests (RF) are a combination of tree predictors
•  Each tree depends on the values of a random vector sampled
independently and with the same distribution for all trees in the
forest
•  The generalization error of a forest of tree classifiers depends on
the strength of the individual trees in the forest and the
correlation between them
•  Using a random selection of features to split each node yields
error rates that compare favorably to AdaBoost and are more
robust with respect to noise
How do Random Forests Work?
•  D = training set; F = set of variables to test for each split
•  k = number of trees in the forest; n = number of variables
•  for i = 1 to k do
  § build data set Di by sampling with replacement from D
  § learn tree Ti from Di, using at each node only a random subset F of the n variables
  § save tree Ti as it is; do not perform any pruning
•  The output is computed as the majority vote (for classification) or the
average (for prediction) of the set of k generated trees
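In scikit-learn terms (an illustration of mine, on a synthetic dataset), k is n_estimators and the size of the random feature subset F is max_features; trees are left unpruned by default:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,     # k: number of trees in the forest
    max_features="sqrt",  # F: random subset of variables tried at each split
    random_state=0,
).fit(X, y)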
Out of Bag Samples
•  For each observation (xi, yi), construct its random forest predictor
by averaging only those trees corresponding to bootstrap samples
in which the observation did not appear.
•  The OOB error estimate is almost identical to that obtained by
N-fold cross-validation.
•  Thus, random forests can be fit in one sequence, with cross-validation
being performed along the way
•  Once the OOB error stabilizes, the training can be terminated.
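In scikit-learn (again my illustration, not the slides' code), the OOB estimate is computed alongside training by setting oob_score=True:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)  # close to an N-fold CV estimate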
Properties of Random Forests
•  Easy to use (off-the-shelf), with only 2 parameters
(number of trees, % of variables to test per split)
•  Very high accuracy
•  No overfitting when a large number of trees is selected (choose it high)
•  Insensitive to the choice of the split % (~20%)
•  Returns an estimate of variable importance (see the sketch below)
•  Random forests are an effective tool in prediction.
•  Forests give results competitive with boosting and adaptive
bagging, yet do not progressively change the training set.
•  Random inputs and random features produce good results in
classification, less so in regression.
•  For larger data sets, we can gain accuracy by combining random
features with boosting.
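The variable-importance estimate mentioned above is exposed after fitting; a small sketch (scikit-learn's impurity-based importances, on a synthetic dataset of mine):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for i in np.argsort(rf.feature_importances_)[::-1][:3]:
    print(f"variable {i}: importance {rf.feature_importances_[i]:.3f}")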
