Ensemble Learning:
The Wisdom of Crowds
    (of Machines)
  Lior Rokach
  Department of Information Systems Engineering
  Ben-Gurion University of the Negev
About Me
Prof. Lior Rokach
Department of Information Systems Engineering
Faculty of Engineering Sciences
Head of the Machine Learning Lab
Ben-Gurion University of the Negev

Email: liorrk@bgu.ac.il
http://www.ise.bgu.ac.il/faculty/liorr/



PhD (2004) from Tel Aviv University
The Condorcet Jury Theorem
• If each voter has a probability p of being correct
  and the probability of a majority of voters being
  correct is M,
• then p > 0.5 implies M > p.
• Also M approaches 1, for all p > 0.5 as the number
  of voters approaches infinity.
• This theorem was proposed by the Marquis of
  Condorcet in 1784
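
For readers who want to check the theorem numerically, here is a minimal Python sketch (the crowd sizes and p = 0.6 are arbitrary illustrations) that computes the majority's accuracy exactly from the binomial distribution:

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the right decision (n odd, so no ties)."""
    k_min = n // 2 + 1   # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# For p > 0.5 the majority beats any single voter, and approaches 1 as n grows.
for n in (1, 11, 101, 1001):
    print(n, round(majority_accuracy(0.6, n), 4))
```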
Francis Galton
• Galton promoted statistics and invented the
  concept of correlation.
• In 1906 Galton visited a livestock fair and
  stumbled upon an intriguing contest.
• An ox was on display, and the villagers were
  invited to guess the animal's weight.
• Nearly 800 gave it a go and, not surprisingly,
  not one hit the exact mark: 1,198 pounds.
• Astonishingly, however, the average of those
  800 guesses came close - very close indeed. It
  was 1,197 pounds.
The Wisdom of Crowds
Why the Many Are Smarter Than the Few and How Collective Wisdom

        Shapes Business, Economies, Societies and Nations



             • Under certain controlled conditions, aggregating
               information in groups results in decisions that are
               often superior to those made by any single member of
               the group - even an expert.
             • This imitates our second nature of seeking several
               opinions before making any crucial decision: we weigh
               the individual opinions and combine them to reach a
               final decision.
Committees of Experts
   – "… a medical school that has the objective that all students,
     given a problem, come up with an identical solution"

• There is not much point in setting up a committee of
  experts from such a group - such a committee will not
  improve on the judgment of an individual.

• Consider:
   – There needs to be disagreement for the committee to have
     the potential to be better than an individual.
Does it always work?
• Not all crowds (groups) are wise.
  – Example: crazed investors in a stock market
    bubble.
Key Criteria
• Diversity of opinion
   – Each person should have private information even if it's just an
     eccentric interpretation of the known facts.
• Independence
   – People's opinions aren't determined by the opinions of those
     around them.
• Decentralization
   – People are able to specialize and draw on local knowledge.
• Aggregation
   – Some mechanism exists for turning private judgments into a
     collective decision.
Teaser: How good are ensemble methods?



   Let’s look at the Netflix Prize Competition…
Began October 2006




• Supervised learning task
   – Training data is a set of users and ratings (1,2,3,4,5
     stars) those users have given to movies.
   – Construct a classifier that given a user and an unrated
     movie, correctly classifies that movie as either 1, 2, 3,
     4, or 5 stars


• $1 million prize for a 10% improvement over
  Netflix’s current movie recommender/classifier
  (RMSE = 0.9514)
Learning biases
• Occam’s razor
  "Among the theories that are consistent with the data,
  select the simplest one."

• Epicurus’ principle
  "Keep all theories that are consistent with the data"
  [not necessarily with equal weights]
     E.g. Bayesian learning
          Ensemble learning
Strong and Weak Learners
• Strong (PAC) Learner
  – Take labeled data for training
  – Produce a classifier which can be arbitrarily
    accurate
  – Objective of machine learning
• Weak (PAC) Learner
  – Take labeled data for training
  – Produce a classifier which is more accurate
    than random guessing
Ensembles of classifiers
• Given some training data
     D_train = {(x_n, y_n) : n = 1, ..., N_train}
• Inductive learning
     L: D_train → h(·), where h(·): X → Y
• Ensemble learning
     L1: D_train → h1(·)
     L2: D_train → h2(·)
          ...
     LT: D_train → hT(·)
     Ensemble: {h1(·), h2(·), ..., hT(·)}
Classification by majority voting
(figure: a new instance x is classified by T = 7 base classifiers;
their votes are 1, 1, 1, 2, 1, 2, 1; accumulated votes: class 1 → 5,
class 2 → 2; final class: 1)

          Alberto Suárez (2012)
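
A minimal sketch of the voting step shown above, assuming the T base classifiers have already produced their predictions (the vote pattern is taken from the figure):

```python
from collections import Counter

def majority_vote(votes):
    """Return the class receiving the most votes (ties broken arbitrarily)."""
    return Counter(votes).most_common(1)[0][0]

votes = [1, 1, 1, 2, 1, 2, 1]   # the T = 7 votes from the figure
print(majority_vote(votes))     # -> 1 (class 1 gets 5 of 7 votes)
```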
Popular Ensemble Methods
Boosting
• Learners
  – Strong learners are very difficult to construct
  – Constructing weak learners is relatively easy
• Strategy
  – Derive strong learner from weak learner
  – Boost weak classifiers to a strong learner
Construct Weak Classifiers
• Using Different Data Distribution
  – Start with uniform weighting
  – During each step of learning
     • Increase weights of the examples which are not
       correctly learned by the weak learner
     • Decrease weights of the examples which are
       correctly learned by the weak learner
• Idea
  – Focus on difficult examples which are not
    correctly classified in the previous steps
Combine Weak Classifiers
• Weighted Voting
  – Construct strong classifier by weighted voting
    of the weak classifiers
• Idea
  – Better weak classifier gets a larger weight
  – Iteratively add weak classifiers
     • Increase accuracy of the combined classifier through
       minimization of a cost function
AdaBoost (Adaptive Boosting)
    (Freund and Schapire, 1997)

Generate a sequence of base-learners, each focusing on the previous
one’s errors (Freund and Schapire, 1996)
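
In practice the procedure is available off the shelf; below is a minimal usage sketch with scikit-learn's AdaBoostClassifier (the dataset and parameter values are arbitrary illustrations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The default base learner is a depth-1 decision tree (a "stump"); each round
# re-weights the training examples the previous stumps got wrong.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```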
AdaBoost Example
(figure: the training rounds and the resulting combined classifier)
Example of a Good Classifier
(figure: a decision boundary that separates the + examples from the − examples)
The initial distribution
Train data
 x1  x2  y    D1
  1   5  +   0.10
  2   3  −   0.10 is wrong; see corrected row below
Round 1 of 3
(figure: weak classifier h1 and the re-weighted distribution D2;
the three circled + examples are the ones h1 misclassifies)
    ε1 = 0.30
    α1 = 0.42
How has the distribution changed?
Train data                 Round 1
 x1  x2  y    D1      h1 miss    e     D2
  1   5  +   0.10        0     0.00   0.07
  2   3  +   0.10        0     0.00   0.07
  3   2  −   0.10        0     0.00   0.07
  4   6  −   0.10        0     0.00   0.07
  4   7  +   0.10        1     0.10   0.17
  5   9  +   0.10        1     0.10   0.17
  6   5  −   0.10        0     0.00   0.07
  6   7  +   0.10        1     0.10   0.17
  8   5  −   0.10        0     0.00   0.07
  8   8  −   0.10        0     0.00   0.07
 sum         1.00      ε1 =   0.30   1.00
                        α1 =  0.42
                        Z1 =  0.92
Round 2 of 3
(figure: weak classifier h2 and the re-weighted distribution;
the three circled − examples are the ones h2 misclassifies)
    ε2 = 0.21
    α2 = 0.65
How has the distribution changed?
Train data                 Round 1                   Round 2
 x1  x2  y    D1      h1 miss    e     D2       h2 miss    e     D3
  1   5  +   0.10        0     0.00   0.07         0     0.00   0.05
  2   3  +   0.10        0     0.00   0.07         0     0.00   0.05
  3   2  −   0.10        0     0.00   0.07         1     0.07   0.17
  4   6  −   0.10        0     0.00   0.07         1     0.07   0.17
  4   7  +   0.10        1     0.10   0.17         0     0.00   0.11
  5   9  +   0.10        1     0.10   0.17         0     0.00   0.11
  6   5  −   0.10        0     0.00   0.07         1     0.07   0.17
  6   7  +   0.10        1     0.10   0.17         0     0.00   0.11
  8   5  −   0.10        0     0.00   0.07         0     0.00   0.05
  8   8  −   0.10        0     0.00   0.07         0     0.00   0.05
 sum         1.00      ε1 =   0.30   1.00       ε2 =   0.21   1.00
                        α1 =  0.42                α2 =  0.65
                        Z1 =  0.92                Z2 =  0.82
Round 3 of 3
(figure: weak classifier h3; circled points are the examples it
misclassifies; STOP after this round)
    ε3 = 0.14
    α3 = 0.92
How has the distribution changed?
Train data                 Round 1                   Round 2                   Round 3
 x1  x2  y    D1      h1 miss    e     D2       h2 miss    e     D3       h3 miss    e
  1   5  +   0.10        0     0.00   0.07         0     0.00   0.05         1     0.05
  2   3  +   0.10        0     0.00   0.07         0     0.00   0.05         1     0.05
  3   2  −   0.10        0     0.00   0.07         1     0.07   0.17         0     0.00
  4   6  −   0.10        0     0.00   0.07         1     0.07   0.17         0     0.00
  4   7  +   0.10        1     0.10   0.17         0     0.00   0.11         0     0.00
  5   9  +   0.10        1     0.10   0.17         0     0.00   0.11         0     0.00
  6   5  −   0.10        0     0.00   0.07         1     0.07   0.17         0     0.00
  6   7  +   0.10        1     0.10   0.17         0     0.00   0.11         0     0.00
  8   5  −   0.10        0     0.00   0.07         0     0.00   0.05         0     0.00
  8   8  −   0.10        0     0.00   0.07         0     0.00   0.05         1     0.05
 sum         1.00      ε1 =   0.30   1.00       ε2 =   0.21   1.00       ε3 =   0.14
                        α1 =  0.42                α2 =  0.65                α3 =  0.92
                        Z1 =  0.92                Z2 =  0.82

  Initialization: D1 is uniform.   The α values give the importance of each learner.
Final Hypothesis

H_final(x) = sign[ 0.42·h1(x) + 0.65·h2(x) + 0.92·h3(x) ],
where each weak classifier h_t(x) outputs +1 or −1.

(figure: the combined decision boundary correctly separates the + and − examples)
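
The numbers in this worked example follow from the standard AdaBoost updates, α_t = ½·ln((1 − ε_t)/ε_t) and D_{t+1}(i) ∝ D_t(i)·exp(±α_t). A minimal sketch that reproduces the ε, α and Z values above (the per-round misclassification pattern is taken from the tables):

```python
import numpy as np

D = np.full(10, 0.10)                   # initial uniform distribution D1
# miss[t][i] = 1 if example i is misclassified by weak learner h_{t+1} (from the tables)
miss = np.array([
    [0, 0, 0, 0, 1, 1, 0, 1, 0, 0],     # round 1
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],     # round 2
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],     # round 3
])

for t, m in enumerate(miss, start=1):
    eps = float(np.sum(D * m))                       # weighted training error of h_t
    alpha = 0.5 * np.log((1 - eps) / eps)            # importance (weight) of h_t
    D = D * np.exp(np.where(m == 1, alpha, -alpha))  # up-weight mistakes, down-weight hits
    Z = D.sum()                                      # normalization constant Z_t
    D = D / Z
    print(f"round {t}: eps={eps:.2f}  alpha={alpha:.2f}  Z={Z:.2f}")
# prints roughly 0.30/0.42/0.92, then 0.21/0.65/0.82, then 0.14/0.92/...
```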
Training Errors vs Test Errors
           Performance on the ‘letter’ dataset
               (Schapire et al. 1997)

(figure: boosting rounds vs. error, for training error and test error)

              Training error drops to 0 on round 5
           Test error continues to drop after round 5
                      (from 8.4% to 3.1%)
AdaBoost Variants Proposed by Friedman
• LogitBoost
AdaBoost Variants Proposed by Friedman
• GentleBoost
BrownBoost
• Reduces the weight given to examples that are repeatedly
  misclassified (treating them as likely noise).

• Good (only) for very noisy data.
Bagging
         Bootstrap AGGregatING
• Employs the simplest way of combining predictions that
  belong to the same type.
• Combining can be realized with voting or averaging.
• Each model receives equal weight.
• "Idealized" version of bagging (sketched below):
   – Sample several training sets of size n (instead of just
     having one training set of size n)
   – Build a classifier for each training set
   – Combine the classifiers’ predictions
• This improves performance in almost all cases if the
  learning scheme is unstable.
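
A minimal sketch of the bagging recipe, assuming decision trees as the base learner and an arbitrary synthetic dataset:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)

# Train T trees, each on a bootstrap sample (n instances drawn with replacement).
T, n = 25, len(X)
models = []
for _ in range(T):
    idx = rng.integers(0, n, size=n)                  # bootstrap indices
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine with unweighted majority voting: every model gets an equal vote.
def bagged_predict(x_row):
    votes = [m.predict(x_row.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

print(bagged_predict(X[0]), y[0])
```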
Wagging
        Weighted AGGregatING
• A variant of bagging in which each
  classifier is trained on the entire training set,
  but each instance is stochastically assigned
  a weight.
Random Forests
1. Choose T—the number of trees to grow.
2. Choose m—the number of variables used to split each node. m
   ≪ M, where M is the number of input variables. m is held
   constant while growing the forest.
3. Grow T trees. When growing each tree do the following.
   (a) Construct a bootstrap sample of size n sampled from Sn with
      replacement and grow a tree from this bootstrap sample.
   (b) When growing a tree, at each node select m variables at random
      and use them to find the best split.
   (c) Grow the tree to a maximal extent. There is no pruning.
4. To classify a point X, collect votes from every tree in the
   forest and then use majority voting to decide on the class label
   (see the usage sketch below).
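
The same recipe is available as scikit-learn's RandomForestClassifier; a minimal usage sketch (dataset and parameter values are arbitrary; `n_estimators` plays the role of T and `max_features` the role of m):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# T = 100 unpruned trees; at each node a random subset of m = sqrt(M) features
# is considered for the split, matching steps 1-3 above.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```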
Variations of Random Forests
• Random Split Selection (Dietterich, 2000)
   –   Grow multiple trees
   –   When splitting, choose the split uniformly at random from the
       K best splits
   –   Can be used with or without pruning
• Random Subspace (Ho, 1998)
   –   Grow multiple trees
   –   Each tree is grown using a fixed subset of variables
   –   Do a majority vote or averaging to combine votes from
       different trees
DECORATE
       (Melville & Mooney, 2003)
• Change training data by adding new
  artificial training examples that encourage
  diversity in the resulting ensemble.
• Improves accuracy when the training set is
  small, and therefore resampling and
  reweighting the training set has limited
  ability to generate diverse alternative
  hypotheses.
Overview of DECORATE
(figure sequence over three slides: in each iteration the training
examples are augmented with artificial examples and passed to the
base learner; the resulting classifiers C1, C2, C3 are added to the
current ensemble)
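
A much-simplified, hedged sketch of the DECORATE idea shown above (artificial examples are drawn here from per-feature Gaussians and labeled to disagree with the current ensemble's majority vote; the published algorithm also rejects candidate members that reduce ensemble accuracy, which is omitted here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def decorate_style_ensemble(X, y, n_members=10, n_artificial=50, rng=None):
    """Toy DECORATE-style loop (assumes integer class labels 0..K-1):
    each new member is trained on the real data plus artificial examples
    labeled in opposition to the current ensemble's majority vote."""
    rng = rng or np.random.default_rng(0)
    classes = np.unique(y)
    ensemble = [DecisionTreeClassifier().fit(X, y)]
    for _ in range(n_members - 1):
        # Sample artificial inputs from a crude per-feature Gaussian model of X.
        X_art = rng.normal(X.mean(axis=0), X.std(axis=0) + 1e-9,
                           size=(n_artificial, X.shape[1]))
        # Give them labels the current ensemble does NOT predict, which pushes
        # the next member to disagree with the ensemble on these points.
        votes = np.array([m.predict(X_art) for m in ensemble]).astype(int)
        majority = np.array([np.bincount(col, minlength=len(classes)).argmax()
                             for col in votes.T])
        y_art = np.array([rng.choice(classes[classes != c]) for c in majority])
        ensemble.append(DecisionTreeClassifier().fit(np.vstack([X, X_art]),
                                                     np.concatenate([y, y_art])))
    return ensemble
```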
Error-Correcting Output Codes
Ensemble Taxonomy
  (Rokach, 2009)

(figure: the dimensions along which ensemble methods are characterized)
 • Combiner
 • Diversity generator
 • Members dependency
 • Ensemble size
 • Cross-inducer
Combiner
• Weighting methods
  –   Majority Voting
  –   Performance Weighting
  –   Distribution Summation
  –   Gating Network
• Meta-Learning
  – Stacking
  – Arbiter Trees
  – Grading
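
As a concrete illustration of the weighting family, a performance-weighting combiner can be sketched as a vote in which each member's ballot counts in proportion to, say, its validation accuracy (the weighting rule shown is just one common choice):

```python
import numpy as np

def performance_weighted_vote(predictions, weights, n_classes):
    """predictions: (T, n_samples) integer class labels in 0..n_classes-1;
    weights: (T,) per-member weights, e.g. validation accuracies."""
    n_samples = predictions.shape[1]
    scores = np.zeros((n_samples, n_classes))
    for pred, w in zip(predictions, weights):
        # Each member adds its weight to the class it voted for, per sample.
        scores[np.arange(n_samples), pred] += w
    return scores.argmax(axis=1)

preds = np.array([[0, 1, 1], [0, 0, 1], [1, 0, 1]])  # three members, three samples
print(performance_weighted_vote(preds, np.array([0.9, 0.6, 0.55]), n_classes=2))  # [0 0 1]
```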
Mixtures of Experts
Stacking
• Combiner f(·) is
  another learner
  (Wolpert, 1992)
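
scikit-learn offers this as StackingClassifier, where the meta-learner is fit on cross-validated predictions of the base learners; a minimal usage sketch (the choice of base learners and meta-learner is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

base_learners = [("tree", DecisionTreeClassifier(max_depth=3)),
                 ("svm", SVC(probability=True))]
# The combiner f(.) is itself a learner, fit on cross-validated base predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(), cv=5)
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))
```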
Members Dependency
• Dependent Methods: There is an interaction
  between the learning runs (AdaBoost)
   – Model-guided Instance Selection: the classifiers that
     were constructed in previous iterations are used for
     selecting the training set in the subsequent iteration.
   – Incremental Batch Learning: In this method the
     classification produced in one iteration is given as prior
     knowledge (a new feature) to the learning algorithm in
     the subsequent iteration.
• Independent Methods (Bagging)
Cascading
• Use d_j only if the preceding classifiers are not confident.

• Cascade learners in order of complexity.
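
A minimal sketch of a two-stage cascade (hypothetical models and an arbitrary confidence threshold; any classifiers exposing predict_proba/predict would do):

```python
import numpy as np

def cascade_predict(x, simple_model, complex_model, threshold=0.9):
    """Two-stage cascade: accept the cheap model's answer only when it is
    confident, otherwise defer to the more complex (and costlier) model."""
    proba = simple_model.predict_proba(x.reshape(1, -1))[0]
    if proba.max() >= threshold:
        return int(np.argmax(proba))                         # confident: stop here
    return int(complex_model.predict(x.reshape(1, -1))[0])   # defer to next stage
```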
Diversity
• Manipulating the Inducer
• Manipulating the Training Sample
• Changing the target attribute representation
• Partitioning the search space - Each member is
  trained on a different search subspace.
• Hybridization - Diversity is obtained by using
  various base inducers or ensemble strategies.
Measuring the Diversity
• Pairwise measures calculate the average of a
  particular distance metric between all possible
  pairings of members in the ensemble, such as the
  Q-statistic or kappa-statistic.
• The non-pairwise measures either use the idea of
  entropy or calculate a correlation of each ensemble
  member with the averaged output.
Kappa-Statistic

   κ_{i,j} = (Θ1_{i,j} − Θ2_{i,j}) / (1 − Θ2_{i,j})

where Θ1_{i,j} is the proportion of instances on which the classifiers i and j agree with each
other on the training set, and Θ2_{i,j} is the probability that the two classifiers agree by
chance.
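
A direct translation of the formula, assuming Θ2 is computed from each classifier's label marginals (a minimal sketch):

```python
import numpy as np

def kappa(pred_i, pred_j, classes):
    """Pairwise kappa between two classifiers' predictions on the same set.
    kappa = 1 means total agreement (no diversity); lower values mean more diversity."""
    pred_i, pred_j = np.asarray(pred_i), np.asarray(pred_j)
    theta1 = np.mean(pred_i == pred_j)                        # observed agreement
    theta2 = sum(np.mean(pred_i == c) * np.mean(pred_j == c)  # agreement by chance
                 for c in classes)
    return (theta1 - theta2) / (1 - theta2)

print(kappa([0, 1, 1, 0, 1], [0, 1, 0, 0, 1], classes=[0, 1]))  # ~0.62
```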
How crowded should the crowd be?
       Ensemble Selection
• Why bother?
  – Desired accuracy
  – Computational cost
• Predetermine the ensemble size
• Use a certain criterion to stop training
• Pruning
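
One common pruning strategy is greedy forward selection on a validation set; a hedged sketch of the general idea (not of any specific published algorithm):

```python
import numpy as np
from collections import Counter

def forward_select(members, X_val, y_val, max_size=5):
    """Greedily grow a sub-ensemble: at each step add the member whose
    inclusion most improves majority-vote accuracy on a validation set."""
    preds = [m.predict(X_val) for m in members]
    max_size = min(max_size, len(members))
    chosen = []

    def accuracy(indices):
        votes = np.array([preds[i] for i in indices])
        majority = [Counter(votes[:, j]).most_common(1)[0][0]
                    for j in range(len(y_val))]
        return float(np.mean(np.array(majority) == np.asarray(y_val)))

    while len(chosen) < max_size:
        best = max((i for i in range(len(members)) if i not in chosen),
                   key=lambda i: accuracy(chosen + [i]))
        chosen.append(best)
    return chosen  # indices of the selected ensemble members
```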
Cross Inducer
• Inducer-dependent (like RandomForest).
• Inducer-independent (like bagging)
Multi-strategy Ensemble Learning
• Combines several ensemble strategies.
• MultiBoosting, an extension of AdaBoost formed by adding
  wagging-like features, can harness both AdaBoost's high bias
  and variance reduction and wagging's superior variance
  reduction.
• It produces decision committees with lower error than either
  AdaBoost or wagging.
Some Insights
Why use Ensembles?
•   Statistical Reasons: Out of many classifier models with similar training / test
    errors, which one shall we pick? If we just pick one at random, we risk the
    possibility of choosing a really poor one
     – Combining / averaging them may prevent us from making one such unfortunate
       decision
•   Computational Reasons: Every time we run a classification algorithm, we
    may find different local optima
     – Combining their outputs may allow us to find a solution that is closer to the global
       minimum.
•   Too little data / too much data:
     – Generating multiple classifiers with the resampling of the available data / mutually
       exclusive subsets of the available data
•   Representational Reasons: The classifier space may not contain the solution
    to a given particular problem. However, an ensemble of such classifiers may
     – For example, linear classifiers cannot solve non-linearly separable problems,
       however, their combination can.
The Diversity Paradox
There’s no real Paradox…
• Ideally, all committee members would be
  right about everything!
• If not, they should be wrong about different
  things.
No Free Lunch Theorem in Machine
    Learning (Wolpert, 2001)
• “Or to put it another way, for any two
  learning algorithms, there are just as many
  situations (appropriately weighted) in
  which algorithm one is superior to
  algorithm two as vice versa, according to
  any of the measures of ‘superiority’.”
So why develop new algorithms?
• The science of pattern recognition is mostly concerned with choosing
  the most appropriate algorithm for the problem at hand
• This requires some a priori knowledge – data distribution, prior
  probabilities, complexity of the problem, the physics of the underlying
  phenomenon, etc.
• The No Free Lunch theorem tells us that – unless we have some a
  priori knowledge – simple classifiers (or complex ones for that matter)
  are not necessarily better than others. However, given some a priori
  information, certain classifiers may better MATCH the characteristics
  of certain types of problems.
• The main challenge of the pattern recognition professional is then to
  identify the correct match between the problem and the classifier!
  …which is yet another reason to arm yourself with a diverse arsenal
  of PR techniques!
Ensemble and the
         No Free Lunch Theorem
• Ensembles combine the strengths of each
  classifier to make a super-learner.
• But … an ensemble only improves classification
  if the component classifiers perform better
  than chance
  – This cannot be guaranteed a priori
• Proven effective in many real-world
  applications
Ensemble and Optimal Bayes Rule
• Given a finite amount of data, many hypotheses are
  typically equally good. How can the learning algorithm
  select among them?
• Optimal Bayes classifier recipe: take a weighted majority
  vote of all hypotheses, weighted by their posterior
  probability.
• That is, put most weight on hypotheses consistent with the
  data.
• Hence, ensemble learning may be viewed as an
  approximation of the Optimal Bayes rule (which is
  provably the best possible classifier).
Bias and Variance Decomposition

Bias
  – The hypothesis space made available by a
    particular classification method does not
    include sufficient hypotheses

Variance
  – The hypothesis space made available is too
    large for the training data, and the selected
    hypothesis may not be accurate on unseen data
Bias and Variance: Decision Trees
• Small trees have high bias.
• Large trees have high variance. Why?

(figure: from Elder, John, "From Trees to Forests and Rule Sets - A
Unified Overview of Ensemble Methods", 2007)
For Any Model
        (Not only decision trees)
• Given a target function
• Model has many parameters
  – Generally low bias
  – Fits data well
  – Yields high variance
• Model has few parameters
  – Generally high bias
  – May not fit data well
  – The fit does not change much for different data sets
    (low variance)
Bias-Variance and Ensemble Learning
  • Bagging: There exists empirical and theoretical
    evidence that Bagging acts as variance reduction
    machine (i.e., it reduces the variance part of the
    error).
  • AdaBoost: Empirical evidence suggests that
    AdaBoost reduces both the bias and the variance
    part of the error. In particular, it seems that bias is
    mostly reduced in early iterations, while variance
    in later ones.
Illustration on Bagging
(figure: y plotted against x)
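
The variance-reduction effect can be demonstrated with a tiny synthetic simulation (a stand-in illustration, not the original figure): averaging B independently perturbed estimators shrinks the variance roughly by a factor of B while leaving the expected prediction unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, noise_sd, B, trials = 1.0, 0.5, 25, 10_000

# Each "estimator" is the true value plus independent noise, standing in for a
# model fit on a different bootstrap sample; bagging averages B of them.
single = true_value + noise_sd * rng.standard_normal(trials)
bagged = true_value + noise_sd * rng.standard_normal((trials, B)).mean(axis=1)

print("variance of a single estimator:", round(float(single.var()), 4))  # ~ noise_sd**2
print("variance of the bagged average:", round(float(bagged.var()), 4))  # ~ noise_sd**2 / B
```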
Occam's razor
• The explanation of any phenomenon should
  make as few assumptions as possible,
  eliminating those that make no difference in
  the observable predictions of the
  explanatory hypothesis or theory
Contradiction with Occam’s Razor
• Ensembles seem to contradict Occam’s Razor
  – More rounds -> more classifiers for voting ->
    a more complicated model
  – Once training error reaches 0, a more complicated
    classifier would be expected to perform worse
Two Razors (Domingos, 1999)
• First razor: Given two models with the same
  generalization error, the simpler one should be
  preferred because simplicity is desirable in itself.

• On the other hand, within KDD Occam's razor is often
  used in a quite different sense, that can be stated as:

• Second razor: Given two models with the same
  training-set error, the simpler one should be preferred
  because it is likely to have lower generalization error.
• Domingos: The first one is largely uncontroversial, while
  the second one, taken literally, is false.
Summary
• “Two heads are better than none. One
  hundred heads are so much better than
  one”
        – Dearg Doom, The Tain, Horslips, 1973
• “Great minds think alike, clever minds
  think together” – L. Zoref, 2011.

• But they must be different, specialised
• And it might be an idea to select only the
  best of them for the problem at hand
Additional Readings

More Related Content

What's hot

Notes from Coursera Deep Learning courses by Andrew Ng
Notes from Coursera Deep Learning courses by Andrew NgNotes from Coursera Deep Learning courses by Andrew Ng
Notes from Coursera Deep Learning courses by Andrew NgTess Ferrandez
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural NetworksDatabricks
 
Introduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNNIntroduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNNHye-min Ahn
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaAlexey Grigorev
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Edureka!
 
Hyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine LearningHyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine LearningFrancesco Casalegno
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithmRashid Ansari
 
random forest regression
random forest regressionrandom forest regression
random forest regressionAkhilesh Joshi
 
Machine learning with ADA Boost
Machine learning with ADA BoostMachine learning with ADA Boost
Machine learning with ADA BoostAman Patel
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningSalem-Kabbani
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science CompetitionsJeong-Yoon Lee
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning AlgorithmsDezyreAcademy
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest Rupak Roy
 
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...Simplilearn
 

What's hot (20)

Notes from Coursera Deep Learning courses by Andrew Ng
Notes from Coursera Deep Learning courses by Andrew NgNotes from Coursera Deep Learning courses by Andrew Ng
Notes from Coursera Deep Learning courses by Andrew Ng
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Ensemble methods
Ensemble methodsEnsemble methods
Ensemble methods
 
Introduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNNIntroduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNN
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
 
Hyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine LearningHyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine Learning
 
Question answering
Question answeringQuestion answering
Question answering
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 
random forest regression
random forest regressionrandom forest regression
random forest regression
 
Machine learning with ADA Boost
Machine learning with ADA BoostMachine learning with ADA Boost
Machine learning with ADA Boost
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
 
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest
 
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
 

Similar to Ensemble Learning: The Wisdom of Crowds (of Machines)

machine learning.ppt
machine learning.pptmachine learning.ppt
machine learning.pptPratik Gohel
 
introduction to machine learning 3c.pptx
introduction to machine learning 3c.pptxintroduction to machine learning 3c.pptx
introduction to machine learning 3c.pptxPratik Gohel
 
GoshawkDB: Making Time with Vector Clocks
GoshawkDB: Making Time with Vector ClocksGoshawkDB: Making Time with Vector Clocks
GoshawkDB: Making Time with Vector ClocksC4Media
 
Normal distribution slide share
Normal distribution slide shareNormal distribution slide share
Normal distribution slide shareKate FLR
 
PyCon Philippines 2012 Keynote
PyCon Philippines 2012 KeynotePyCon Philippines 2012 Keynote
PyCon Philippines 2012 KeynoteDaniel Greenfeld
 
Terminological cluster trees for Disjointness Axiom Discovery
Terminological cluster trees for Disjointness Axiom DiscoveryTerminological cluster trees for Disjointness Axiom Discovery
Terminological cluster trees for Disjointness Axiom DiscoveryGiuseppe Rizzo
 
Pipeline of Supervised learning algorithms
Pipeline of Supervised learning algorithmsPipeline of Supervised learning algorithms
Pipeline of Supervised learning algorithmsEvgeniy Marinov
 
Supervised Learning Algorithms - Analysis of different approaches
Supervised Learning Algorithms - Analysis of different approachesSupervised Learning Algorithms - Analysis of different approaches
Supervised Learning Algorithms - Analysis of different approachesPhilip Yankov
 
Making the Crowd Wiser: (Re)combination through Teaming in Crowdsourcing
Making the Crowd Wiser: (Re)combination through Teaming in CrowdsourcingMaking the Crowd Wiser: (Re)combination through Teaming in Crowdsourcing
Making the Crowd Wiser: (Re)combination through Teaming in CrowdsourcingJungpil Hahn
 
Teaching Constraint Programming, Patrick Prosser
Teaching Constraint Programming,  Patrick ProsserTeaching Constraint Programming,  Patrick Prosser
Teaching Constraint Programming, Patrick ProsserPierre Schaus
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptxImXaib
 
Growing Intelligence by Properly Storing and Mining Call Center Data
Growing Intelligence by Properly Storing and Mining Call Center DataGrowing Intelligence by Properly Storing and Mining Call Center Data
Growing Intelligence by Properly Storing and Mining Call Center DataBay Bridge Decision Technologies
 
Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...Vasco Duarte
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfnikola_tesla1
 
m2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptxm2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptxMesfinMelese4
 
support-vector-machines.ppt
support-vector-machines.pptsupport-vector-machines.ppt
support-vector-machines.pptshyedshahriar
 
Scott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SFScott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SFMLconf
 
Dimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applicationsDimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applicationsViet-Trung TRAN
 

Similar to Ensemble Learning: The Wisdom of Crowds (of Machines) (20)

machine learning.ppt
machine learning.pptmachine learning.ppt
machine learning.ppt
 
introduction to machine learning 3c.pptx
introduction to machine learning 3c.pptxintroduction to machine learning 3c.pptx
introduction to machine learning 3c.pptx
 
GoshawkDB: Making Time with Vector Clocks
GoshawkDB: Making Time with Vector ClocksGoshawkDB: Making Time with Vector Clocks
GoshawkDB: Making Time with Vector Clocks
 
Normal distribution slide share
Normal distribution slide shareNormal distribution slide share
Normal distribution slide share
 
PyCon Philippines 2012 Keynote
PyCon Philippines 2012 KeynotePyCon Philippines 2012 Keynote
PyCon Philippines 2012 Keynote
 
Terminological cluster trees for Disjointness Axiom Discovery
Terminological cluster trees for Disjointness Axiom DiscoveryTerminological cluster trees for Disjointness Axiom Discovery
Terminological cluster trees for Disjointness Axiom Discovery
 
Pipeline of Supervised learning algorithms
Pipeline of Supervised learning algorithmsPipeline of Supervised learning algorithms
Pipeline of Supervised learning algorithms
 
Supervised Learning Algorithms - Analysis of different approaches
Supervised Learning Algorithms - Analysis of different approachesSupervised Learning Algorithms - Analysis of different approaches
Supervised Learning Algorithms - Analysis of different approaches
 
Making the Crowd Wiser: (Re)combination through Teaming in Crowdsourcing
Making the Crowd Wiser: (Re)combination through Teaming in CrowdsourcingMaking the Crowd Wiser: (Re)combination through Teaming in Crowdsourcing
Making the Crowd Wiser: (Re)combination through Teaming in Crowdsourcing
 
Teaching Constraint Programming, Patrick Prosser
Teaching Constraint Programming,  Patrick ProsserTeaching Constraint Programming,  Patrick Prosser
Teaching Constraint Programming, Patrick Prosser
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptx
 
Chpater 6
Chpater 6Chpater 6
Chpater 6
 
Growing Intelligence by Properly Storing and Mining Call Center Data
Growing Intelligence by Properly Storing and Mining Call Center DataGrowing Intelligence by Properly Storing and Mining Call Center Data
Growing Intelligence by Properly Storing and Mining Call Center Data
 
Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
Decision theory & decisiontrees
Decision theory & decisiontreesDecision theory & decisiontrees
Decision theory & decisiontrees
 
m2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptxm2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptx
 
support-vector-machines.ppt
support-vector-machines.pptsupport-vector-machines.ppt
support-vector-machines.ppt
 
Scott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SFScott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SF
 
Dimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applicationsDimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applications
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Ensemble Learning: The Wisdom of Crowds (of Machines)

  • 1. Ensemble Learning: The Wisdom of Crowds (of Machines) Lior Rokach Department of Information Systems Engineering Ben-Gurion University of the Negev
  • 2. About Me Prof. Lior Rokach Department of Information Systems Engineering Faculty of Engineering Sciences Head of the Machine Learning Lab Ben-Gurion University of the Negev Email: liorrk@bgu.ac.il http://www.ise.bgu.ac.il/faculty/liorr/ PhD (2004) from Tel Aviv University
  • 3. The Condorcet Jury Theorem • If each voter has a probability p of being correct and the probability of a majority of voters being correct is M, • then p > 0.5 implies M > p. • Also M approaches 1, for all p > 0.5 as the number of voters approaches infinity. • This theorem was proposed by the Marquis of Condorcet in 1784
  • 4. Francis Galton • Galton promoted statistics and invented the concept of correlation. • In 1906 Galton visited a livestock fair and stumbled upon an intriguing contest. • An ox was on display, and the villagers were invited to guess the animal's weight. • Nearly 800 gave it a go and, not surprisingly, not one hit the exact mark: 1,198 pounds. • Astonishingly, however, the average of those 800 guesses came close - very close indeed. It was 1,197 pounds.
  • 5. The Wisdom of Crowds Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations • Under certain controlled conditions, the aggregation of information in groups, resulting in decisions that are often superior to those that can been made by any single - even experts. • Imitates our second nature to seek several opinions before making any crucial decision. We weigh the individual opinions, and combine them to reach a final decision
  • 6. Committees of Experts – ― … a medical school that has the objective that all students, given a problem, come up with an identical solution‖ • There is not much point in setting up a committee of experts from such a group - such a committee will not improve on the judgment of an individual. • Consider: – There needs to be disagreement for the committee to have the potential to be better than an individual.
  • 7. Does it always work? • Not all crowds (groups) are wise. – Example: crazed investors in a stock market bubble.
  • 8. Key Criteria • Diversity of opinion – Each person should have private information even if it's just an eccentric interpretation of the known facts. • Independence – People's opinions aren't determined by the opinions of those around them. • Decentralization – People are able to specialize and draw on local knowledge. • Aggregation – Some mechanism exists for turning private judgments into a collective decision.
  • 9. Teaser: How good are ensemble methods? Let’s look at the Netflix Prize Competition…
  • 10. Began October 2006 • Supervised learning task – Training data is a set of users and ratings (1,2,3,4,5 stars) those users have given to movies. – Construct a classifier that given a user and an unrated movie, correctly classifies that movie as either 1, 2, 3, 4, or 5 stars • $1 million prize for a 10% improvement over Netflix’s current movie recommender/classifier (MSE = 0.9514)
  • 11.
  • 12. Learning biases • Occam’s razor ―among the theories that are consistent with the data, select the simplest one‖. • Epicurus’ principle ―keep all theories that are consistent with the data,‖ [not necessarily with equal weights] E.g. Bayesian learning Ensemble learning 12
  • 13. Strong and Weak Learners • Strong (PAC) Learner – Take labeled data for training – Produce a classifier which can be arbitrarily accurate – Objective of machine learning • Weak (PAC) Learner – Take labeled data for training – Produce a classifier which is more accurate than random guessing
  • 14. Ensembles of classifiers • Given some training data Dtrain x n , yn ; n 1,, N train • Inductive learning L: Dtrain h( ), where h( ): X Y • Ensemble learning L1: Dtrain h1( ) L2: Dtrain h2( ) ... Ensemble: LT: Dtrain hT ( ) {h1( ), h2( ), ... , hT ( )} 14
  • 15. Classification by majority voting New Instance: x T=7 classifiers 1 1 1 2 1 2 1 Accumulated votes: t 5 2 0 1 2 3 0 4 1 Final class: 1 t1 t2 Alberto Suárez (2012) 15
  • 17. Boosting • Learners – Strong learners are very difficult to construct – Constructing weaker Learners is relatively easy • Strategy – Derive strong learner from weak learner – Boost weak classifiers to a strong learner
  • 18. Construct Weak Classifiers • Using Different Data Distribution – Start with uniform weighting – During each step of learning • Increase weights of the examples which are not correctly learned by the weak learner • Decrease weights of the examples which are correctly learned by the weak learner • Idea – Focus on difficult examples which are not correctly classified in the previous steps
  • 19. Combine Weak Classifiers • Weighted Voting – Construct strong classifier by weighted voting of the weak classifiers • Idea – Better weak classifier gets a larger weight – Iteratively add weak classifiers • Increase accuracy of the combined classifier through minimization of a cost function
  • 20. AdaBoost (Adaptive Boosting) (Freund and Schapire, 1997) Generate a sequence of base-learners each focusing on previous one’s errors (Freund and Schapire, 1996)
  • 22. Example Training Combined classifier
  • 23. Example of a Good Classifier + + + + +
  • 24. The initial distribution Train data x1 x2 y D1 1 5 + 0.10 2 3 + 0.10 3 2 0.10 4 6 0.10 4 7 + 0.10 5 9 + 0.10 6 5 0.10 6 7 + 0.10 8 5 0.10 8 8 0.10 1.00 Initialization
  • 25. Round 1 of 3 +O + +O +O + + + + + + h1 1 = 0.30 D2 1=0.42
  • 26. How the distribution has changed? Train data Round 1 x1 x2 y D1 h1e D2 1 5 + 0.10 0 0.00 0.07 2 3 + 0.10 0 0.00 0.07 3 2 0.10 0 0.00 0.07 4 6 0.10 0 0.00 0.07 4 7 + 0.10 1 0.10 0.17 5 9 + 0.10 1 0.10 0.17 6 5 0.10 0 0.00 0.07 6 7 + 0.10 1 0.10 0.17 8 5 0.10 0 0.00 0.07 8 8 0.10 0 0.00 0.07 1.00 0.30 1.00 0.42  Zt 0.92
  • 27. Round 2 of 3 + + +O + + + + O + + O + 2 = 0.21 h2 D2 2=0.65
  • 28. How the distribution has changed? Train data Round 1 Round 2 x1 x2 y D1 h1e D2 h2e D3 1 5 + 0.10 0 0.00 0.07 0 0.00 0.05 2 3 + 0.10 0 0.00 0.07 0 0.00 0.05 3 2 0.10 0 0.00 0.07 1 0.07 0.17 4 6 0.10 0 0.00 0.07 1 0.07 0.17 4 7 + 0.10 1 0.10 0.17 0 0.00 0.11 5 9 + 0.10 1 0.10 0.17 0 0.00 0.11 6 5 0.10 0 0.00 0.07 1 0.07 0.17 6 7 + 0.10 1 0.10 0.17 0 0.00 0.11 8 5 0.10 0 0.00 0.07 0 0.00 0.05 8 8 0.10 0 0.00 0.07 0 0.00 0.05 1.00 0.30 1.00 0.21 1.00 0.42  0.65  Zt 0.92 Zt 0.82
  • 29. Round 3 of 3 + O + + h3 + O STOP + O 3 = 0.14 3=0.92
  • 30. How the distribution has changed? Train data Round 1 Round 2 Round 3 x1 x2 y D1 h1e D2 h2e D3 h3e 1 5 + 0.10 0 0.00 0.07 0 0.00 0.05 1 0.05 2 3 + 0.10 0 0.00 0.07 0 0.00 0.05 1 0.05 3 2 0.10 0 0.00 0.07 1 0.07 0.17 0 0.00 4 6 0.10 0 0.00 0.07 1 0.07 0.17 0 0.00 4 7 + 0.10 1 0.10 0.17 0 0.00 0.11 0 0.00 5 9 + 0.10 1 0.10 0.17 0 0.00 0.11 0 0.00 6 5 0.10 0 0.00 0.07 1 0.07 0.17 0 0.00 6 7 + 0.10 1 0.10 0.17 0 0.00 0.11 0 0.00 8 5 0.10 0 0.00 0.07 0 0.00 0.05 0 0.00 8 8 0.10 0 0.00 0.07 0 0.00 0.05 1 0.05 1.00 0.30 1.00 0.21 1.00 0.14 0.42  0.65  0.92 Zt 0.92 Zt 0.82 Initialization Importance of each learner
  • 31. Final Hypothesis 0.42 + 0.65 + 0.92 Hfinal = sign[ 0.42(h1? 1|-1) + 0.65(h2? 1|-1) + 0.92(h3? 1|-1) ] + + + + +
  • 32. Training Errors vs Test Errors Performance on ‘letter’ dataset (Schapire et al. 1997) Test error Training error Training error drops to 0 on round 5 Test error continues to drop after round 5 (from 8.4% to 3.1%)
  • 33. Adaboost Variants Proposed By Friedman • LogitBoost
  • 34. Adaboost Variants Proposed By Friedman • GentleBoost
  • 35. BrownBoost • Reduce the weight given to misclassified example • Good (only) for very noisy data.
  • 36. Bagging Bootstrap AGGregatING • Employs simplest way of combining predictions that belong to the same type. • Combining can be realized with voting or averaging • Each model receives equal weight • ―Idealized‖ version of bagging: – Sample several training sets of size n (instead of just having one training set of size n) – Build a classifier for each training set – Combine the classifier’s predictions • This improves performance in almost all cases if learning scheme is unstable.
  • 37. Wagging Weighted AGGregatING • A variant of bagging in which each classifier is trained on the entire training set, but each instance is stochastically assigned a weight.
  • 38. Random Forests 1. Choose T—number of trees to grow. 2. Choose m—number of variables used to split each node. m ≪ M, where M is the number of input variables. m is hold constant while growing the forest. 3. Grow T trees. When growing each tree do the following. (a) Construct a bootstrap sample of size n sampled from Sn with replacement and grow a tree from this bootstrap sample. (b) When growing a tree at each node select m variables at random and use them to find the best split. (c) Grow the tree to a maximal extent. There is no pruning. 4. To classify point X collect votes from every tree in the forest and then use majority voting to decide on the class label.
  • 39. Variation of Random Forests • Random Split Selection (Dietterich, 2000) – Grow multiple trees – When splitting, choose split uniformly at random from – K best splits – Can be used with or without pruning • Random Subspace (Ho, 1998) – Grow multiple trees – Each tree is grown using a fixed subset of variables – Do a majority vote or averaging to combine votes from – different trees
  • 40. DECORATE (Melville & Mooney, 2003) • Change training data by adding new artificial training examples that encourage diversity in the resulting ensemble. • Improves accuracy when the training set is small, and therefore resampling and reweighting the training set has limited ability to generate diverse alternative hypotheses.
  • 41. Overview of DECORATE Current Ensemble Training Examples + - C1 - + + Base Learner + + - + - Artificial Examples
  • 42. Overview of DECORATE Current Ensemble Training Examples + - C1 - + + Base Learner C2 + - - + - - + Artificial Examples
  • 43. Overview of DECORATE Current Ensemble Training Examples + - C1 - + + Base Learner C2 - + + + C3 - Artificial Examples
  • 45. Ensemble Taxonomy (Rokach, 2009) Diversity generator Members Combiner Dependency Ensemble Cross- Ensemble Inducer size
  • 46. Combiner • Weighting methods – Majority Voting – Performance Weighting – Distribution Summation – Gating Network • Meta-Learning – Stacking – Arbiter Trees – Grading
  • 48. Stacking • Combiner f () is another learner (Wolpert, 1992)
  • 49. Members Dependency • Dependent Methods: There is an interaction between the learning runs (AdaBoost) – Model-guided Instance Selection: the classifiers that were constructed in previous iterations are used for selecting the training set in the subsequent iteration. – Incremental Batch Learning: In this method the classification produced in one iteration is given as prior knowledge (a new feature) to the learning algorithm in the subsequent iteration. • Independent Methods (Bagging)
• 50. Cascading • Use d_j only if the preceding learners are not confident • Cascade learners in order of complexity
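A small cascading sketch, assuming scikit-learn: a cheap model d1 answers when it is confident, otherwise the instance is passed on to the more complex d2. The 0.9 threshold, the model choices, and the 0..K-1 label encoding are illustrative assumptions:

```python
# Cascading sketch: defer low-confidence cases to a more complex learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
d1 = GaussianNB().fit(X, y)                              # simple, cheap learner
d2 = RandomForestClassifier(random_state=0).fit(X, y)    # more complex learner

def cascade_predict(X_new, threshold=0.9):
    proba1 = d1.predict_proba(X_new)
    out = proba1.argmax(axis=1)                  # d1's prediction (classes assumed 0..K-1)
    unsure = proba1.max(axis=1) < threshold      # defer the low-confidence cases
    if unsure.any():
        out[unsure] = d2.predict(X_new[unsure])
    return out

print(cascade_predict(X[:10]))
```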
  • 51. Diversity • Manipulating the Inducer • Manipulating the Training Sample • Changing the target attribute representation • Partitioning the search space - Each member is trained on a different search subspace. • Hybridization - Diversity is obtained by using various base inducers or ensemble strategies.
• 52. Measuring the Diversity • Pairwise measures calculate the average of a particular distance metric between all possible pairings of members in the ensemble, such as the Q-statistic or the kappa-statistic. • The non-pairwise measures either use the idea of entropy or calculate a correlation of each ensemble member with the averaged output.
• 53. Kappa-Statistic κ_{i,j} = (θ1_{i,j} − θ2_{i,j}) / (1 − θ2_{i,j}), where θ1_{i,j} is the proportion of instances on which the classifiers i and j agree with each other on the training set, and θ2_{i,j} is the probability that the two classifiers agree by chance.
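The pairwise kappa above can be computed directly from two classifiers' predictions on the same training set; a short NumPy sketch:

```python
# Pairwise kappa between two classifiers' predictions on the training set.
import numpy as np

def pairwise_kappa(pred_i, pred_j):
    pred_i, pred_j = np.asarray(pred_i), np.asarray(pred_j)
    classes = np.union1d(pred_i, pred_j)
    theta1 = np.mean(pred_i == pred_j)                        # observed agreement
    theta2 = sum(np.mean(pred_i == c) * np.mean(pred_j == c)  # agreement expected by chance
                 for c in classes)
    return (theta1 - theta2) / (1.0 - theta2)

print(pairwise_kappa([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))
```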
• 54. How crowded should the crowd be? Ensemble Selection • Why bother? – Desired accuracy – Computational cost • Predetermine the ensemble size • Use a certain criterion to stop training • Pruning
  • 55. Cross Inducer • Inducer-dependent (like RandomForest). • Inducer-independent (like bagging)
• 56. Multi-strategy Ensemble Learning • Combines several ensemble strategies. • MultiBoosting, an extension of AdaBoost formed by adding wagging-like features, can harness AdaBoost's high bias and variance reduction together with wagging's superior variance reduction. • It produces decision committees with lower error than either AdaBoost or wagging alone.
• 58. Why use Ensembles? • Statistical reasons: out of many classifier models with similar training / test errors, which one shall we pick? If we just pick one at random, we risk choosing a really poor one – Combining / averaging them may prevent us from making such an unfortunate decision • Computational reasons: every time we run a classification algorithm we may find a different local optimum – Combining their outputs may allow us to find a solution that is closer to the global minimum • Too little data / too much data: – Generate multiple classifiers by resampling the available data (too little) or by training on mutually exclusive subsets of the available data (too much) • Representational reasons: the classifier space may not contain the solution to a given problem; however, an ensemble of such classifiers may – For example, linear classifiers cannot solve non-linearly separable problems, but their combination can.
  • 60. There’s no real Paradox… • Ideally, all committee members would be right about everything! • If not, they should be wrong about different things.
• 61. No Free Lunch Theorem in Machine Learning (Wolpert, 2001) • "Or to put it another way, for any two learning algorithms, there are just as many situations (appropriately weighted) in which algorithm one is superior to algorithm two as vice versa, according to any of the measures of 'superiority'."
• 62. So why develop new algorithms? • The science of pattern recognition is mostly concerned with choosing the most appropriate algorithm for the problem at hand • This requires some a priori knowledge – data distribution, prior probabilities, complexity of the problem, the physics of the underlying phenomenon, etc. • The No Free Lunch theorem tells us that – unless we have some a priori knowledge – simple classifiers (or complex ones, for that matter) are not necessarily better than others. However, given some a priori information, certain classifiers may better MATCH the characteristics of certain types of problems. • The main challenge of the pattern recognition professional is then to identify the correct match between the problem and the classifier! …which is yet another reason to arm yourself with a diverse arsenal of PR techniques!
• 63. Ensembles and the No Free Lunch Theorem • Ensembles combine the strengths of each classifier to make a super-learner. • But … an ensemble only improves classification if the component classifiers perform better than chance – This cannot be guaranteed a priori • Proven effective in many real-world applications
• 64. Ensembles and the Optimal Bayes Rule • Given a finite amount of data, many hypotheses are typically equally good. How can the learning algorithm select among them? • Optimal Bayes classifier recipe: take a weighted majority vote of all hypotheses, weighted by their posterior probability. • That is, put most weight on the hypotheses consistent with the data. • Hence, ensemble learning may be viewed as an approximation of the Optimal Bayes rule (which is provably the best possible classifier).
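In symbols (standard formulation, added here for reference), the Optimal Bayes prediction for an input x given training data D weights every hypothesis h by its posterior probability:

```latex
\hat{y}(x) \;=\; \arg\max_{c \in \mathcal{C}} \; \sum_{h \in \mathcal{H}} P(c \mid x, h)\, P(h \mid D)
```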
  • 65. Bias and Variance Decomposition Bias – The hypothesis space made available by a particular classification method does not include sufficient hypotheses Variance – The hypothesis space made available is too large for the training data, and the selected hypothesis may not be accurate on unseen data
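For squared loss this is the familiar decomposition (standard result, stated here for reference), where f is the target function, \hat{f}_D the model fitted on training set D, and \sigma^2 the irreducible label noise:

```latex
\mathbb{E}_{D}\!\left[(y - \hat{f}_{D}(x))^{2}\right]
  = \underbrace{\left(\mathbb{E}_{D}[\hat{f}_{D}(x)] - f(x)\right)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}_{D}\!\left[\left(\hat{f}_{D}(x) - \mathbb{E}_{D}[\hat{f}_{D}(x)]\right)^{2}\right]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{noise}}
```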
  • 66. Bias and Variance Decision Trees • Small trees have high bias. • Large trees have high variance. Why? from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007.
  • 67. For Any Model (Not only decision trees) • Given a target function • Model has many parameters – Generally low bias – Fits data well – Yields high variance • Model has few parameters – Generally high bias – May not fit data well – The fit does not change much for different data sets (low variance)
• 68. Bias-Variance and Ensemble Learning • Bagging: There is empirical and theoretical evidence that Bagging acts as a variance-reduction machine (i.e., it reduces the variance part of the error). • AdaBoost: Empirical evidence suggests that AdaBoost reduces both the bias and the variance part of the error. In particular, it seems that bias is mostly reduced in early iterations, while variance is mostly reduced in later ones.
  • 70. Occam's razor • The explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory
• 71. Contradiction with Occam's Razor • Ensembles appear to contradict Occam's Razor – More rounds -> more classifiers voting -> a more complicated model – Once training error reaches 0, the more complicated classifier would be expected to perform worse (yet, as slide 32 shows, its test error can keep dropping)
  • 72. Two Razors (Domingos, 1999) • First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself. • On the other hand, within KDD Occam's razor is often used in a quite different sense, that can be stated as: • Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error. • Domingos: The first one is largely uncontroversial, while the second one, taken literally, is false.
• 73. Summary • "Two heads are better than none. One hundred heads are so much better than one" – Dearg Doom, The Tain, Horslips, 1973 • "Great minds think alike, clever minds think together" – L. Zoref, 2011 • But they must be different, specialised • And it might be an idea to select only the best of them for the problem at hand