Lecture 6: Ensemble Methods


Published on

What is an "ensemble learner"? How can we combine different base learners into an ensemble in order to improve the overall classification performance? In this lecture, we are providing some answers to these questions.

Published in: Education, Technology

Lecture 6: Ensemble Methods

  1. 1. Lecture 6: Ensemble Methods October 2013 Machine Learning for Language Technology Marina Santini, Uppsala University Department of Linguistics and Philology
  2. 2. Where we are… Lecture 6: Ensemble Methods2  Previous lectures, various different learning methods:  Decision trees  Nearest neighbors  Linear classifiers  Structured Prediction  This lecture:  How to combine classifiers
  3. 3. Combining Multiple Learners Lecture 6: Ensemble Methods3 Thanks to E. Alpaydin and Oscar Täckström
  4. 4. Wisdom of the Crowd Lecture 6: Ensemble Methods4  Guess the weight of an ox  Average of people's votes close to true weight  Better than most individual members' votes and cattle experts' votes  Intuitively, the law of large numbers…
  5. 5. Definition Lecture 6: Ensemble Methods5  An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples (Dietterich, 2000)
  6. 6. Diversity vs accuracy Lecture 6: Ensemble Methods6  An ensemble of classifiers must be more accurate than any of its individual members.  The indivudual classifiers composing an ensemble must be accurate and diverse:  An accurate classifier is one that has an error rate better than random when guessing new examples  Two classifiers are diverse if they make different errors on new data points.
  7. 7. Why it can be a good idea to build an ensemble Lecture 6: Ensemble Methods7  It is possible to build good ensembles for three fundamental reasons. (Dietterich , 2000): 1. Statistical: if little data 2. Computational: enough data, but local optima produced by local search 3. Representational: when the true function f cannot be represeted by any of the hypothesis in H (weighted sums of hypotheses drawn from H might expand the space
  8. 8. Distinctions Lecture 6: Ensemble Methods8  Base learner  Arbitrary learning algorithm which could be used on its own  Ensemble  A learning algorithm composed of a set of base learners. The base learners may be organized in some structure  However, not completely clear cut  E.g. a linear classifier is a combination of multiple simple learners, in the sense that each dimension is in a simple predictor…
  9. 9. The main purpose of an ensemble: maximising individual accuracy and diversity Lecture 6: Ensemble Methods9  Different learners use different  Algorithms  Hyperparameters  Representations /Modalities/Views  Training sets  Subproblems
  10. 10. Practical Example Lecture 6: Ensemble Methods10
  11. 11. Rationale Lecture 6: Ensemble Methods11  No Free Lunch Theorem: There is no algorithm that is always the most accurate in all situations.  Generate a group of base-learners which when combined has higher accuracy.
  12. 12. Methods for Constructing Ensembles Lecture 6: Ensemble Methods12
  13. 13. Approaches… Lecture 6: Ensemble Methods13  How do we generate base-learners that complement each other?  How do we combine the outputs of base learner for maximum accuracy?  Examples: Voting Boostrap Resampling Bagging Boosting AdaBoost Stacking Cascading
  14. 14. Voting  Linear combination Lecture 6: Ensemble Methods 14
  15. 15. Fixed Combination Rules Lecture 6: Ensemble Methods15
  16. 16. Boostrap Resampling Lecture 6: Ensemble Methods16  Daume’ (2012): 150
  17. 17. Bagging (bootstrap+aggregating) Lecture 6: Ensemble Methods17  Use bootstrapping to generate L training sets  Train L base learners using an unstable learning procedure  During test, take the avarage  In bagging, generating complementary base-learners is left to chance and to the instability of the learning method. **Unstable algorithm: when small change in the training set causes a large differnce in the base learners.
  18. 18. Boosting: Weak learner vs Strong learner Lecture 6: Ensemble Methods18  In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners.  The original boosting algorithm (Schapire 1990) combines three weak learners to generate a strong learner.  A weak learner has error probability less than 1/2, which makes it better than random guessing on a two-class problem  A strong learner has arbitrarily small error probability.
  19. 19. Boosting (ii) [Alpaydin, 2012: 431] Lecture 6: Ensemble Methods19  Given a large training set, we randomly divide it into three.  We use X1 and train d1.  We then take X2 and feed it to d1. We take all instances misclassified by d1 and also as many instances on which d1 is correct from X2, and these together form the training set of d2.  We then take X3 and feed it to d1 and d2. The instances on which d1 and d2 disagree form the training set of d3.  During testing, given an instance, we give it to d1 and d2; if they agree, that is the response, otherwise the response of d3 is taken as the output.
  20. 20. Boosting: drawback Lecture 6: Ensemble Methods20  Though it is quite successful, the disadvantage of the original boosting method is that it requires a very large training sample.
  21. 21. Adaboost (adaptive boosting) Lecture 6: Ensemble Methods21  Use the same training set over and over and thus need not to be large.  Classifiers must be simple so they do not overfit.  Can combine an arbitrary number of base learners, not only three.
  22. 22. AdaBoost Lecture 6: Ensemble Methods22 Generate a sequence of base- learners each focusing on previous one’s errors. The porbability of a correctly classified instance is decreased, and the probability of a missclassified instance increases. This has the effect that the next classifier focuses more on instances missclassified by the previous classifier. [Alpaydin, 2012: 432-433]
  23. 23. Adaboost: Testing Lecture 6: Ensemble Methods23  Given an instance, all the classifiers decide and a weighted vote is taken.  The weights are proportional to the base learners’ accuracies on the training set.   improved accuracy  The success of Adaboost is due to its property of increasing the margin. If the margin increases, the training istances are better separated and errors are less likely. (This aim is similar to SVMs)
  24. 24. Stacking (i) Lecture 6: Ensemble Methods24  In stacked generalization, the combiner f( ) is another learner and is not restricted to being a linear combination as in voting.
  25. 25. Stacking (ii) Lecture 6: Ensemble Methods25  The combiner system should learn how the base learners make errors.  Stacking is a means of estimating and correcting for the biases of the base-learners.  Therefore, the combiner should be trained on data unused in training the base-learners
  26. 26. Cascading Lecture 6: Ensemble Methods26 Use dj only if preceding ones are not confident Cascade learners in order of complexity
  27. 27. Cascading Lecture 6: Ensemble Methods27  Cascading is a multistage method, and we use dj only if all preceding learners are not confident.  Associated with each learner is a confidence wj such that we say dj is confident of its output and can be used if wj > θj (the threshold).  Confident: misclassifications as well as the instances for which the posterior is not high enough.  Important: The idea is that an early simple classifier handles the majority of instances, and a more complex classifier is used only for a small percentage, so does not significantly increase the overall complexity.
  28. 28. Summary Lecture 6: Ensemble Methods28  It is often a good idea to combine several learning methods  We want diverse classifiers, so their errors cancel out  However, remember, ensemble methods do not get free lunch…
  29. 29. Example Lecture 6: Ensemble Methods29  in the case of arc-factored graph-based parsing, we relied on spanning tree over a dense graph over the input.  a dense graph is a graph that contains all possible arcs (wordforms)  a spanning tree is a tree that has an incoming arc for each word.
  30. 30. Example: Ensemble MST Dependency Parsing Lecture 6: Ensemble Methods30
  31. 31. Conclusions Lecture 6: Ensemble Methods31  Combining multiple learners has been a popular topic in machine learning since the early 1990s, and research has been going on ever since.  Recently, it has been noticed that ensembles do not always improve accuracy and research has started to focus on the criteria that a good ensemble should satisfy or how to form a good one.
  32. 32. Reading Lecture 6: Ensemble Methods32  Dietterich (2000)  Alpaydin (2010): Ch. 17  Daumé (2012): Ch. 11
  33. 33. Thanx for your attention! Lecture 6: Ensemble Methods33