Ensemble Learning: The Wisdom of Crowds (of Machines)

  1. 1. Ensemble Learning: The Wisdom of Crowds (of Machines) Lior Rokach, Department of Information Systems Engineering, Ben-Gurion University of the Negev
  2. 2. About Me: Prof. Lior Rokach, Department of Information Systems Engineering, Faculty of Engineering Sciences, Head of the Machine Learning Lab, Ben-Gurion University of the Negev. Email: liorrk@bgu.ac.il, http://www.ise.bgu.ac.il/faculty/liorr/. PhD (2004) from Tel Aviv University.
  3. 3. The Condorcet Jury Theorem• If each voter has a probability p of being correct and the probability of a majority of voters being correct is M,• then p > 0.5 implies M > p.• Also M approaches 1, for all p > 0.5 as the number of voters approaches infinity.• This theorem was proposed by the Marquis of Condorcet in 1784
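A quick way to see the theorem in action is to simulate it. The sketch below is our own illustration (not from the slides; function and variable names are ours): it estimates the majority accuracy M for independent voters who are each correct with probability p = 0.6.

```python
# Minimal simulation of the Condorcet Jury Theorem: with independent voters
# who are each correct with probability p > 0.5, the majority is correct with
# probability M > p, and M approaches 1 as the number of voters grows.
import numpy as np

rng = np.random.default_rng(0)

def majority_accuracy(p, n_voters, n_trials=100_000):
    """Estimate the probability that a strict majority of voters is correct."""
    votes = rng.random((n_trials, n_voters)) < p      # True = a correct vote
    return (votes.sum(axis=1) > n_voters / 2).mean()

for n in (1, 11, 101, 1001):
    print(n, majority_accuracy(p=0.6, n_voters=n))
# The estimate climbs from ~0.60 toward 1.0 as the jury grows.
```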
  4. 4. Francis Galton• Galton promoted statistics and invented the concept of correlation.• In 1906 Galton visited a livestock fair and stumbled upon an intriguing contest.• An ox was on display, and the villagers were invited to guess the animal's weight.• Nearly 800 gave it a go and, not surprisingly, not one hit the exact mark: 1,198 pounds.• Astonishingly, however, the average of those 800 guesses came close - very close indeed. It was 1,197 pounds.
  5. 5. The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations • Under certain controlled conditions, the aggregation of information in groups results in decisions that are often superior to those that could have been made by any single member, even an expert. • Imitates our second nature to seek several opinions before making any crucial decision: we weigh the individual opinions and combine them to reach a final decision.
  6. 6. Committees of Experts – "… a medical school that has the objective that all students, given a problem, come up with an identical solution"• There is not much point in setting up a committee of experts from such a group - such a committee will not improve on the judgment of an individual.• Consider: – There needs to be disagreement for the committee to have the potential to be better than an individual.
  7. 7. Does it always work?• Not all crowds (groups) are wise. – Example: crazed investors in a stock market bubble.
  8. 8. Key Criteria• Diversity of opinion – Each person should have private information, even if it's just an eccentric interpretation of the known facts.• Independence – People's opinions aren't determined by the opinions of those around them.• Decentralization – People are able to specialize and draw on local knowledge.• Aggregation – Some mechanism exists for turning private judgments into a collective decision.
  9. 9. Teaser: How good are ensemble methods? Let’s look at the Netflix Prize Competition…
  10. 10. Began October 2006• Supervised learning task – Training data is a set of users and ratings (1, 2, 3, 4, 5 stars) those users have given to movies. – Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as either 1, 2, 3, 4, or 5 stars• $1 million prize for a 10% improvement over Netflix's current movie recommender/classifier (RMSE = 0.9514)
  11. 11. Learning biases• Occam's razor: "among the theories that are consistent with the data, select the simplest one".• Epicurus' principle: "keep all theories that are consistent with the data" [not necessarily with equal weights], e.g. Bayesian learning, ensemble learning
  12. 12. Strong and Weak Learners• Strong (PAC) Learner – Take labeled data for training – Produce a classifier which can be arbitrarily accurate – Objective of machine learning• Weak (PAC) Learner – Take labeled data for training – Produce a classifier which is more accurate than random guessing
  13. 13. Ensembles of classifiers• Given some training data D_train = {(x_n, y_n) : n = 1, ..., N_train}• Inductive learning: L: D_train → h(·), where h(·): X → Y• Ensemble learning: L_1: D_train → h_1(·), L_2: D_train → h_2(·), ..., L_T: D_train → h_T(·); Ensemble: {h_1(·), h_2(·), ..., h_T(·)}
  14. 14. Classification by majority voting (figure, Alberto Suárez, 2012): a new instance x is classified by T = 7 base classifiers whose votes are 1, 1, 1, 2, 1, 2, 1; the accumulated votes are 5 for class 1 and 2 for class 2, so the final class is 1.
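For concreteness, the same tally can be written in a few lines (our own sketch; the Counter-based tallying is not from the slides):

```python
# Unweighted majority voting over the T = 7 base-classifier outputs above.
from collections import Counter

predictions = [1, 1, 1, 2, 1, 2, 1]        # h_1(x), ..., h_7(x)
votes = Counter(predictions)                # Counter({1: 5, 2: 2})
final_class = votes.most_common(1)[0][0]    # class with the most votes
print(votes, "->", final_class)             # -> 1
```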
  15. 15. Popular Ensemble Methods
  16. 16. Boosting• Learners – Strong learners are very difficult to construct – Constructing weak learners is relatively easy• Strategy – Derive a strong learner from weak learners – Boost weak classifiers into a strong learner
  17. 17. Construct Weak Classifiers• Using Different Data Distribution – Start with uniform weighting – During each step of learning • Increase weights of the examples which are not correctly learned by the weak learner • Decrease weights of the examples which are correctly learned by the weak learner• Idea – Focus on difficult examples which are not correctly classified in the previous steps
  18. 18. Combine Weak Classifiers• Weighted Voting – Construct strong classifier by weighted voting of the weak classifiers• Idea – Better weak classifier gets a larger weight – Iteratively add weak classifiers • Increase accuracy of the combined classifier through minimization of a cost function
  19. 19. AdaBoost (Adaptive Boosting) (Freund and Schapire, 1997): generate a sequence of base learners, each focusing on the previous one's errors (Freund and Schapire, 1996)
  20. 20. AdaBoost
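The weight-update and weighted-vote steps described on the previous slides can be written as a minimal sketch (our own illustration using scikit-learn decision stumps; the function names are ours, and labels are assumed to be in {-1, +1}):

```python
# Minimal discrete AdaBoost: reweight the training distribution after each
# round so that the next weak learner focuses on the current mistakes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """y must take values in {-1, +1}. Returns weak learners and their alphas."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                        # uniform initial distribution D_1
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)    # a decision stump as weak learner
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()                   # weighted error epsilon_t
        if eps == 0 or eps >= 0.5:                 # no usable weak learner: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # learner weight alpha_t
        D *= np.exp(-alpha * y * pred)             # up-weight errors, down-weight hits
        D /= D.sum()                               # normalize (divide by Z_t)
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    scores = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(scores)                         # weighted majority vote
```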
  21. 21. Example (figure: the training data and the combined classifier)
  22. 22. Example of a Good Classifier (figure: positive and negative training points correctly separated)
  23. 23. The initial distribution. Train data, with D1 initialized uniformly:
      x1  x2  y    D1
       1   5  +   0.10
       2   3  +   0.10
       3   2  -   0.10
       4   6  -   0.10
       4   7  +   0.10
       5   9  +   0.10
       6   5  -   0.10
       6   7  +   0.10
       8   5  -   0.10
       8   8  -   0.10
      Sum:        1.00
  24. 24. Round 1 of 3 (figure): weak classifier h1 misclassifies three of the positive points; ε1 = 0.30, α1 = 0.42, giving the reweighted distribution D2.
  25. 25. How has the distribution changed? Round 1 (h1 = 1 marks a misclassified instance, e = its weight under D1):
      x1  x2  y    D1   h1   e     D2
       1   5  +   0.10   0  0.00  0.07
       2   3  +   0.10   0  0.00  0.07
       3   2  -   0.10   0  0.00  0.07
       4   6  -   0.10   0  0.00  0.07
       4   7  +   0.10   1  0.10  0.17
       5   9  +   0.10   1  0.10  0.17
       6   5  -   0.10   0  0.00  0.07
       6   7  +   0.10   1  0.10  0.17
       8   5  -   0.10   0  0.00  0.07
       8   8  -   0.10   0  0.00  0.07
      Sum:        1.00      0.30  1.00   (ε1 = 0.30, α1 = 0.42, Z1 = 0.92)
  26. 26. Round 2 of 3 (figure): weak classifier h2 misclassifies three of the negative points; ε2 = 0.21, α2 = 0.65.
  27. 27. How has the distribution changed? Rounds 1 and 2:
      x1  x2  y    D1   h1   e     D2   h2   e     D3
       1   5  +   0.10   0  0.00  0.07   0  0.00  0.05
       2   3  +   0.10   0  0.00  0.07   0  0.00  0.05
       3   2  -   0.10   0  0.00  0.07   1  0.07  0.17
       4   6  -   0.10   0  0.00  0.07   1  0.07  0.17
       4   7  +   0.10   1  0.10  0.17   0  0.00  0.11
       5   9  +   0.10   1  0.10  0.17   0  0.00  0.11
       6   5  -   0.10   0  0.00  0.07   1  0.07  0.17
       6   7  +   0.10   1  0.10  0.17   0  0.00  0.11
       8   5  -   0.10   0  0.00  0.07   0  0.00  0.05
       8   8  -   0.10   0  0.00  0.07   0  0.00  0.05
      Sum:        1.00      0.30  1.00      0.21  1.00   (α1 = 0.42, α2 = 0.65; Z1 = 0.92, Z2 = 0.82)
  28. 28. Round 3 of 3 (figure): weak classifier h3 misclassifies three points; ε3 = 0.14, α3 = 0.92. STOP.
  29. 29. How has the distribution changed? All three rounds:
      x1  x2  y    D1   h1   e     D2   h2   e     D3   h3   e
       1   5  +   0.10   0  0.00  0.07   0  0.00  0.05   1  0.05
       2   3  +   0.10   0  0.00  0.07   0  0.00  0.05   1  0.05
       3   2  -   0.10   0  0.00  0.07   1  0.07  0.17   0  0.00
       4   6  -   0.10   0  0.00  0.07   1  0.07  0.17   0  0.00
       4   7  +   0.10   1  0.10  0.17   0  0.00  0.11   0  0.00
       5   9  +   0.10   1  0.10  0.17   0  0.00  0.11   0  0.00
       6   5  -   0.10   0  0.00  0.07   1  0.07  0.17   0  0.00
       6   7  +   0.10   1  0.10  0.17   0  0.00  0.11   0  0.00
       8   5  -   0.10   0  0.00  0.07   0  0.00  0.05   0  0.00
       8   8  -   0.10   0  0.00  0.07   0  0.00  0.05   1  0.05
      Sum:        1.00      0.30  1.00      0.21  1.00      0.14
      Importance of each learner: α1 = 0.42, α2 = 0.65, α3 = 0.92 (Z1 = 0.92, Z2 = 0.82)
  30. 30. Final Hypothesis: H_final(x) = sign[ 0.42·h1(x) + 0.65·h2(x) + 0.92·h3(x) ], where each h_t(x) ∈ {+1, −1} (figure: the resulting combined decision boundary).
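A quick numeric check of this weighted vote (our own snippet; the vote vector is one illustrative combination): because no single α exceeds the sum of the other two, any two of the three weak learners can outvote the third.

```python
# Weighted majority vote with the alphas from the worked example.
import numpy as np

alphas = np.array([0.42, 0.65, 0.92])
h = np.array([+1, -1, +1])               # example votes h_1(x), h_2(x), h_3(x)
print(np.sign(alphas @ h))               # sign(0.42 - 0.65 + 0.92) = +1.0
```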
  31. 31. Training Errors vs. Test Errors (figure: performance on the 'letter' dataset, Schapire et al., 1997). Training error drops to 0 on round 5; test error continues to drop after round 5 (from 8.4% to 3.1%).
  32. 32. AdaBoost Variants Proposed by Friedman• LogitBoost
  33. 33. AdaBoost Variants Proposed by Friedman• GentleBoost
  34. 34. BrownBoost• Reduces the weight given to misclassified examples• Good (only) for very noisy data.
  35. 35. Bagging (Bootstrap AGGregatING)• Employs the simplest way of combining predictions that belong to the same type.• Combining can be realized with voting or averaging• Each model receives equal weight• "Idealized" version of bagging: – Sample several training sets of size n (instead of just having one training set of size n) – Build a classifier for each training set – Combine the classifiers' predictions• This improves performance in almost all cases if the learning scheme is unstable.
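A minimal sketch of this recipe (our own illustration, using scikit-learn decision trees and assuming integer class labels; scikit-learn's BaggingClassifier packages the same idea):

```python
# Bagging: train each model on a bootstrap sample and combine by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, random_state=0):
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(random_state)
    n = len(y)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample of size n
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(X, models):
    votes = np.array([m.predict(X) for m in models])     # shape (T, n_samples)
    # equal-weight majority vote per sample (assumes integer class labels)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```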
  36. 36. Wagging Weighted AGGregatING• A variant of bagging in which each classifier is trained on the entire training set, but each instance is stochastically assigned a weight.
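A minimal wagging sketch in the same spirit (our own illustration; the exponential weight distribution is an assumption made for the example, since published variants use, e.g., Gaussian-perturbed or Poisson weights):

```python
# Wagging: every model sees the full training set, but with stochastic
# per-instance weights instead of bootstrap resampling.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def wagging_fit(X, y, T=25, random_state=0):
    rng = np.random.default_rng(random_state)
    models = []
    for _ in range(T):
        w = rng.exponential(scale=1.0, size=len(y))      # random instance weights
        models.append(DecisionTreeClassifier().fit(X, y, sample_weight=w))
    return models
```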
  37. 37. Random Forests 1. Choose T—the number of trees to grow. 2. Choose m—the number of variables used to split each node; m ≪ M, where M is the number of input variables. m is held constant while growing the forest. 3. Grow T trees. When growing each tree do the following: (a) Construct a bootstrap sample of size n sampled from Sn with replacement and grow a tree from this bootstrap sample. (b) When growing a tree, at each node select m variables at random and use them to find the best split. (c) Grow the tree to a maximal extent. There is no pruning. 4. To classify a point X, collect votes from every tree in the forest and then use majority voting to decide on the class label.
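The same recipe is available off the shelf; a brief scikit-learn illustration (the iris data is just a placeholder dataset for the example):

```python
# Random forest via scikit-learn: n_estimators plays the role of T and
# max_features the role of m (the random variable subset tried at each split).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,      # T: number of trees
    max_features="sqrt",   # m << M: variables considered at each split
    bootstrap=True,        # step 3(a): bootstrap sample per tree
    max_depth=None,        # step 3(c): grow to maximal extent, no pruning
).fit(X, y)
print(forest.predict(X[:3]))   # step 4: majority vote across the forest
```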
  38. 38. Variations of Random Forests• Random Split Selection (Dietterich, 2000) – Grow multiple trees – When splitting, choose the split uniformly at random from the K best splits – Can be used with or without pruning• Random Subspace (Ho, 1998) – Grow multiple trees – Each tree is grown using a fixed subset of variables – Do a majority vote or averaging to combine votes from the different trees
  39. 39. DECORATE (Melville & Mooney, 2003)• Change training data by adding new artificial training examples that encourage diversity in the resulting ensemble.• Improves accuracy when the training set is small, and therefore resampling and reweighting the training set has limited ability to generate diverse alternative hypotheses.
  40. 40. Overview of DECORATE (figure: the training examples plus artificial examples are given to the base learner, which produces C1 as the first member of the current ensemble)
  41. 41. Overview of DECORATE (figure: further artificial examples are generated and the base learner adds C2 to the current ensemble)
  42. 42. Overview of DECORATE (figure: the process repeats and the base learner adds C3 to the current ensemble)
  43. 43. Error-Correcting Output Codes
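Error-correcting output codes turn a multi-class problem into several binary ones: each class is assigned a codeword, one binary classifier is trained per code bit, and a prediction is decoded to the nearest codeword. A standard scikit-learn illustration (our own example, not the slide's code; the dataset and base learner are placeholders):

```python
# ECOC with scikit-learn: code_size controls the codeword length relative to
# the number of classes.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2.0,      # codeword length = 2 * n_classes
                            random_state=0).fit(X, y)
print(ecoc.predict(X[:3]))
```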
  44. 44. Ensemble Taxonomy (Rokach, 2009) (figure): the dimensions are the Combiner, Members Dependency, Diversity Generator, Ensemble Size, and Cross-Inducer.
  45. 45. Combiner• Weighting methods – Majority Voting – Performance Weighting – Distribution Summation – Gating Network• Meta-Learning – Stacking – Arbiter Trees – Grading
  46. 46. Mixtures of Experts
  47. 47. Stacking• Combiner f () is another learner (Wolpert, 1992)
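A brief stacking illustration with scikit-learn (our own choice of base learners and dataset; StackingClassifier trains the combiner on out-of-fold predictions of the base models):

```python
# Stacking: the combiner f() is itself a learner, trained on the base
# learners' cross-validated predictions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50)),
                ("svm", SVC())],
    final_estimator=LogisticRegression(max_iter=1000),  # the combiner f()
    cv=5,                        # out-of-fold predictions for the meta-learner
).fit(X, y)
print(stack.predict(X[:3]))
```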
  48. 48. Members Dependency• Dependent Methods: There is an interaction between the learning runs (AdaBoost) – Model-guided Instance Selection: the classifiers that were constructed in previous iterations are used for selecting the training set in the subsequent iteration. – Incremental Batch Learning: In this method the classification produced in one iteration is given as prior knowledge (a new feature) to the learning algorithm in the subsequent iteration.• Independent Methods (Bagging)
  49. 49. Cascading: use d_j only if the preceding learners are not confident. Cascade learners in order of complexity.
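A minimal cascading sketch (our own illustration; the 0.9 confidence threshold, the model choices, and the dataset are assumptions):

```python
# Cascading: try the cheap learner first; fall back to the more complex one
# only where the cheap learner's confidence is below a threshold.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
d1 = LogisticRegression(max_iter=1000).fit(X, y)            # simple, used first
d2 = RandomForestClassifier(n_estimators=200).fit(X, y)     # complex, used on demand

confidence = d1.predict_proba(X).max(axis=1)                # d1's confidence per instance
pred = np.where(confidence >= 0.9, d1.predict(X), d2.predict(X))
print((confidence < 0.9).sum(), "instances deferred to the complex learner")
```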
  50. 50. Diversity• Manipulating the Inducer• Manipulating the Training Sample• Changing the target attribute representation• Partitioning the search space - Each member is trained on a different search subspace.• Hybridization - Diversity is obtained by using various base inducers or ensemble strategies.
  51. 51. Measuring the Diversity• Pairwise measures calculate the average of a particular distance metric between all possible pairings of members in the ensemble, such as the Q-statistic or the kappa-statistic.• The non-pairwise measures either use the idea of entropy or calculate a correlation of each ensemble member with the averaged output.
  52. 52. Kappa-Statistic: κ_{i,j} = (θ1_{i,j} − θ2_{i,j}) / (1 − θ2_{i,j}), where θ1_{i,j} is the proportion of instances on which the classifiers i and j agree with each other on the training set, and θ2_{i,j} is the probability that the two classifiers agree by chance.
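In code, the pairwise kappa for two members' predictions can be computed directly from this definition (our own sketch; function and variable names are ours):

```python
# Pairwise kappa: (observed agreement - chance agreement) / (1 - chance agreement).
# kappa = 1 means identical predictions; low kappa means the pair is diverse.
import numpy as np

def kappa(pred_i, pred_j):
    pred_i, pred_j = np.asarray(pred_i), np.asarray(pred_j)
    theta1 = np.mean(pred_i == pred_j)                        # observed agreement
    classes = np.union1d(pred_i, pred_j)
    theta2 = sum(np.mean(pred_i == c) * np.mean(pred_j == c)  # chance agreement
                 for c in classes)
    return (theta1 - theta2) / (1 - theta2)

print(kappa([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))   # ~0.17: the pair is fairly diverse
```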
  53. 53. How crowded should the crowd be? Ensemble Selection• Why bother? – Desired accuracy – Computational cost• Predetermine the ensemble size• Use a certain criterion to stop training• Pruning
  54. 54. Cross Inducer• Inducer-dependent (like RandomForest).• Inducer-independent (like bagging)
  55. 55. Multi-strategy Ensemble Learning• Combines several ensemble strategies.• MultiBoosting, an extension to AdaBoost expressed by adding wagging-like features, can harness both AdaBoost's high bias and variance reduction and wagging's superior variance reduction.• It produces decision committees with lower error than either AdaBoost or wagging.
  56. 56. Some Insights
  57. 57. Why use Ensembles?• Statistical Reasons: Out of many classifier models with similar training / test errors, which one shall we pick? If we just pick one at random, we risk the possibility of choosing a really poor one – Combining / averaging them may prevent us from making one such unfortunate decision• Computational Reasons: Every time we run a classification algorithm, we may find different local optima – Combining their outputs may allow us to find a solution that is closer to the global minimum.• Too little data / too much data: – Generate multiple classifiers by resampling the available data / by using mutually exclusive subsets of the available data• Representational Reasons: The classifier space may not contain the solution to a given particular problem; however, an ensemble of such classifiers may – For example, linear classifiers cannot solve non-linearly separable problems, however, their combination can.
  58. 58. The Diversity Paradox
  59. 59. There’s no real Paradox…• Ideally, all committee members would be right about everything!• If not, they should be wrong about different things.
  60. 60. No Free Lunch Theorem in Machine Learning (Wolpert, 2001)• "Or to put it another way, for any two learning algorithms, there are just as many situations (appropriately weighted) in which algorithm one is superior to algorithm two as vice versa, according to any of the measures of 'superiority'."
  61. 61. So why develop new algorithms?• The science of pattern recognition is mostly concerned with choosing the most appropriate algorithm for the problem at hand• This requires some a priori knowledge – data distribution, prior probabilities, complexity of the problem, the physics of the underlying phenomenon, etc.• The No Free Lunch theorem tells us that – unless we have some a priori knowledge – simple classifiers (or complex ones for that matter) are not necessarily better than others. However, given some a priori information, certain classifiers may better MATCH the characteristics of certain types of problems.• The main challenge of the pattern recognition professional is then to identify the correct match between the problem and the classifier! …which is yet another reason to arm yourself with a diverse PR arsenal!
  62. 62. Ensembles and the No Free Lunch Theorem• Ensembles combine the strengths of each classifier to make a super-learner.• But … an ensemble only improves classification if the component classifiers perform better than chance – This cannot be guaranteed a priori• Proven effective in many real-world applications
  63. 63. Ensembles and the Optimal Bayes Rule• Given a finite amount of data, many hypotheses are typically equally good. How can the learning algorithm select among them?• Optimal Bayes classifier recipe: take a weighted majority vote of all hypotheses, weighted by their posterior probability.• That is, put most weight on hypotheses consistent with the data.• Hence, ensemble learning may be viewed as an approximation of the Optimal Bayes rule (which is provably the best possible classifier).
  64. 64. Bias and Variance Decomposition• Bias – The hypothesis space made available by a particular classification method does not include sufficient hypotheses• Variance – The hypothesis space made available is too large for the training data, and the selected hypothesis may not be accurate on unseen data
  65. 65. Bias and Variance: Decision Trees• Small trees have high bias.• Large trees have high variance. Why? (from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007.)
  66. 66. For Any Model (Not only decision trees)• Given a target function• Model has many parameters – Generally low bias – Fits data well – Yields high variance• Model has few parameters – Generally high bias – May not fit data well – The fit does not change much for different data sets (low variance)
  67. 67. Bias-Variance and Ensemble Learning • Bagging: There exists empirical and theoretical evidence that Bagging acts as variance reduction machine (i.e., it reduces the variance part of the error). • AdaBoost: Empirical evidence suggests that AdaBoost reduces both the bias and the variance part of the error. In particular, it seems that bias is mostly reduced in early iterations, while variance in later ones.
  68. 68. Illustration on Bagging (figure: y plotted against x)
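A small experiment in the same spirit (our own sketch; the noisy sine-curve data, the models, and the parameters are all illustrative assumptions):

```python
# Bagging as a variance-reduction machine: bagged regression trees give a
# smoother fit of a noisy curve than a single fully grown tree.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)       # noisy target

single = DecisionTreeRegressor().fit(X, y)                   # high-variance fit
bagged = BaggingRegressor(DecisionTreeRegressor(),
                          n_estimators=100).fit(X, y)

X_test = np.linspace(0, 6, 200).reshape(-1, 1)
true = np.sin(X_test).ravel()
print("single tree MSE:", np.mean((single.predict(X_test) - true) ** 2))
print("bagged trees MSE:", np.mean((bagged.predict(X_test) - true) ** 2))
# The bagged prediction typically tracks sin(x) more closely.
```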
  69. 69. Occam's razor• The explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory
  70. 70. Contradiction with Occam's Razor• Ensembles contradict Occam's Razor – More rounds -> more classifiers for voting -> a more complicated model – With zero training error, a more complicated classifier may perform worse
  71. 71. Two Razors (Domingos, 1999)• First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself.• On the other hand, within KDD Occam's razor is often used in a quite different sense, which can be stated as:• Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error.• Domingos: The first one is largely uncontroversial, while the second one, taken literally, is false.
  72. 72. Summary• "Two heads are better than none. One hundred heads are so much better than one" – Dearg Doom, The Tain, Horslips, 1973• "Great minds think alike, clever minds think together" – L. Zoref, 2011.• But they must be different, specialised• And it might be an idea to select only the best of them for the problem at hand
  73. 73. Additional Readings
