2013-1 Machine Learning Lecture 06 - Lucila Ohno-Machado - Ensemble Methods

  1. Ensemble Methods: Bagging and Boosting
     Dae-Ki Kang (originally by Lucila Ohno-Machado)
  2. Topics
     • Combining classifiers: ensembles (vs. mixture of experts)
     • Bagging
     • Boosting
       ▫ Ada-Boosting
       ▫ Arcing
     • Stacked Generalization
  3. Combining classifiers
     • Examples: classification trees and neural networks, several neural networks, several classification trees, etc.
     • Average results from different models
     • Why?
       ▫ Better classification performance than individual classifiers
       ▫ More resilience to noise
     • Why not?
       ▫ Time consuming
       ▫ Overfitting
  4. Bagging
     • Breiman, 1996
       ▫ Algorithm on the next slide (taken from Dr. Martin Maechler and Prof. Peter Bühlmann's lecture, http://stat.ethz.ch/education/semesters/FS_2008/CompStat)
     • Derived from the bootstrap (Efron, 1993)
     • Create classifiers using training sets that are bootstrapped (drawn with replacement)
     • Average results for each case
  5. Bagging Example (Opitz, 1999)
     Original         1 2 3 4 5 6 7 8
     Training set 1   2 7 8 3 7 6 3 1
     Training set 2   7 8 5 6 4 2 7 1
     Training set 3   3 6 2 7 5 6 2 2
     Training set 4   4 5 1 4 6 4 3 8
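Below is a minimal sketch of the bagging procedure described on the two slides above, with the bootstrap resampling written out by hand. The data set, base learner (a scikit-learn decision tree), and ensemble size are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data set; the slides do not prescribe one.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_estimators = 25                          # ensemble size: arbitrary choice
models = []
for _ in range(n_estimators):
    # Bootstrap sample: draw n cases *with replacement* from the training set
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    models.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

# "Average results for each case": here, a majority vote over the ensemble
votes = np.stack([m.predict(X_te) for m in models])   # shape (n_estimators, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged accuracy:", (y_pred == y_te).mean())
```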
  6. Boosting
     • A family of methods
     • Sequential production of classifiers
     • Each classifier depends on the previous one and focuses on the previous one's errors
     • Examples that are incorrectly predicted by previous classifiers are chosen more often or weighted more heavily
  7. Ada-Boosting
     • Freund and Schapire, 1996
       ▫ Algorithm on the next slide
     • Two approaches
       ▫ Select examples according to error in the previous classifier (more representatives of misclassified cases are selected) – more common
       ▫ Weigh errors of the misclassified cases higher (all cases are incorporated, but with different weights) – does not work for some algorithms
  8. Boosting Example (Opitz, 1999)
     Original         1 2 3 4 5 6 7 8
     Training set 1   2 7 8 3 7 6 3 1
     Training set 2   1 4 5 4 1 5 6 4
     Training set 3   7 1 5 8 1 8 1 4
     Training set 4   1 1 6 1 1 3 1 5
  9. Ada-Boosting
     • Define e_k as the sum of the probabilities of the instances misclassified by the current classifier C_k
     • Multiply the probability of selecting the misclassified cases by b_k = (1 – e_k) / e_k
     • "Renormalize" the probabilities (i.e., rescale them so that they sum to 1)
     • Combine classifiers C_1 … C_k using weighted voting, where C_k has weight log(b_k)
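A rough sketch of the resampling variant described on this slide: misclassified cases get their selection probability multiplied by b_k, the probabilities are renormalized, and classifiers vote with weight log(b_k). The data set, base learner (decision stumps), ensemble size, and stopping rule are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n = len(X_tr)
p = np.full(n, 1.0 / n)                # selection probabilities, initially uniform
classifiers, alphas = [], []

for k in range(20):                    # ensemble size: arbitrary choice
    idx = rng.choice(n, size=n, replace=True, p=p)   # sample cases according to p
    clf = DecisionTreeClassifier(max_depth=1).fit(X_tr[idx], y_tr[idx])
    miss = clf.predict(X_tr) != y_tr
    e_k = p[miss].sum()                # e_k: summed probability of misclassified cases
    if e_k == 0 or e_k >= 0.5:         # degenerate classifier; stop (common convention)
        break
    b_k = (1 - e_k) / e_k
    p[miss] *= b_k                     # up-weight the misclassified cases...
    p /= p.sum()                       # ...and renormalize so probabilities sum to 1
    classifiers.append(clf)
    alphas.append(np.log(b_k))         # classifier C_k votes with weight log(b_k)

# Weighted vote over the ensemble (0/1 labels mapped to -1/+1 for the vote)
score = sum(a * (2 * c.predict(X_te) - 1) for a, c in zip(alphas, classifiers))
y_pred = (score > 0).astype(int)
print("boosted accuracy:", (y_pred == y_te).mean())
```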
  10. Arcing
      • Arc-x4 (Breiman, 1996)
      • For the i-th example in the training set, m_i is the number of times it was misclassified by the previous K classifiers
      • The probability p_i of selecting example i for the next classifier is
        p_i = (1 + m_i^4) / Σ_{j=1..N} (1 + m_j^4)
      • The exponent 4 was determined empirically
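A small sketch of the arc-x4 selection probabilities above; the misclassification counts m_i here are made-up numbers for illustration.

```python
import numpy as np

# m[i]: how many of the previous K classifiers misclassified example i
m = np.array([0, 3, 1, 0, 5, 2])        # illustrative counts, not from the slides

# arc-x4: selection probability proportional to 1 + m_i^4
w = 1 + m.astype(float) ** 4
p = w / w.sum()
print(p.round(3))   # heavily misclassified examples dominate the next sample
```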
  11. Bias plus variance decomposition
      • Geman, 1992
      • Bias: how close the average classifier is to the gold standard
      • Variance: how often classifiers disagree
      • Intrinsic target noise: error of the Bayes optimal classifier
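The slide only names the three terms; a toy Monte Carlo illustration of the squared-error version of this decomposition might look like the sketch below. The regression setup, noise level, and deliberately too-simple learner are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)              # "gold standard" target function
x0 = 1.0                             # evaluate the decomposition at a single point
sigma = 0.3                          # intrinsic target noise

preds = []
for _ in range(2000):                # many training sets -> many fitted models
    x = rng.uniform(0, 2 * np.pi, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    coef = np.polyfit(x, y, deg=1)   # a simple (biased) straight-line learner
    preds.append(np.polyval(coef, x0))

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2  # how far the *average* model is from the target
variance = preds.var()               # how much the models disagree with each other
print(f"bias^2={bias2:.3f}  variance={variance:.3f}  noise={sigma**2:.3f}")
```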
  12. Empirical comparison (Opitz, 1999)
      • 23 data sets from the UCI repository
      • 10-fold cross-validation
      • Backpropagation neural nets
      • Classification trees
      • Single, Simple (multiple NNs with different initial weights), Bagging, Ada-boosting, Arcing
      • Number of classifiers in the ensemble
      • "Accuracy" as the evaluation measure
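This is not the paper's setup, but a sketch of how a comparable single-vs-bagging-vs-boosting comparison could be run today with scikit-learn; the data set, base learners, and ensemble sizes are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "single tree": DecisionTreeClassifier(),
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=40),
    "boosted stumps": AdaBoostClassifier(n_estimators=40),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold CV, accuracy by default
    print(f"{name:15s} {scores.mean():.3f} +/- {scores.std():.3f}")
```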
  13. Results
      • Ensembles are generally better than single classifiers, and not very different from "Simple"
      • Ensembles within NNs and CTs are strongly correlated
      • Ada-boosting and arcing are strongly correlated even across different algorithms (boosting depends more on the data set than on the type of classifier algorithm)
      • 40 networks in the ensemble were sufficient
      • NNs are generally better than CTs
  14. Correlation coefficients
                    Neural Net                      Classification Tree
                    Simple  Bagging  Arcing  Ada    Bagging  Arcing  Ada
      Simple NN     1       .88      .87     .85    -.10     .38     .37
      Bagging NN    .88     1        .78     .78    -.11     .35     .35
      Arcing NN     .87     .78      1       .99    .14      .61     .60
      Ada NN        .85     .78      .99     1      .17      .62     .63
      Bagging CT    -.10    -.11     .14     .17    1        .68     .69
      Arcing CT     .38     .35      .61     .62    .68      1       .96
      Ada CT        .37     .35      .60     .63    .69      .96     1
  15. More results
      • Created data sets with different levels of noise (random selection of a possible value for a feature or the outcome) from the 23 sets
      • Created artificial data with noise
      • Boosting was worse with more noise
  16. Other work
      • Opitz and Shavlik
        ▫ Genetic search for classifiers that are accurate yet different
      • Create diverse classifiers by:
        ▫ Using different parameters
        ▫ Using different training sets
  17. Stacked Generalization
      • Wolpert, 1992
      • Level-0 models are based on different learning models and use the original data (level-0 data)
      • Level-1 models are based on the results of the level-0 models (level-1 data are the outputs of the level-0 models) – also called the "generalizer"
  18. Empirical comparison
      • Ting, 1999
      • Compared SG to the best single model and to arcing and bagging
      • Stacked C4.5, naïve Bayes, and a nearest-neighbor learner
      • Used multi-response linear regression as the generalizer
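A sketch in the spirit of this setup: level-0 models produce out-of-fold class-probability estimates (the level-1 data), and a linear model trained on them acts as the generalizer. C4.5 and multi-response linear regression are not directly available in scikit-learn, so a decision tree and logistic regression are substituted here, and the data set is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level-0 models: a tree, naive Bayes, and a nearest-neighbor learner
level0 = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]

# Level-1 data: out-of-fold probability estimates from each level-0 model
# (continuous estimates rather than predicted classes)
Z_tr = np.hstack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba") for m in level0
])

# Level-1 generalizer: a linear model trained on the level-1 data
generalizer = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)

# At prediction time, refit the level-0 models on all training data and stack their outputs
Z_te = np.hstack([m.fit(X_tr, y_tr).predict_proba(X_te) for m in level0])
print("stacked accuracy:", generalizer.score(Z_te, y_te))
```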
  19. Results
      • Better performance (accuracy) than the best level-0 model
      • Use of continuous estimates is better than use of the predicted class
      • Better than a majority vote
      • Similar performance to arcing and bagging
      • Good for parallel computation (like bagging)
  20. Related work
      • Decomposition of the problem into subtasks
      • Mixture of experts (Jacobs, 1991)
        ▫ The difference is that each expert takes care of a certain part of the input space
      • Hierarchical neural networks
        ▫ The difference is that cases are routed to pre-defined expert networks
  21. Ideas for final projects
      • Compare single classifiers, bagging, and boosting on other classifiers (e.g., logistic regression, rough sets)
      • Use other data sets for comparison
      • Use other performance measures
      • Study the effect of the voting scheme
      • Try to find a relationship between initial performance, number of cases, and number of classifiers within an ensemble
      • Genetic search for good, diverse classifiers
      • Analyze the effect of prior outlier removal on boosting
