PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions

  1. Why Ensembles Win Data Mining Competitions
     A Predictive Analytics Center of Excellence (PACE) Tech Talk, November 14, 2012
     Dean Abbott, Abbott Analytics, Inc.
     Blog: http://abbottanalytics.blogspot.com
     URL: http://www.abbottanalytics.com
     Twitter: @deanabb
     Email: dean@abbottanalytics.com
  2. Outline
     – Motivation for Ensembles
     – How Ensembles are Built
     – Do Ensembles Violate Occam's Razor?
     – Why Do Ensembles Win?
  3. PAKDD Cup 2007 Results: the Score Metric Changes the Winner

     Modeling Technique | Implementation | Location | Affiliation | AUC ROC (Trapezoidal Rule) | AUC Rank | Top Decile Response Rate | Top Decile Rank
     TreeNet + Logistic Regression | Salford Systems | Mainland China | Practitioner | 70.01% | 1 | 13.00% | 7
     Probit Regression | SAS | USA | Practitioner | 69.99% | 2 | 13.13% | 6
     MLP + n-Tuple Classifier | | Brazil | Practitioner | 69.62% | 3 | 13.88% | 1
     TreeNet | Salford Systems | USA | Practitioner | 69.61% | 4 | 13.25% | 4
     TreeNet | Salford Systems | Mainland China | Practitioner | 69.42% | 5 | 13.50% | 2
     Ridge Regression | Rank | Belgium | Practitioner | 69.28% | 6 | 12.88% | 9
     2-Layer Linear Regression | | USA | Practitioner | 69.14% | 7 | 12.88% | 9
     Logistic Regression + Decision Stump + AdaBoost + VFI | | Mainland China | Academia | 69.10% | 8 | 13.25% | 4
     Logistic Average of Single Decision Functions | | Australia | Practitioner | 68.85% | 9 | 12.13% | 17
     Logistic Regression | Weka | Singapore | Academia | 68.69% | 10 | 12.38% | 16
     Logistic Regression | | Mainland China | Practitioner | 68.58% | 11 | 12.88% | 9
     Decision Tree + Neural Network + Logistic Regression | | Singapore | | 68.54% | 12 | 13.00% | 7
     Scorecard Linear Additive Model | Xeno | USA | Practitioner | 68.28% | 13 | 11.75% | 20
     Random Forest | Weka | USA | | 68.04% | 14 | 12.50% | 14
     Expanding Regression Tree + RankBoost + Bagging | Weka | Mainland China | Academia | 68.02% | 15 | 12.50% | 14
     Logistic Regression | SAS + Salford Systems | India | Practitioner | 67.58% | 16 | 12.00% | 19
     J48 + BayesNet | Weka | Mainland China | Academia | 67.56% | 17 | 11.63% | 21
     Neural Network + General Additive Model | Tiberius | USA | Practitioner | 67.54% | 18 | 11.63% | 21
     Decision Tree + Neural Network | | Mainland China | Academia | 67.50% | 19 | 12.88% | 9
     Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 66.71% | 20 | 13.50% | 2
     Neural Network | SAS | USA | Academia | 66.36% | 21 | 12.13% | 17
     Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 65.95% | 22 | 11.63% | 21
     Neural Network | SAS | USA | Academia | 65.69% | 23 | 9.25% | 32
     Multi-dimension Balanced Random Forest | | Mainland China | Academia | 65.42% | 24 | 12.63% | 13
     Neural Network | SAS | USA | Academia | 65.28% | 25 | 11.00% | 26
     CHAID Decision Tree | SPSS | Argentina | Academia | 64.53% | 26 | 11.25% | 24
     Under-Sampling Based on Clustering + CART Decision Tree | | Taiwan | Academia | 64.45% | 27 | 11.13% | 25
     Decision Tree + Neural Network + Polynomial Regression | SAS | USA | Academia | 64.26% | 28 | 9.38% | 30

     The winner depends on the metric: the top model by AUC is an ensemble (TreeNet + Logistic Regression), while the top model by top-decile response rate (MLP + n-Tuple Classifier) ranks only third on AUC. The original slide braces the ensemble entries.
  4. Netflix Prize
     – 2006 Netflix state of the art (Cinematch): RMSE = 0.9525
     – Prize: reduce this RMSE by 10%, to 0.8572
     – 2007: Korbell team Progress Prize winner
       · 107-algorithm ensemble
       · Top algorithm: SVD, RMSE = 0.8914
       · 2nd algorithm: Restricted Boltzmann Machine, RMSE = 0.8990
       · Mini-ensemble (SVD + RBM): RMSE = 0.88
     http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
  5. Common Kinds of Ensembles vs. Single Models
     [Figure: taxonomy of ensemble methods vs. single classifiers, from Zhuowen Tu, "Ensemble Classification Methods: Bagging, Boosting, and Random Forests"]
  6. What are Model Ensembles?
     – Combining outputs from multiple models into a single decision
     – Models can be created using the same algorithm, or several different algorithms
     [Figure: multiple model outputs feeding decision logic to produce an ensemble prediction]
  7. Creating Model Ensembles, Step 1: Generate Component Models
     From a single data set, vary the data or the model parameters to produce multiple models and predictions:
     – Case (Record) Weights: bootstrapping, sampling
     – Data Values: add noise, recode data
     – Learning Parameters: vary learning rates, pruning severity, random seeds
     – Variable Subsets: vary candidate inputs and features
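A minimal Python sketch of Step 1 (scikit-learn and NumPy assumed; the function name and parameters are hypothetical): each component tree is trained on its own bootstrap sample of records and a random subset of candidate inputs.

```python
# Sketch: generate component models by varying records and variable subsets.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def make_component_models(X, y, n_models=10, feature_frac=0.7):
    """X, y are NumPy arrays; returns (model, column-subset) pairs."""
    models = []
    n_rows, n_cols = X.shape
    for _ in range(n_models):
        rows = rng.integers(0, n_rows, size=n_rows)          # bootstrap records
        cols = rng.choice(n_cols, size=max(1, int(feature_frac * n_cols)),
                          replace=False)                     # vary candidate inputs
        tree = DecisionTreeClassifier()
        tree.fit(X[rows][:, cols], y[rows])
        models.append((tree, cols))
    return models
```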
  8. Creating Model Ensembles, Step 2: Combining Models
     – Combining Methods
       · Estimation: average outputs
       · Classification: average probabilities or vote (best M of N)
     – Variance Reduction
       · Build complex, overfit models
       · All models built in the same manner
     – Bias Reduction
       · Build simple models
       · Subsequent models weight records with errors more (or model the actual errors)
     Multiple models and predictions are combined into a single decision or prediction value.
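A companion sketch for Step 2, reusing the hypothetical (model, columns) pairs from the Step 1 sketch and assuming binary classification: average the predicted probabilities for estimation-style combining, or take a majority vote.

```python
# Sketch: two common ways to combine component model outputs.
import numpy as np

def combine_by_average(models, X):
    # Average the component models' class-1 probabilities
    probs = np.stack([m.predict_proba(X[:, cols])[:, 1]
                      for m, cols in models])
    return probs.mean(axis=0)

def combine_by_vote(models, X):
    # Majority vote across component classifiers (best M of N)
    votes = np.stack([m.predict(X[:, cols]) for m, cols in models])
    return (votes.mean(axis=0) > 0.5).astype(int)
```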
  9. How Model Complexity Affects Errors
     [Figure from Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)]
  10. Commonly Used Information-Theoretic Complexity Penalties
      – BIC: Bayesian Information Criterion
      – AIC: Akaike Information Criterion
      – MDL: Minimum Description Length
      For a nice summary: http://en.wikipedia.org/wiki/Regularization_(mathematics)
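For reference, the standard forms of the first two criteria, with $k$ the number of fitted parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

MDL instead selects the model minimizing the total description length $L(\text{model}) + L(\text{data}\mid\text{model})$. In all three, a larger penalty term discourages more complex models.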
  11. Four Keys to Effective Ensembling
      – Diversity of opinion
      – Independence
      – Decentralization
      – Aggregation
      From The Wisdom of Crowds, James Surowiecki
  12. Bagging
      – Method
        · Create many data sets by bootstrapping (can also do this with cross-validation)
        · Create one decision tree for each data set
        · Combine decision trees by averaging (or voting) final decisions
        · Primarily reduces model variance rather than bias
      – Results
        · On average, better than any individual tree
      The final answer is the average across the trees.
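A minimal bagging sketch (scikit-learn assumed; synthetic data stands in for a real training set, and parameter values are illustrative): each tree fits its own bootstrap sample and the ensemble averages their class probabilities.

```python
# Sketch: bagging a deliberately complex tree over bootstrapped data sets.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
bagger = BaggingClassifier(
    DecisionTreeClassifier(),   # one unpruned tree per bootstrap sample
    n_estimators=100,           # number of bootstrapped data sets
    bootstrap=True,             # sample records with replacement
    random_state=42,
)
bagger.fit(X, y)
avg_prob = bagger.predict_proba(X)[:, 1]   # averaged class probabilities
```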
  13. Boosting (AdaBoost)
      – Method
        · Create a tree using the training data set
        · Score each data point, indicating where each incorrect decision is made (errors)
        · Retrain, giving rows with incorrect classifications more weight; repeat
        · Final prediction is a weighted average of all models (combining via weighted sum acts as model regularization)
        · Best to create weak models: simple models (just a few splits for a decision tree), letting the boosting iterations find the complexity
        · Often used with trees or Naïve Bayes
      – Results
        · Usually better than an individual tree or Bagging
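A minimal AdaBoost sketch (scikit-learn assumed; synthetic data as a stand-in, parameter values illustrative): the weak learner is a decision stump, each round reweights the misclassified rows, and the final prediction is a weighted vote.

```python
# Sketch: AdaBoost over decision stumps (single-split trees).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
booster = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak model: one split
    n_estimators=200,                     # boosting iterations
    learning_rate=0.5,                    # shrinks each model's vote
    random_state=42,
)
booster.fit(X, y)
```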
  14. Random Forest Ensembles
      – Random Forest (RF) Method
        · Exact same methodology as Bagging, but with a twist
        · At each split, rather than using the entire set of candidate inputs, use a random subset of candidate inputs
        · Generates diversity of samples and inputs (splits)
      – Results
        · On average, better than any individual tree, Bagging, or even Boosting
      The final answer is the average across the trees.
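A minimal random forest sketch (scikit-learn assumed; synthetic data as a stand-in, parameter values illustrative): bagging's record resampling plus a random subset of candidate inputs tried at every split.

```python
# Sketch: random forest = bagging + random input subsets per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
forest = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # random subset of inputs tried at each split
    bootstrap=True,        # same record resampling as bagging
    random_state=42,
)
forest.fit(X, y)
```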
  15. Stochastic Gradient Boosting
      – Implemented in MART (Jerry Friedman) and TreeNet (Salford Systems)
      – Algorithm
        · Begin with a simple model: a constant value
        · Build a simple tree (perhaps 6 terminal nodes); now there are 6 possible levels, whereas before there was one
        · Score the model and compute errors. The score is the sum of all previous trees, weighted by a learning rate
        · Build a new tree with the errors as the target variable; repeat
      – Results
        · TreeNet has won 2 KDD-Cup competitions and numerous others
        · It is less prone to outliers and overfitting than AdaBoost
      The final answer is an additive model: the weighted sum of all the trees.
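A hedged sketch in the same spirit using scikit-learn's gradient boosting (not MART or TreeNet themselves; synthetic data as a stand-in, parameter values illustrative): small trees fit to the errors of the ensemble so far, each shrunk by a learning rate and summed into an additive model.

```python
# Sketch: stochastic gradient boosting with small trees and shrinkage.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
gbm = GradientBoostingClassifier(
    n_estimators=300,
    max_leaf_nodes=6,    # small trees: ~6 terminal nodes per stage
    learning_rate=0.1,   # shrinkage applied to each new tree
    subsample=0.5,       # the "stochastic" part: sample rows each stage
    random_state=42,
)
gbm.fit(X, y)
```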
  16. Ensembles of Trees: Smoothers
      – Ensembles smooth jagged decision boundaries
      Pictures from T.G. Dietterich, "Ensemble Methods in Machine Learning," in Multiple Classifier Systems, Cagliari, Italy, 2000.
  17. Heterogeneous Model Ensembles on Glass Data
      – Model prediction diversity obtained by using different algorithms: tree, NN, RBF, Gaussian, regression, k-NN
      – Combining 3-5 models is on average better than the best single model
      – Combining all 6 models is not best (best is a 3- or 4-model combination), but is close
      – This is an example of reducing model variance through ensembles, but not model bias
      [Chart: max, min, and average percent classification error vs. number of models combined, 1 through 6]
  18. Direct Marketing Example: Considerations for I-Miner
      From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles," presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
      Steps:
      1. Join by record: all models applied to the same data in the same row order
      2. Change the probability names
      3. Average the probabilities (the decision is avg_prob > threshold)
      4. Decile the probability ranks
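A hedged pandas sketch of the four steps above (not the I-Miner stream itself; the column name "prob" and the threshold are hypothetical stand-ins):

```python
# Sketch: join model scores by record, rename, average, decide, decile.
import numpy as np
import pandas as pd

def ensemble_scores(score_frames, threshold=0.5):
    # Steps 1-2: join by record (same row order) with non-colliding names
    renamed = [df.rename(columns={"prob": f"prob_{i}"})
               for i, df in enumerate(score_frames)]
    scores = pd.concat(renamed, axis=1)
    # Step 3: average probabilities; the decision is avg_prob > threshold
    scores["avg_prob"] = scores.mean(axis=1)
    scores["decision"] = scores["avg_prob"] > threshold
    # Step 4: decile the probability ranks (decile 1 = highest scores)
    ranks = scores["avg_prob"].rank(method="first", ascending=False)
    scores["decile"] = pd.qcut(ranks, 10, labels=list(range(1, 11)))
    return scores

# Example with three hypothetical model-score frames
frames = [pd.DataFrame({"prob": np.random.rand(100)}) for _ in range(3)]
print(ensemble_scores(frames).head())
```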
  19. Direct Marketing Example: Variable Inclusion in Model Ensembles
      – Twenty-five different variables were represented in the ten models
      – Only five were represented in seven or more models
      – Twelve were represented in only one or two models
      [Chart: number of models sharing each variable vs. number of variables]
      From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles," presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
  20. Fraud Detection Example: Deployment Stream
      – Model scoring picks up scores from each model, combines them in an ensemble, and pushes the scores back to the database
  21. Fraud Detection Example: Overall Model Score on Validation Data
      – The "Score" weights false alarms and sensitivity
      – Overall, the ensemble is clearly best, and much better than the best single model on testing data
      [Bar chart: normalized total score on the validation population for each model (best, worst, averages of 5 and 10, bagging, boosting, ensemble), with the ensemble scoring highest]
      From Abbott, D., and Tom Konchan, "Advanced Fraud Detection Techniques for Vendor Payments," Predictive Analytics Summit, San Diego, CA, February 24, 2011.
  22. Are Ensembles Better?
      – Accuracy? Yes
      – Interpretability? No
      – Do ensembles contradict Occam's Razor?
        · Principle: simpler models generalize better; avoid overfit!
        · Ensembles are more complex than single models (an RF may have hundreds of trees)
        · Yet these more complex models perform better on held-out data
        · But... are they really more complex?
  23. Generalized Degrees of Freedom
      – Linear regression: a degree of freedom in the model is simply a parameter
        · This does not extrapolate to non-linear methods
        · The number of "parameters" in non-linear methods can produce more complexity or less
      – Enter Generalized Degrees of Freedom (GDF)
        · GDF (Ye 1998) "randomly perturbs (adds noise to) the output variable, re-runs the modeling procedure, and measures the changes to the estimates" (for the same number of parameters)
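A hedged sketch of Ye's (1998) Monte Carlo GDF estimate: add small noise to the target, re-run the modeling procedure, and sum the per-observation sensitivities of the fitted values. Here `fit_predict` is a hypothetical stand-in for any modeling procedure (fit on (X, y), return fitted values on X), and `tau` is an illustrative noise scale.

```python
# Sketch: estimate GDF by perturbing the target and measuring fit changes.
import numpy as np

def estimate_gdf(fit_predict, X, y, n_perturb=50, tau=0.5, seed=0):
    rng = np.random.default_rng(seed)
    base = fit_predict(X, y)                     # unperturbed fitted values
    deltas, changes = [], []
    for _ in range(n_perturb):
        d = rng.normal(0.0, tau, size=len(y))    # noise added to the output
        deltas.append(d)
        changes.append(fit_predict(X, y + d) - base)
    D, C = np.array(deltas), np.array(changes)
    # Slope (through the origin) of each fitted value's change vs. its noise
    sensitivities = (D * C).sum(axis=0) / (D * D).sum(axis=0)
    return sensitivities.sum()                   # GDF = sum of sensitivities
```

For linear regression this recovers the parameter count, since the fitted values are linear in y.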
  24. The Math of GDF
      From Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
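The math on the original slide was an image; a hedged reconstruction of Ye's definition as it is usually stated:

```latex
\mathrm{GDF}(M) \;=\; \sum_{i=1}^{n} \frac{\partial\, \mathbb{E}\left[\hat{y}_i\right]}{\partial y_i}
```

That is, the total sensitivity of the fitted values $\hat{y}_i$ to their own targets $y_i$. For a linear model with $k$ parameters this sum equals $k$ (the trace of the hat matrix), so GDF reduces to the familiar parameter count.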
  25. The Effect of GDF
      From Elder, J.F. IV, "The Generalization Paradox of Ensembles," Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853–864
  26. Why Ensembles Win
      – Performance, performance, performance
      – Single models sometimes provide insufficient accuracy
        · Neural networks become stuck in local minima
        · Decision trees run out of data, and are greedy (they can get fooled early)
        · Single algorithms keep pushing performance using the same ideas (basis function / algorithm), and are incapable of thinking outside their box
      – Different algorithms, or algorithms built using resampled data, achieve the same level of accuracy but on different cases: they identify different ways to get the same level of accuracy
  27. Conclusion
      – Ensembles can achieve significant model performance improvements
      – The key to good ensembles is diversity in sampling and variable selection
      – They can be applied to a single algorithm, or across multiple algorithms
      – Just do it!
  28. References
      – Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
      – Elder, J.F. IV, "The Generalization Paradox of Ensembles," Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853–864. DOI: 10.1198/1061860032733
      – Abbott, D.W., "The Benefits of Creating Ensembles of Classifiers," Abbott Analytics, Inc., http://www.abbottanalytics.com/white-paper-classifiers.php
      – Abbott, D.W., "A Comparison of Algorithms at PAKDD2007," blog post at http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html
  29. References (continued)
      – Tu, Zhuowen, "Ensemble Classification Methods: Bagging, Boosting, and Random Forests," http://www.loni.ucla.edu/~ztu/courses/2010_CS_spring/cs269_2010_ensemble.pdf
      – Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120–131.
