
- 1. Decision Forest After Twenty Years. Lior Rokach, Dept. of Information Systems Engineering
- 2. Do we need hundreds of classifiers to solve real world classification problems? (Fernández-Delgado et al., 2014) Empirically comparing 179 classification algorithms over 121 datasets “The classifier most likely to be the best is random forest (achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets)”
- 3. Classification by Majority Voting (figure: a new instance x is passed to T=7 classifiers, their votes are accumulated, and the majority class, here class 1, is the final prediction; obtained from Alberto Suárez, 2012)
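The voting scheme on this slide can be sketched in a few lines of Python (a minimal illustration; the vote values are made up for the example):

```python
from collections import Counter

def majority_vote(votes):
    """Return the class receiving the most votes; ties go to the
    class seen first among the most common."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from T=7 classifiers for a new instance x
votes = [1, 0, 1, 1, 0, 1, 1]
print(majority_vote(votes))  # -> 1
```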
- 4. Condorcet's Jury Theorem (Marquis de Condorcet, 1784) • The most basic jury theorem in social choice • N = the number of jurors • p = the probability of an individual juror being right • µ = the probability that the jury gives the correct answer • p > 0.5 implies µ > p • and µ → 1 as N → ∞ (figure: µ as a function of N for p = 0.6)
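For independent jurors the theorem follows from the binomial distribution; a small sketch (assuming an odd N, so a strict majority always exists):

```python
from math import comb

def jury_accuracy(n, p):
    """P(a majority of n independent jurors is right),
    each juror being right with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With p = 0.6 the majority is more reliable than a single juror,
# and its reliability grows toward 1 as the jury grows.
for n in (1, 11, 101):
    print(n, jury_accuracy(n, 0.6))
```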
- 5. The Wisdom of Crowds • Francis Galton promoted statistics and invented the concept of correlation. • In 1906 Galton visited a livestock fair and stumbled upon an intriguing contest. • An ox was on display, and the villagers were invited to guess the animal's weight. • Nearly 800 gave it a go and, not surprisingly, not one hit the exact mark: 1,198 pounds. • Astonishingly, however, the average of those 800 guesses came close - very close indeed. It was 1,197 pounds.
- 6. Key Criteria for Crowd to be Wise • Diversity of opinion – Each person should have private information even if it's just an eccentric interpretation of the known facts. • Independence – People's opinions aren't determined by the opinions of those around them. • Decentralization – People are able to specialize and draw on local knowledge. • Aggregation – Some mechanism exists for turning private judgments into a collective decision.
- 7. The Diversity Tradeoff of individual trees
- 8. There’s no Real Tradeoff… • Ideally, all trees would be right about everything! • If not, they should be wrong about different cases.
- 9. Top-Down Induction of Decision Trees (figure: features New Recipients and Email Length; a first split on Email Length at 1.8 yields Spam/Ham leaves with 1 and 8 errors)
- 10. Top-Down Induction of Decision Trees (figure: a further split on Email Length at 4 yields leaves with 1, 1, and 3 errors)
- 11. Top-Down Induction of Decision Trees (figure: an additional split on New Recipients at 1 yields leaves with 1, 1, 1, and 0 errors)
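The node-splitting step illustrated in these slides can be sketched as a greedy threshold search on one numeric feature (toy code; the feature values and labels below are invented for illustration):

```python
def best_split(xs, ys):
    """Greedy TDIDT node step: choose the threshold on a single numeric
    feature that minimises misclassification error (binary labels 0/1)."""
    best_t, best_err = None, len(ys) + 1
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        # each side predicts its majority label, so errors = minority count
        err = (min(left.count(0), left.count(1)) +
               min(right.count(0), right.count(1)))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Toy data: email length vs. spam (1) / ham (0)
lengths = [0.5, 1.0, 1.5, 2.0, 3.0, 5.0]
labels  = [1,   1,   1,   0,   0,   0]
print(best_split(lengths, labels))  # -> (2.0, 0)
```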
- 12. Top-Down Induction of Decision Trees: Which One?
- 13. Why Do Decision Forests Work? • Local minima • Lack of sufficient data • Limited representation
- 14. Bias and Variance Decomposition • Bias: the tendency to consistently learn the same wrong thing, because the hypothesis space considered by the learning algorithm does not include sufficient hypotheses • Variance: the tendency to learn random things irrespective of the real signal, due to the particular training set used (figure: bias and variance as a function of tree size)
- 15. It all started about twenty years ago… Iterative Methods • Reduce both bias and variance errors • Hard to parallelize • AdaBoost (Freund & Schapire, 1996) • Gradient Boosted Trees (Friedman, 1999) • Feature-based Partitioned Trees (Rokach, 2008) • Stochastic gradient boosted distributed decision trees (Ye et al., 2009) • Parallel Boosted Regression Trees (Tyree et al., 2011) Non-Iterative Methods • Mainly reduce variance error • Embarrassingly parallel • Random decision forests (Ho, 1995) • Bagging (Bootstrap aggregating) (Breiman, 1996) • Random Subspace Decision Forest (Ho, 1998) • Randomized Tree (Dietterich, 2000) • Random Forest (Breiman, 2001) • Switching Classes (Martínez-Muñoz and Suárez, 2005) • Rotation Forest (Rodríguez et al., 2006) • Extremely Randomized Trees (Geurts et al., 2006) • Randomly Projected Trees (Schclar and Rokach, 2009)
- 16. A timeline of decision-forest methods, 1995–2006 (figure): Random decision forests (Ho, 1995), AdaBoost (Freund & Schapire, 1996), Bagging (Breiman, 1996), Random Subspace (Ho, 1998), Gradient Boosted Trees (Friedman, 1999), Random Forest (Breiman, 2001), Extremely Randomized Trees and Rotation Forest (2006); grouped into iterative and non-iterative methods
- 17. Random Forests (Breiman, 2001) 1. A bootstrap sample of size n is drawn from the training set with replacement 2. Each node split is evaluated on a random subset of the variables 3. No pruning
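The two sources of randomness in Breiman's recipe can be sketched with the standard library (a simplified illustration, not a full tree learner; the √d feature-subset size is the common default, not stated on the slide):

```python
import random

def bootstrap_sample(data, rng):
    """Step 1: draw n training cases with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def random_feature_subset(n_features, rng):
    """Step 2: each split considers only ~sqrt(d) randomly chosen features."""
    k = max(1, round(n_features ** 0.5))
    return rng.sample(range(n_features), k)

rng = random.Random(0)
print(len(bootstrap_sample(list(range(100)), rng)))  # -> 100 (with repeats)
print(len(random_feature_subset(16, rng)))           # -> 4
```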
- 18. Limited Representation
- 19. Rotation Forest (Rodríguez et al., 2006)
- 20. AdaBoost (Freund & Schapire, 1996) (figure: over successive boosting rounds, a training case misclassified so far gets a large weight in the next round, and a decision tree that classifies the weighted training cases well gets a strong vote) • “Best off-the-shelf classifier in the world” (Breiman, 1996)
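One AdaBoost reweighting round can be sketched as follows (the normalised form of the update; the weights and correctness flags are made up for the example):

```python
from math import log

def adaboost_round(weights, correct):
    """One boosting round: compute the weighted error, give this round's
    classifier vote alpha = 0.5*ln((1-err)/err), and reweight so the
    misclassified cases carry half the total mass in the next round."""
    err = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * log((1 - err) / err)
    new_w = [w * (0.5 / err if not c else 0.5 / (1 - err))
             for w, c in zip(weights, correct)]
    return new_w, alpha

# Four equally weighted cases, one misclassified this round
w, a = adaboost_round([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(w, a)  # the misclassified case now carries weight 0.5; alpha > 0
```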
- 21. Training Errors vs. Test Errors • Performance on the ‘letter’ dataset (Schapire et al., 1997) • Training error drops to 0 on round 5 • Test error continues to drop after round 5 (from 8.4% to 3.1%)
- 22. Decision Forest Thinning: Making the Forest Smaller • An overly thick decision forest results in: – Large storage requirements – Reduced compressibility – Prolonged prediction time – Reduced predictive performance
- 23. Forest Thinning • A post-processing step that aims to identify a subset of decision trees that performs at least as well as the original forest, discarding the other trees as redundant. • Collective-agreement-based thinning (Rokach, 2009): using a best-first search strategy with an agreement-based merit measure improves the accuracy of the original forest by 2% on average while using only circa 3% of its trees (results over 30 different datasets)
- 24. Instance-Based (Dynamic) Forest Thinning (Rokach, 2013) • Do we really need to query all classifiers in the ensemble? NO (figure: as in slide 3, votes from T=7 classifiers are accumulated for a new instance x, but querying can stop once the final class is already decided)
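The "NO" can be made concrete: stop querying once the leading class can no longer be overtaken by the votes still outstanding (a sketch of the idea, not the paper's exact algorithm; the classifiers here are dummies):

```python
def vote_until_decided(classifiers, x, n_classes=2):
    """Query ensemble members one at a time; stop as soon as the lead
    of the top class exceeds the number of votes still outstanding."""
    counts = [0] * n_classes
    for i, clf in enumerate(classifiers, start=1):
        counts[clf(x)] += 1
        remaining = len(classifiers) - i
        top, runner_up = sorted(counts)[-1], sorted(counts)[-2]
        if top - runner_up > remaining:
            return counts.index(top), i  # (final class, classifiers queried)
    return counts.index(max(counts)), len(classifiers)

# 7 dummy classifiers that all vote for class 1: decided after 4 queries
ensemble = [lambda x: 1] * 7
print(vote_until_decided(ensemble, None))  # -> (1, 4)
```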
- 25. Back to a Single Tree • Genuine training set vs. artificially expanded training set • The problem: the resulting forest is far from compact.
- 26. Decision Forest for Mitigating Learning Challenges • Class imbalance • Concept Drift • Curse of dimensionality • Multi-label classification
- 27. Beyond Classification Tasks • Regression tree (Breiman et al., 1984) • Survival tree (Bou-Hamad et al., 2011) • Clustering tree (Blockeel et al., 1998) • Recommendation tree (Gershman et al., 2010) • Markov model tree (Antwarg et al., 2012) • …
- 28. Summary • “Two heads are better than none. One hundred heads are so much better than one” – Dearg Doom, The Tain, Horslips, 1973 • “Great minds think alike, clever minds think together” Lior Zoref, 2011. • But they must be different, specialized • And it might be an idea to select only the best of them for the problem at hand
