# Decision Forest: Twenty Years of Research

A decision tree is a predictive model that recursively partitions the covariate space into subspaces such that each subspace constitutes a basis for a different prediction function. Decision trees can be used for various learning tasks, including classification, regression and survival analysis. Due to their unique benefits, decision trees have become one of the most powerful and popular approaches in data science. A decision forest aims to improve on the predictive performance of a single decision tree by training multiple trees and combining their predictions.

Published in: Science

### Decision Forest: Twenty Years of Research

1. Decision Forest After Twenty Years. Lior Rokach, Dept. of Information Systems Engineering
2. Do we need hundreds of classifiers to solve real world classification problems? (Fernández-Delgado et al., 2014)
    - An empirical comparison of 179 classification algorithms over 121 datasets.
    - “The classifier most likely to be the best is random forest (achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets)”
3. Classification by majority voting. [Figure: a new instance x is passed to T = 7 classifiers, their votes are accumulated per class, and the final class is 1. Obtained from Alberto Suárez, 2012.]
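The majority-voting scheme on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `majority_vote`, the example votes, and the tie-breaking rule (the label seen first wins) are all assumptions made for the sketch.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that received the most votes.

    Ties go to the label that appears first in `votes`; this is an
    arbitrary choice made for this sketch.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    # most_common lists labels with equal counts in first-seen order
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes from T = 7 classifiers for a single instance x
votes = [1, 1, 2, 1, 2, 1, 0]
print(majority_vote(votes))
```

In a decision forest, each entry of `votes` would be the prediction of one tree for the same instance.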
4. Condorcet's Jury Theorem (Marquis of Condorcet, 1785)
    - The most basic jury theorem in social choice.
    - N = the number of jurors
    - p = the probability of an individual juror being right
    - µ = the probability that a jury gives the correct answer
    - p > 0.5 implies µ > p, and µ → 1 as N → ∞.
    - [Figure: µ plotted for p = 0.6.]
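The jury theorem can be checked numerically. The sketch below assumes the usual setting of the theorem: jurors vote independently, each is right with the same probability p, the decision is by simple majority, and N is odd so that ties cannot occur. The function name `jury_accuracy` is hypothetical.

```python
from math import comb


def jury_accuracy(n_jurors, p):
    """Probability that a simple majority of independent jurors is right."""
    if n_jurors < 1 or n_jurors % 2 == 0:
        raise ValueError("use a positive, odd number of jurors")
    majority = n_jurors // 2 + 1
    # Sum the binomial probabilities of every outcome with a correct majority
    return sum(
        comb(n_jurors, k) * p**k * (1 - p) ** (n_jurors - k)
        for k in range(majority, n_jurors + 1)
    )


# µ grows with N when p = 0.6
for n in (1, 3, 11, 101):
    print(n, round(jury_accuracy(n, 0.6), 4))
```

With a single juror, µ equals p; adding jurors raises µ toward 1 as long as p > 0.5. The independence assumption is what makes this work, which is why diversity among trees matters in a forest.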
5. The Wisdom of Crowds
    - Francis Galton promoted statistics and invented the concept of correlation.
    - In 1906 Galton visited a livestock fair and stumbled upon an intriguing contest.
    - An ox was on display, and the villagers were invited to guess the animal's weight.
    - Nearly 800 gave it a go and, not surprisingly, not one hit the exact mark: 1,198 pounds.
    - Astonishingly, however, the average of those 800 guesses came close, very close indeed. It was 1,197 pounds.
6. Key Criteria for a Crowd to Be Wise
    - Diversity of opinion: each person should have private information, even if it's just an eccentric interpretation of the known facts.
    - Independence: people's opinions aren't determined by the opinions of those around them.
    - Decentralization: people are able to specialize and draw on local knowledge.
    - Aggregation: some mechanism exists for turning private judgments into a collective decision.
7. The Diversity Tradeoff of Individual Trees
8. There's No Real Tradeoff…
    - Ideally, all trees would be right about everything!
    - If not, they should be wrong about different cases.
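9. Top Down Induction of Decision Trees. [Figure: emails plotted by New Recipients and Email Length.] Tree after the first split:
    - Email Len < 1.8: Spam (1 error)
    - Email Len ≥ 1.8: Ham (8 errors)
10. Top Down Induction of Decision Trees. [Figure: emails plotted by New Recipients and Email Length.] Tree after the second split:
    - Email Len < 1.8: Spam (1 error)
    - Email Len ≥ 1.8:
        - Email Len < 4: Spam (1 error)
        - Email Len ≥ 4: Ham (3 errors)
11. Top Down Induction of Decision Trees. [Figure: emails plotted by New Recipients and Email Length.] Tree after the third split:
    - Email Len < 1.8: Spam (1 error)
    - Email Len ≥ 1.8:
        - Email Len < 4: Spam (1 error)
        - Email Len ≥ 4:
            - New Recip < 1: Ham (1 error)
            - New Recip ≥ 1: Spam (0 errors)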
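A single step of top-down induction, choosing one threshold on one numeric feature, can be sketched as follows. The sketch scores a split by its number of training errors, as the slides do; practical learners more often use an impurity measure such as Gini or information gain. The function names and the toy data are hypothetical and are not taken from the slides.

```python
from collections import Counter


def count_errors(labels):
    """Errors made when every item is assigned the majority label."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def best_threshold(values, labels):
    """Return (threshold, errors) for the split `value < threshold`
    that minimizes training errors, or None if no split is possible."""
    points = sorted(set(values))
    best = None
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2  # candidate threshold between two neighbours
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        errors = count_errors(left) + count_errors(right)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


# Hypothetical toy data: email length and class label
lengths = [1.0, 1.5, 2.0, 3.0, 5.0, 6.0]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
print(best_threshold(lengths, labels))
```

A full tree learner would repeat this search over every feature, split the data on the winner, and recurse into each branch until a stopping criterion is met.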
12. Top Down Induction of Decision Trees: Which One?
13. Why Does Decision Forest Work?
    - Local minima
    - Lack of sufficient data
    - Limited representation
14. Bias and Variance Decomposition
    - Bias: the tendency to consistently learn the same wrong thing because the hypothesis space considered by the learning algorithm does not include sufficient hypotheses.
    - Variance: the tendency to learn random things irrespective of the real signal, due to the particular training set used.
    - [Figure: axis labeled Tree Size.]
15. It all started about twenty years ago …
    - Iterative Methods: reduce both bias and variance errors; hard to parallelize.
        - AdaBoost (Freund & Schapire, 1996)
        - Gradient Boosted Trees (Friedman, 1999)
        - Feature-based Partitioned Trees (Rokach, 2008)
        - Stochastic gradient boosted distributed decision trees (Ye et al., 2009)
        - Parallel Boosted Regression Trees (Tyree et al., 2011)
    - Non-Iterative Methods: mainly reduce variance error; embarrassingly parallel.
        - Random decision forests (Ho, 1995)
        - Bagging (Bootstrap aggregating) (Breiman, 1996)
        - Random Subspace Decision Forest (Ho, 1998)
        - Randomized Tree (Dietterich, 2000)
        - Random Forest (Breiman, 2001)
        - Switching Classes (Martínez-Muñoz and Suárez, 2005)
        - Rotation Forest (Rodríguez et al., 2006)
        - Extremely Randomized Trees (Geurts et al., 2006)
        - Randomly Projected Trees (Schclar and Rokach, 2009)
16. [Figure: timeline from 1995 to 2006 of iterative and non-iterative methods: Random decision forests [74], AdaBoost [33], Bagging [72], Random Subspace [99], Gradient Boosted Trees [84], Random Forest [73], Extremely Randomized Trees [2], Rotation Forest [99].]
17. Random Forests (Breiman, 2001). For each tree:
    1. Draw a bootstrap sample of size n from the training set, with replacement.
    2. At each node, evaluate candidate splits on a random subset of the variables.
    3. Do not prune.
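The two sources of randomness in the Random Forest recipe, bootstrap sampling of rows and random selection of candidate variables at a node, can be sketched with the standard library alone. This is an illustration under assumptions, not Breiman's implementation: the function names, the seed, the 10-row training set, and the choice of 3 candidates out of 9 features are all hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))


def candidate_features(n_features, n_candidates, rng):
    """Step 2: pick the distinct feature indices one node may split on."""
    return rng.sample(range(n_features), n_candidates)


rng = random.Random(0)   # fixed seed so that runs are repeatable
rows = list(range(10))   # hypothetical training set of 10 row ids
sample = bootstrap_sample(rows, rng)
features = candidate_features(n_features=9, n_candidates=3, rng=rng)
print(sample)
print(features)
```

Because sampling is with replacement, some rows typically appear more than once in `sample` and others not at all. A new feature subset would be drawn at every node, and each tree is then grown on its own bootstrap sample without pruning.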
18. Limited Representation
19. Rotation Forest (Rodríguez et al., 2006)
20. AdaBoost (Freund & Schapire, 1996). [Figure: decision trees across boosting rounds, with annotations marking training cases that are correctly classified, a training case that has a large weight in this round, and a decision tree (DT) that has a strong vote.] “Best off-the-shelf classifier in the world” (Breiman, 1996)
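The two ideas annotated on the AdaBoost slide, case weights that grow for misclassified cases and a vote strength per tree, can be sketched for one boosting round. The sketch uses the common textbook formulation for binary classification, with vote strength alpha = 0.5 · ln((1 − ε) / ε) for weighted error ε; the function name `adaboost_round` and the example weights are assumptions made for illustration.

```python
import math


def adaboost_round(weights, correct):
    """Reweight the training cases after one boosting round.

    weights: current case weights, summing to 1
    correct: correct[i] is True if this round's tree got case i right
    Returns (alpha, new_weights), where alpha is the tree's vote strength.
    """
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < eps < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Shrink the weights of correct cases, grow those of misclassified ones
    scaled = [w * math.exp(-alpha if ok else alpha)
              for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


# Hypothetical round: four equally weighted cases, the last one misclassified
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
print(alpha, new_weights)
```

After normalization the misclassified cases carry half of the total weight, so the next tree is pushed to get them right. A tree with a lower weighted error receives a larger alpha and therefore a stronger vote.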
21. Training Errors vs Test Errors. Performance on the ‘letter’ dataset (Schapire et al., 1997). [Figure: training error and test error per boosting round.]
    - Training error drops to 0 on round 5.
    - Test error continues to drop after round 5 (from 8.4% to 3.1%).
22. Decision Forest Thinning: Making the Forest Smaller. A decision forest that is too thick results in:
    - Large storage requirements
    - Reduced comprehensibility
    - Prolonged prediction time
    - Reduced predictive performance
23. Forest thinning
    - A post-processing step that aims to identify a subset of decision trees that performs at least as well as the original forest, and to discard the remaining trees as redundant members.
    - Collective-agreement-based Thinning (Rokach, 2009): using a best-first search strategy together with the method's merit measure improves the accuracy of the original forest by 2% on average while using only about 3% of its trees (results based on 30 different datasets).
24. Instance-based (dynamic) forest thinning (Rokach, 2013). [Figure: a new instance x is passed to T = 7 classifiers, their votes are accumulated per class, and the final class is 1.] Do we really need to query all classifiers in the ensemble? No.
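One simple way to avoid querying every classifier is to stop voting as soon as the leading class can no longer be overtaken. The sketch below illustrates only that idea; it is a swapped-in stopping rule chosen for clarity and is not claimed to be the method of Rokach (2013). The function name and the constant classifiers are hypothetical.

```python
from collections import Counter


def vote_with_early_stop(classifiers, x):
    """Query classifiers in order and stop once the outcome is settled.

    classifiers: callables that map an instance to a class label
    Returns (label, number_of_classifiers_queried).
    """
    if not classifiers:
        raise ValueError("need at least one classifier")
    counts = Counter()
    total = len(classifiers)
    for queried, clf in enumerate(classifiers, start=1):
        counts[clf(x)] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        remaining = total - queried
        # Even if every remaining vote went to a rival, the leader still wins
        if lead - runner_up > remaining:
            break
    return ranked[0][0], queried


# Hypothetical ensemble of T = 7 constant classifiers
ensemble = [lambda x, v=v: v for v in (1, 1, 1, 1, 2, 2, 2)]
print(vote_with_early_stop(ensemble, x=None))
```

This rule returns the same label as a full majority vote whenever that vote has a strict winner, so it saves prediction time without changing the result. Instance-based thinning methods can go further by choosing which trees to query for each instance.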
25. Back to a Single Tree. [Figure: genuine training set and artificially expanded training set.] The problem: the resulting forest is far from being compact.
26. Decision Forest for Mitigating Learning Challenges
    - Class imbalance
    - Concept drift
    - Curse of dimensionality
    - Multi-label classification
27. Beyond Classification Tasks
    - Regression tree (Breiman et al., 1984)
    - Survival tree (Bou-Hamad et al., 2011)
    - Clustering tree (Blockeel et al., 1998)
    - Recommendation tree (Gershman et al., 2010)
    - Markov model tree (Antwarg et al., 2012)
    - …
28. Summary
    - “Two heads are better than none. One hundred heads are so much better than one” (Dearg Doom, The Tain, Horslips, 1973)
    - “Great minds think alike, clever minds think together” (Lior Zoref, 2011)
    - But they must be different and specialized.
    - And it might be an idea to select only the best of them for the problem at hand.