Decision Forest
After Twenty Years
Lior Rokach
Dept. of Information Systems Engineering
Do we need hundreds of classifiers to solve real-world classification problems?
(Fernández-Delgado et al., 2014)
Empirically comparing 179 classification algorithms over 121 datasets:
“The classifier most likely to be the best is random forest (achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets)”
Classification by majority voting
[Figure: a new instance x is classified by an ensemble of T = 7 trees; each tree casts a vote, the votes are accumulated, and the class with the most votes (here class 1) is returned. Obtained from Alberto Suárez, 2012.]
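As a minimal sketch (not code from the talk), hard majority voting over a list of already-trained scikit-learn-style classifiers could look like this in Python; the classifiers list and the instance x are assumed to exist elsewhere:

from collections import Counter

def majority_vote(classifiers, x):
    # one hard vote per tree; x is a single feature vector, wrapped so predict() gets a 2-D input
    votes = [clf.predict([x])[0] for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]   # class with the most accumulated votes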
Condorcet’s Jury Theorem
(Marquis of Condorcet, 1784)
• The most basic jury theorem in social choice
• N = the number of jurors
• p = the probability of an individual juror being right
• µ = the probability that the jury gives the correct answer
• p > 0.5 implies µ > p,
• and µ → 1 as N → ∞.
[Plot: µ as a function of N for p = 0.6]
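The theorem is easy to check numerically: µ is the upper tail of a Binomial(N, p) distribution. A small Python sketch (N taken odd to avoid ties):

from math import comb

def jury_accuracy(N, p):
    # probability that more than half of N independent jurors are right,
    # i.e. the upper tail of a Binomial(N, p) distribution
    return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(N // 2 + 1, N + 1))

for N in (1, 11, 101, 1001):
    print(N, round(jury_accuracy(N, 0.6), 4))   # climbs from 0.6 towards 1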
The Wisdom of Crowds
• Francis Galton promoted statistics and
invented the concept of correlation.
• In 1906 Galton visited a livestock fair and
stumbled upon an intriguing contest.
• An ox was on display, and the villagers were
invited to guess the animal's weight.
• Nearly 800 gave it a go and, not surprisingly,
not one hit the exact mark: 1,198 pounds.
• Astonishingly, however, the average of those
800 guesses came close - very close indeed. It
was 1,197 pounds.
Key Criteria for Crowd to be Wise
• Diversity of opinion
– Each person should have private information even if it's just an
eccentric interpretation of the known facts.
• Independence
– People's opinions aren't determined by the opinions of those
around them.
• Decentralization
– People are able to specialize and draw on local knowledge.
• Aggregation
– Some mechanism exists for turning private judgments into a
collective decision.
The Diversity Tradeoff
[Figure relating ensemble diversity to the accuracy of individual trees]
There’s no Real Tradeoff…
• Ideally, all trees would be right about
everything!
• If not, they should be wrong about different
cases.
Top Down Induction of Decision Trees
[Figure: training emails plotted by New Recipients and Email Length; splitting on Email Len at 1.8 labels each side by its majority class (Ham or Spam), leaving 1 error on one branch and 8 errors on the other.]
Top Down Induction of Decision Trees
[Figure: the 8-error branch is split again on Email Len at 4, giving leaves with 1 error, 1 error, and 3 errors.]
Top Down Induction of Decision Trees
[Figure: instead of the 3-error leaf, a further split on New Recip at 1 produces leaves with 1 error and 0 errors.]
Which One?
[Figure: the candidate trees side by side.]
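At each step the algorithm greedily picks the split that most reduces the number of errors. A minimal pure-Python sketch of that search (the feature values and labels below are hypothetical, loosely echoing the Email Length example above):

def best_threshold(values, labels):
    # try every observed value as a cut point "feature < t vs >= t" and count the
    # errors made when each side is labelled by its majority class (labels are 0/1)
    best_t, best_errors = None, len(labels)
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v < t]
        right = [y for v, y in zip(values, labels) if v >= t]
        errors = sum(min(side.count(0), side.count(1)) for side in (left, right) if side)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

email_len = [1.2, 1.5, 2.5, 3.1, 4.2, 5.0]   # hypothetical feature values
label     = [1,   1,   1,   1,   0,   0]     # 1 = spam, 0 = ham (hypothetical)
print(best_threshold(email_len, label))      # -> (4.2, 0)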
Why Do Decision Forests Work?
• Local minima
• Lack of sufficient data
• Limited Representation
Bias and Variance Decomposition
• Bias
– The tendency to consistently learn the same wrong thing because the hypothesis space considered by the learning algorithm does not include sufficient hypotheses
• Variance
– The tendency to learn random things irrespective of the real signal due to the particular training set used
[Plot: bias and variance error as a function of tree size]
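To make the decomposition concrete, here is a rough Python sketch (my own synthetic setup, not the experiment behind the plot) that estimates squared bias and variance for trees of increasing size, assuming NumPy and scikit-learn are available; deeper trees should show lower bias and higher variance:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200)[:, None]

def true_f(x):
    return np.sin(4 * np.pi * x).ravel()        # synthetic target function

for depth in (1, 3, 10, None):                  # None = fully grown tree
    preds = []
    for _ in range(200):                        # many random training sets
        x = rng.uniform(0, 1, (50, 1))
        y = true_f(x) + rng.normal(0, 0.3, 50)
        preds.append(DecisionTreeRegressor(max_depth=depth).fit(x, y).predict(x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)   # squared bias
    variance = preds.var(axis=0).mean()
    print(f"depth={depth}: bias^2={bias2:.3f}, variance={variance:.3f}")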
It all started about twenty years ago …
Iterative Methods
• Reduce both Bias and Variance errors
• Hard to parallelize
• AdaBoost (Freund & Schapire, 1996)
• Gradient Boosted Trees (Friedman, 1999)
• Feature-based Partitioned Trees (Rokach, 2008)
• Stochastic gradient boosted distributed decision trees (Ye et al., 2009)
• Parallel Boosted Regression Trees (Tyree et al., 2011)
Non-Iterative Methods
• Mainly reduce variance error
• Embarrassingly parallel (contrasted with boosting in the sketch after this list)
• Random decision forests (Ho, 1995)
• Bagging (Bootstrap aggregating) (Breiman, 1996)
• Random Subspace Decision Forest (Ho, 1998)
• Randomized Tree (Dietterich, 2000)
• Random Forest (Breiman, 2001)
• Switching Classes (Martínez-Muñoz and Suárez, 2005)
• Rotation Forest (Rodríguez et al., 2006)
• Extremely Randomized Trees (Geurts et al., 2006)
• Randomly Projected Trees (Schclar and Rokach, 2009)
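To illustrate the contrast between the two families, a small scikit-learn sketch with one representative from each; the dataset and hyperparameters are arbitrary choices for illustration only:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Non-iterative: the trees are independent, so they can be fit in parallel (n_jobs=-1).
bagging = BaggingClassifier(n_estimators=100, n_jobs=-1, random_state=0)
# Iterative: each boosting round depends on the previous one, so the rounds run sequentially.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, clf in [("bagging", bagging), ("adaboost", boosting)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))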
[Timeline figure, 1995 to 2006, placing the methods above in their two families: Non-Iterative Methods (Random decision forests 1995, Bagging 1996, Random Subspace 1998, Random Forest 2001, Rotation Forest 2006, Extremely Randomized Trees 2006) and Iterative Methods (AdaBoost 1996, Gradient Boosted Trees 1999).]
Random Forests
(Breiman, 2001)
1. Draw a bootstrap sample of size n from the training set (with replacement)
2. At each node, evaluate splits on a random subset of variables
3. No pruning
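These three ingredients map directly onto scikit-learn's RandomForestClassifier; a minimal sketch (the training data X_train, y_train are assumed to exist):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    bootstrap=True,        # 1. each tree sees a bootstrap sample of size n
    max_features="sqrt",   # 2. each split considers a random subset of variables
    max_depth=None,        # 3. no pruning: trees are grown fully
    n_jobs=-1,
    random_state=0,
)
# rf.fit(X_train, y_train); rf.predict(X_test)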
Limited Representation
[Figure slides illustrating the limited representation of a single tree compared with a forest.]
Rotation Forest
(Rodríguez et al., 2006)
AdaBoost
(Freund & Schapire, 1996)
[Diagram of boosting rounds: training cases misclassified in one round receive large weights in the next round, while correctly classified cases are down-weighted, and a decision tree that classifies the weighted training cases well gets a strong vote in the final combination.]
“Best off-the-shelf classifier in the world” – Breiman (1996)
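A minimal from-scratch sketch of discrete AdaBoost with decision stumps, making the diagram's two mechanisms explicit (labels are assumed to be encoded as +1/-1; scikit-learn is used only for the stumps):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=50):
    w = np.full(len(y), 1.0 / len(y))                      # start with uniform case weights
    ensemble = []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # accurate tree -> strong vote
        w *= np.exp(-alpha * y * pred)                     # misclassified cases gain weight
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    # weighted vote of all stumps
    return np.sign(sum(alpha * stump.predict(X) for alpha, stump in ensemble))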
Training Errors vs Test Errors
Performance on the ‘letter’ dataset (Schapire et al., 1997)
[Plot: training and test error versus boosting rounds. Training error drops to 0 on round 5, yet test error continues to drop after round 5 (from 8.4% to 3.1%).]
Decision Forest Thinning:
Making the Forest Smaller
• A decision forest that is too thick results in:
– Large storage requirements
– Reduced comprehensibility
– Prolonged prediction time
– Reduced predictive performance
Forest thinning
• A post-processing step that aims to identify a subset of decision trees that performs at least as well as the original forest, discarding the other trees as redundant members.
• Collective-agreement-based Thinning (Rokach, 2009):
Using a best-first search strategy with the collective-agreement merit measure improves the accuracy of the original forest by 2% on average while using only about 3% of its trees (results based on 30 different datasets).
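A hedged sketch of the general thinning idea in Python: greedy forward selection of trees against a held-out validation set. The merit measure here is plain majority-vote accuracy rather than the collective-agreement measure, and class labels are assumed to be small non-negative integers:

import numpy as np

def thin_forest(trees, X_val, y_val, max_trees=10):
    preds = [t.predict(X_val) for t in trees]          # cache every tree's votes
    chosen = []
    while len(chosen) < max_trees:
        best_i, best_acc = None, -1.0
        for i in range(len(trees)):                    # try adding each unused tree
            if i in chosen:
                continue
            votes = np.array([preds[j] for j in chosen + [i]])
            majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
            acc = (majority == y_val).mean()
            if acc > best_acc:
                best_i, best_acc = i, acc
        chosen.append(best_i)                          # keep the most helpful tree
    return [trees[i] for i in chosen]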
Instance-based (dynamic) Forest thinning
(Rokach, 2013)
[Figure: the majority-voting diagram again; for a new instance x, the votes of the T = 7 trees are accumulated one by one until the final class (here class 1) is determined.]
Do we really need to query all classifiers in the ensemble?
NO. Once the leading class has accumulated more votes than the trees not yet queried could overturn, the remaining trees can be skipped for this instance.
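A sketch of the underlying idea in Python (the stopping rule below is the generic "majority already decided" condition, not necessarily the exact 2013 procedure):

from collections import Counter

def vote_with_early_stopping(trees, x):
    counts = Counter()
    remaining = len(trees)
    for tree in trees:
        counts[tree.predict([x])[0]] += 1
        remaining -= 1
        ranked = counts.most_common(2)
        lead_votes = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if lead_votes > runner_up + remaining:    # the lead can no longer be overtaken
            return ranked[0][0]                   # stop without querying the rest
    return counts.most_common(1)[0][0]

On easy instances most trees are never queried, which shortens prediction time without changing the majority decision.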
Back To a Single Tree
The problem: the resulting forest is far from compact.
[Figure: the genuine training set is artificially expanded, and a single tree is trained on the expanded set in place of the forest.]
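A rough sketch of one common realization of this idea (in the spirit of "born-again" trees): synthesize extra points, let the forest label them, and fit one tree on the expanded set. The sampling scheme and parameter values below are illustrative assumptions; forest, X, and y are assumed to exist:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def single_tree_from_forest(forest, X, y, n_synthetic=5000, seed=0):
    rng = np.random.default_rng(seed)
    # synthetic points drawn feature-by-feature from the empirical marginals
    X_syn = np.column_stack([rng.choice(X[:, j], n_synthetic) for j in range(X.shape[1])])
    X_big = np.vstack([X, X_syn])
    y_big = np.concatenate([y, forest.predict(X_syn)])   # the forest labels the new points
    return DecisionTreeClassifier().fit(X_big, y_big)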
Decision Forest for Mitigating
Learning Challenges
• Class imbalance
• Concept Drift
• Curse of dimensionality
• Multi-label classification
Beyond Classification Tasks
• Regression tree (Breiman et al., 1984)
• Survival tree (Bou-Hamad et al., 2011)
• Clustering tree (Blockeel et al., 1998)
• Recommendation tree (Gershman et al., 2010)
• Markov model tree (Antwarg et al., 2012)
• ….
Summary
• “Two heads are better than none. One
hundred heads are so much better than
one”
– Dearg Doom, The Tain, Horslips, 1973
• “Great minds think alike, clever minds think together”
– Lior Zoref, 2011
• But they must be different, specialized
• And it might be an idea to select only the
best of them for the problem at hand
