2013-1 Machine Learning Lecture 02 - Decision Trees

1. Decision Tree (Dae-Ki Kang)
2. Definition
   • Definition #1
     ▫ A hierarchy of if-then's
     ▫ Node – test
     ▫ Edge – direction of control
   • Definition #2
     ▫ A tree that represents compression of data based on class
   • A manually generated decision tree is not interesting at all!
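Definition #1 can be read literally as nested if-then statements. A minimal sketch in Python over the PlayTennis attributes used later in the lecture (the tree shape here is illustrative, not the one learned from the data):

```python
# A decision tree as a hierarchy of if-then's: each "if" is a node (a test),
# each branch taken is an edge (the direction of control).
def play_tennis(outlook, humidity, wind):
    if outlook == "Sunny":          # node: test on Outlook
        if humidity == "High":      # node: test on Humidity
            return "No"
        return "Yes"
    if outlook == "Overcast":
        return "Yes"
    # outlook == "Rain"
    if wind == "Strong":            # node: test on Wind
        return "No"
    return "Yes"

print(play_tennis("Sunny", "High", "Weak"))   # -> No
```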
3. Decision tree for mushroom data
4. Algorithms
   • ID3
     ▫ Information gain
   • C4.5 (= J48 in WEKA) (and See5/C5.0)
     ▫ Information gain ratio
   • Classification and regression tree (CART)
     ▫ Gini gain
   • Chi-squared automatic interaction detection (CHAID)
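These criteria map onto off-the-shelf implementations. As one hedged example (scikit-learn is not part of the lecture and is an assumption here), scikit-learn's DecisionTreeClassifier is a CART-style learner that exposes the choice of criterion directly; it does not ship ID3 or C4.5:

```python
# Sketch: CART-style trees in scikit-learn (assumes scikit-learn is installed).
# criterion="gini" corresponds to Gini gain, criterion="entropy" to information gain.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy feature vectors
y = [0, 1, 1, 1]                        # toy labels

gini_tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(gini_tree.predict([[1, 1]]), entropy_tree.predict([[1, 1]]))
```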
5. Example from Tom Mitchell's book
6. Naïve strategy of choosing attributes (i.e. choose the next available attribute)
   [Figure: a tree that first splits on Outlook and then on Temp (Hot / Mild / Cool), with Play=Yes / Play=No counts at each node: root 9 Yes / 5 No; Sunny 2 Yes (9,11) / 3 No (1,2,8); Rain 3 Yes (4,5,10) / 2 No (6,14); Overcast 4 Yes (3,7,12,13) / 0 No.]
7. How to generate decision trees?
   • Optimal tree
     ▫ Equal to (or harder than) NP-hard
   • Greedy tree
     ▫ Greedy means big questions first; the strategy is divide and conquer
     ▫ Choose an easy-to-understand test such that the sub-data sets produced by the chosen test are the easiest to deal with
       - Usually choose an attribute as the test
       - Usually adopt an impurity measure to see how easy the sub-data sets are to deal with
   • Are there any other approaches? There are many, and the question is open.
8. Impurity criteria (the three measures are sketched in code after this slide)
   • Entropy → Information Gain, Information Gain Ratio
     ▫ Most popular
     ▫ Entropy: sum of -p log p over the class proportions p
     ▫ IG: Entropy(S) - sum over sub-data sets t of (|t|/|S|) * Entropy(t)
     ▫ IG favors attributes like Social Security Number or ID (many distinct values), hence the Information Gain Ratio
   • Gini index → Gini gain (used in CART)
     ▫ Related to the Area Under the Curve
     ▫ Gini index: 1 - sum of squared class fractions
   • Misclassification rate
     ▫ (misclassified instances) / (all instances)
     ▫ Problematic: leads to many indistinguishable splits (where other splits are more desirable)
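A minimal sketch of the three criteria and of information gain, using base-2 logs and the convention 0 · log 0 = 0 (the function names are mine, not the lecture's):

```python
from math import log2

def entropy(counts):
    """Entropy = sum of -p * log2(p) over the class proportions."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini index = 1 - sum of squared class fractions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def misclassification(counts):
    """Misclassified / all, if the node predicts its majority class."""
    return 1.0 - max(counts) / sum(counts)

def information_gain(parent_counts, child_counts_list):
    """IG = Entropy(S) - sum over sub-data sets t of |t|/|S| * Entropy(t)."""
    n = sum(parent_counts)
    weighted = sum(sum(t) / n * entropy(t) for t in child_counts_list)
    return entropy(parent_counts) - weighted
```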
9. Using IG for choosing attributes
   [Figure: the split on Outlook, with Play=Yes / Play=No counts: root 9 Yes / 5 No; Sunny 2 Yes (9,11) / 3 No (1,2,8); Overcast 4 Yes (3,7,12,13) / 0 No; Rain 3 Yes (4,5,10) / 2 No (6,14).]

   IG(S) = Entropy(S) - Sum over subsets S_i of (|S_i|/|S|) * Entropy(S_i)

   IG(Outlook) = Entropy(Outlook)
                 - |Sunny|/|Outlook| * Entropy(Sunny)
                 - |Overcast|/|Outlook| * Entropy(Overcast)
                 - |Rain|/|Outlook| * Entropy(Rain)

   Entropy(Outlook) = -(9/14) log(9/14) - (5/14) log(5/14)
   |Sunny|/|Outlook| * Entropy(Sunny) = (5/14) * (-(2/5) log(2/5) - (3/5) log(3/5))
   |Overcast|/|Outlook| * Entropy(Overcast) = (4/14) * (-(4/4) log(4/4) - (0/4) log(0/4))
   |Rain|/|Outlook| * Entropy(Rain) = (5/14) * (-(3/5) log(3/5) - (2/5) log(2/5))
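Plugging the counts from this slide into the entropy definition above reproduces the arithmetic (a self-contained check, base-2 logs assumed):

```python
from math import log2

def H(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# Counts from the slide: S has 9 Yes / 5 No; Outlook splits it into
# Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No), Rain (3 Yes, 2 No).
splits = [(2, 3), (4, 0), (3, 2)]
ig_outlook = H((9, 5)) - sum(sum(t) / 14 * H(t) for t in splits)
print(round(ig_outlook, 3))   # about 0.247 with base-2 logs
```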
10. Zero Occurrence
   • When a feature value never occurs in the training set → zero frequency → PANIC: makes all terms zero
   • Smooth the distribution
     ▫ Laplacian smoothing
     ▫ Dirichlet priors smoothing
     ▫ and many more (Absolute Discounting, Jelinek-Mercer smoothing, Katz smoothing, Good-Turing smoothing, etc.)
11. Calculating IG with Laplacian smoothing
   [Figure: the same Outlook split as before, now with Laplace-smoothed counts.]

   IG(S) = Entropy(S) - Sum over subsets S_i of (|S_i|/|S|) * Entropy(S_i)

   IG(Outlook) = Entropy(Outlook)
                 - |Sunny|/|Outlook| * Entropy(Sunny)
                 - |Overcast|/|Outlook| * Entropy(Overcast)
                 - |Rain|/|Outlook| * Entropy(Rain)

   Entropy(Outlook) = -(10/16) log(10/16) - (6/16) log(6/16)
   |Sunny|/|Outlook| * Entropy(Sunny) = (6/17) * (-(3/7) log(3/7) - (4/7) log(4/7))
   |Overcast|/|Outlook| * Entropy(Overcast) = (5/17) * (-(5/6) log(5/6) - (1/6) log(1/6))
   |Rain|/|Outlook| * Entropy(Rain) = (6/17) * (-(4/7) log(4/7) - (3/7) log(3/7))
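The smoothed fractions on this slide come from add-one (Laplacian) smoothing: +1 per class count, so the root (9 Yes, 5 No) becomes 10/16 and 6/16, each branch size gets +1 out of 14 + 3 = 17, and the Overcast branch's zero count becomes 1/6. A minimal sketch of that adjustment (base-2 logs assumed):

```python
from math import log2

def H(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def laplace(counts, k=1):
    """Add-k (Laplacian) smoothing of raw class counts."""
    return tuple(c + k for c in counts)

print(H(laplace((9, 5))))   # entropy of (10/16, 6/16), as on the slide
print(H(laplace((4, 0))))   # Overcast becomes (5/6, 1/6): no more zero-frequency panic
```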
12. Overfitting
   • Training set error
     ▫ Error of the classifier on the training data
     ▫ It is a bad idea to use up all the data for training: you will have no data left to evaluate the learning algorithm.
   • Test set error
     ▫ Error of the classifier on the test data
     ▫ Jackknife: use n-1 examples to learn and 1 to test; repeat n times.
     ▫ x-fold stratified cross-validation: divide the data into x folds with the same class proportions; train on x-1 folds and test on 1 fold; repeat x times (sketched below).
   • Overfitting
     ▫ The input data is incomplete (Quine).
     ▫ The input data do not reflect all possible cases.
     ▫ The input data can include noise.
     ▫ I.e., fitting the classifier tightly to the input data is a bad idea.
   • Occam's razor
     ▫ Old axiom used to prove the existence of God.
     ▫ "Plurality should not be posited without necessity"
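A hedged sketch of x-fold stratified cross-validation using scikit-learn (an assumption, not part of the lecture): StratifiedKFold keeps the class proportions in every fold, and the classifier is trained on x-1 folds and evaluated on the remaining one, x times.

```python
# Sketch: 5-fold stratified cross-validation of a decision tree
# (assumes scikit-learn is installed; iris is a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))   # test-fold accuracy
print(sum(scores) / len(scores))
```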
13. Razors and Canon
   • Occam's razor (Ockham's razor)
     ▫ "Plurality is not to be posited without necessity"
     ▫ Similar to a principle of parsimony
     ▫ If two hypotheses have almost equal predictive power, we prefer the more concise one.
   • Hanlon's razor
     ▫ Never attribute to malice that which is adequately explained by stupidity.
   • Morgan's Canon
     ▫ In no case is an animal activity to be interpreted in terms of higher psychological processes if it can be fairly interpreted in terms of processes which stand lower in the scale of psychological evolution and development.
14. Example: Playing Tennis (taken from Andrew Moore's)
   [Figure: two candidate splits of the (9+, 5-) dataset. Humidity: High → (3+, 4-), Norm → (6+, 1-). Wind: Weak → (6+, 2-), Strong → (3+, 3-).]

   I_Humidity = sum over h in {High, Norm} and p in {+, -} of P(h, p) log [ P(h, p) / (P(h) P(p)) ] ≈ 0.151
   I_Wind     = sum over w in {Weak, Strong} and p in {+, -} of P(w, p) log [ P(w, p) / (P(w) P(p)) ] ≈ 0.048
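The two numbers can be checked directly from the counts shown, using the equivalent information-gain form (entropy of the parent minus the weighted entropies of the children); a self-contained check with base-2 logs:

```python
from math import log2

def H(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def IG(parent, children):
    n = sum(parent)
    return H(parent) - sum(sum(t) / n * H(t) for t in children)

print(IG((9, 5), [(3, 4), (6, 1)]))   # Humidity: about 0.152 (the 0.151 on the slide)
print(IG((9, 5), [(6, 2), (3, 3)]))   # Wind:     about 0.048
```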
15. Prediction for Nodes
   What is the prediction for each node?
   From Andrew Moore's slides
16. Prediction for Nodes
17. Recursively Growing Trees
   [Figure: the original dataset is partitioned according to the value of the attribute we split on (cylinders = 4, 5, 6, 8).]
   From Andrew Moore's slides
18. Recursively Growing Trees
   [Figure: a tree is then built recursively from the records in each partition (cylinders = 4, 5, 6, 8).]
   From Andrew Moore's slides
19. A Two Level Tree (recursively growing trees)
20. When Should We Stop Growing Trees?
   Should we split this node?
21. Base Cases
   • Base Case One: If all records in the current data subset have the same output, then don't recurse.
   • Base Case Two: If all records have exactly the same set of input attributes, then don't recurse.
22. Base Cases: An Idea
   • Base Case One: If all records in the current data subset have the same output, then don't recurse.
   • Base Case Two: If all records have exactly the same set of input attributes, then don't recurse.
   • Proposed Base Case 3: If all attributes have zero information gain, then don't recurse.
   Is this a good idea? (The three base cases appear in the growing sketch after this slide.)
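A minimal sketch of the greedy, recursive growing procedure with all three base cases (helper names and the tiny PlayTennis-style records are mine, for illustration only):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, target="Play"):
    labels = [r[target] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r[target])
    return entropy(labels) - sum(len(g) / len(labels) * entropy(g) for g in groups.values())

def build_tree(records, attrs, target="Play"):
    labels = [r[target] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:      # Base Case One: same output everywhere
        return majority
    if not attrs:                  # Base Case Two: no attributes left to test
        return majority
    gains = {a: info_gain(records, a, target) for a in attrs}
    best = max(gains, key=gains.get)
    if gains[best] == 0:           # Proposed Base Case 3: zero gain everywhere
        return majority
    tree = {best: {}}
    for value in set(r[best] for r in records):
        subset = [r for r in records if r[best] == value]
        tree[best][value] = build_tree(subset, [a for a in attrs if a != best], target)
    return tree

# Toy records in the PlayTennis style used throughout the lecture.
data = [
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "Play": "No"},
]
print(build_tree(data, ["Outlook", "Wind"]))
```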
23. Old Topic: Overfitting
24. Pruning
   • Prepruning (= forward pruning)
   • Postpruning (= backward pruning)
     ▫ Reduced error pruning
     ▫ Rule post-pruning
25. Pruning Decision Trees
   • Stop growing the tree in time, or:
   • Build the full decision tree as before.
   • When you can grow it no more, start to prune:
     ▫ Reduced error pruning
     ▫ Rule post-pruning
26. Reduced Error Pruning
   • Split the data into a training set and a validation set
   • Build a full decision tree over the training set
   • Keep removing the node whose removal maximally increases validation set accuracy (a sketch of this loop follows)
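A minimal sketch of that loop on a hand-built tree of nested dicts (the tree representation, the toy validation set, and the helper names are assumptions for illustration): repeatedly collapse the internal node whose replacement by its majority-class leaf most increases validation accuracy, and stop when no collapse helps.

```python
import copy

# Toy tree: internal nodes are {"attr", "majority", "children"}; leaves are labels.
tree = {"attr": "Outlook", "majority": "Yes", "children": {
    "Sunny": {"attr": "Wind", "majority": "No",
              "children": {"Weak": "No", "Strong": "Yes"}},
    "Overcast": "Yes",
    "Rain": "Yes",
}}

validation = [({"Outlook": "Sunny", "Wind": "Weak"}, "No"),
              ({"Outlook": "Sunny", "Wind": "Strong"}, "No"),
              ({"Outlook": "Rain", "Wind": "Weak"}, "Yes")]

def predict(node, x):
    while isinstance(node, dict):
        node = node["children"].get(x[node["attr"]], node["majority"])
    return node

def accuracy(node, data):
    return sum(predict(node, x) == y for x, y in data) / len(data)

def internal_paths(node, path=()):
    """Yield the branch-value path to every internal node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["children"].items():
            yield from internal_paths(child, path + (value,))

def collapse(root, path):
    """Copy the tree, replacing the node at `path` with its majority-class leaf."""
    root = copy.deepcopy(root)
    if not path:
        return root["majority"]
    node = root
    for value in path[:-1]:
        node = node["children"][value]
    node["children"][path[-1]] = node["children"][path[-1]]["majority"]
    return root

# Greedily remove the node whose removal most increases validation accuracy.
while True:
    candidates = [collapse(tree, p) for p in internal_paths(tree)]
    if not candidates:
        break
    best = max(candidates, key=lambda t: accuracy(t, validation))
    if accuracy(best, validation) > accuracy(tree, validation):
        tree = best
    else:
        break

print(tree)   # on this toy data, the Wind subtree under Sunny collapses to "No"
```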
27. Original Decision Tree
28. Pruned Decision Tree
29. Reduced Error Pruning
30. Rule Post-Pruning
   • Convert the tree into rules (the conversion step is sketched below)
   • Prune the rules by removing preconditions
   • Sort the final rules by their estimated accuracy
   Most widely used method (e.g., C4.5).
   Other methods: statistical significance test (chi-square).
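Converting a tree into rules is a walk over root-to-leaf paths; a small sketch on a nested-dict tree of the kind produced by the growing sketch earlier (the representation is an assumption, and only the conversion step is shown, not the precondition pruning):

```python
def tree_to_rules(node, conditions=()):
    """One rule per root-to-leaf path: (list of attribute=value preconditions, class)."""
    if not isinstance(node, dict):            # leaf
        yield list(conditions), node
        return
    for attr, children in node.items():
        for value, child in children.items():
            yield from tree_to_rules(child, conditions + ((attr, value),))

# An illustrative tree over the PlayTennis attributes.
tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

for pre, label in tree_to_rules(tree):
    print(" AND ".join(f"{a}={v}" for a, v in pre) or "TRUE", "->", label)
```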
31. Real Value Inputs
   • What should we do to deal with real value inputs?

   mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
   good  4          97            75          2265    18.2          77         asia
   bad   6          199           90          2648    15            70         america
   bad   4          121           110         2600    12.8          77         europe
   bad   8          350           175         4100    13            73         america
   bad   6          198           95          3102    16.5          74         america
   bad   4          108           94          2379    16.5          73         asia
   bad   4          113           95          2228    14            71         asia
   bad   8          302           139         3570    12.8          78         america
   ...   ...        ...           ...         ...     ...           ...        ...
   good  4          120           79          2625    18.6          82         america
   bad   8          455           225         4425    10            70         america
   good  4          107           86          2464    15.5          76         europe
   bad   5          131           103         2830    15.9          78         europe
32. Information Gain
   • x: a real value input
   • t: split value
   • Find the split value t such that the mutual information I(x, y : t) between the thresholded x and the class label y is maximized (a sketch of this threshold scan follows).
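A minimal sketch of that search: candidate thresholds are taken between consecutive sorted values of x, and the t maximizing the information gain of the split x ≤ t versus x > t is kept (equivalent to maximizing the mutual information between the thresholded x and y). The toy numbers are the cylinders column and good/bad labels from the first rows of the table above.

```python
from math import log2
from collections import Counter

def H(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(xs, ys):
    """Scan thresholds between consecutive sorted x values; return (best t, its IG)."""
    pairs = sorted(zip(xs, ys))
    base = H(ys)
    best_t, best_gain = None, -1.0
    for (x1, _), (x2, _) in zip(pairs, pairs[1:]):
        if x1 == x2:
            continue
        t = (x1 + x2) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        gain = base - len(left) / len(ys) * H(left) - len(right) / len(ys) * H(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

cyl = [4, 6, 4, 8, 6, 4, 4, 8]
label = ["good", "bad", "bad", "bad", "bad", "bad", "bad", "bad"]
print(best_split(cyl, label))   # picks the threshold between 4 and 6 cylinders
```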
33. Pros and Cons
   • Pros
     ▫ Easy to understand
     ▫ Fast learning algorithms (because they are greedy)
     ▫ Robust to noise
     ▫ Good accuracy
   • Cons
     ▫ Unstable
     ▫ Hard to represent some functions (parity, XOR, etc.)
     ▫ Duplication in subtrees
     ▫ Cannot express all of first-order logic, because a test cannot refer to two or more different objects
34. Generation of data from a decision tree (based on definition #2)
   • Decision tree with support for each node → rule set
     ▫ support = # of training instances assigned to a node
   • Rule set → instances
   • In this way, one can combine multiple decision trees by combining their rule sets
   • cf. Bayesian classifiers → fractional instances
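A small sketch of the regeneration idea under definition #2 (the rule format and the toy supports are mine, chosen in the spirit of the PlayTennis leaf counts): each leaf rule carries its support, and expanding every rule into that many identical instances recovers a dataset that the tree compresses. Attribute values not fixed by a rule are simply left unspecified here.

```python
# Rules as (preconditions, class, support); support = # of training instances at the leaf.
rules = [
    ({"Outlook": "Sunny", "Humidity": "High"}, "No", 3),
    ({"Outlook": "Overcast"}, "Yes", 4),
    ({"Outlook": "Rain", "Wind": "Weak"}, "Yes", 3),
]

def rules_to_instances(rules):
    """Expand each rule into `support` identical instances (unspecified attributes omitted)."""
    for preconditions, label, support in rules:
        for _ in range(support):
            yield {**preconditions, "class": label}

instances = list(rules_to_instances(rules))
print(len(instances), instances[0])
```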
35. Extensions and further considerations
   • Extensions
     ▫ Alternating decision tree
     ▫ Naïve Bayes Tree
     ▫ Attribute Value Taxonomy guided Decision Tree
     ▫ Recursive Naïve Bayes
     ▫ and many more
   • Further research
     ▫ Decision graph
     ▫ Bottom-up generation of decision trees
     ▫ Evolutionary construction of decision trees
     ▫ Integrating two decision trees
     ▫ and many more
