Christof Monz
Informatics Institute
University of Amsterdam

Data Mining
Week 2: Decision Tree Learning

Today's Class
Decision trees
Decision tree learning algorithms
Learning bias
Overfitting
Pruning
Extensions to learning with real values

Decision Tree Learning
The main algorithms were introduced by Quinlan in the 1980s
A decision tree is a set of hierarchically nested classification rules
Each rule is a node in the tree that investigates a specific attribute
Branches correspond to the values of the attributes

Example Data
When to play tennis? (training data)

day  outlook   temperature  humidity  wind    play
1    sunny     hot          high      weak    no
2    sunny     hot          high      strong  no
3    overcast  hot          high      weak    yes
4    rain      mild         high      weak    yes
5    rain      cool         normal    weak    yes
6    rain      cool         normal    strong  no
7    overcast  cool         normal    strong  yes
8    sunny     mild         high      weak    no
9    sunny     cool         normal    weak    yes
10   rain      mild         normal    weak    yes
11   sunny     mild         normal    strong  yes
12   overcast  mild         high      strong  yes
13   overcast  hot          normal    weak    yes
14   rain      mild         high      strong  no

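For the code sketches below, here is one way the training set above could be encoded in Python; the list-of-dicts layout and the key names are illustrative choices, not part of the slides.

# Play-tennis training data from the table above, one dict per instance.
# The "play" key holds the target class; all other keys are attributes.
DATA = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "wind": "strong", "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "wind": "strong", "play": "no"},
]
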
Decision Tree
Nodes check attribute values
Leaves are final classifications
Decision trees can be represented as logical expressions in disjunctive normal form
Each path from the root corresponds to a conjunction of attribute-value equations
All paths of the tree are combined by disjunction
(outlook=sunny ∧ humidity=normal) ∨ (outlook=overcast) ∨ (outlook=rain ∧ wind=weak)

Appropriate Problems for DTs
Attributes have discrete values (real-value extension discussed later)
The class values are discrete (real-value extension discussed later)
Training data may contain errors
Training data may contain instances with missing/unknown attribute values

Learning Decision Trees
Many different decision trees can be learned for a given training set
A number of criteria apply
• The tree should be as accurate as possible
• The tree should be as simple as possible
• The tree should generalize as well as possible
Basic questions
• Which attributes should be included in the tree?
• In which order should they be used in the tree?
Standard decision tree learning algorithms: ID3 and C4.5

Entropy
The better an attribute discriminates between the classes in the data, the higher it should be placed in the tree
How do we quantify the degree of discrimination?
One way to do this is to use entropy
Entropy measures the uncertainty/ambiguity in the data
H(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
where p⊕ / p⊖ is the probability of a positive/negative class occurring in S

Entropy
In general, the entropy of a subset S of the training examples with respect to the target class is defined as:
H(S) = − ∑_{c∈C} p_c log2 p_c
where C is the set of possible classes and p_c is the probability that an instance in S belongs to class c
Note: we define 0 log2 0 = 0

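A minimal Python sketch of the entropy formula above; the function name and the use of collections.Counter are my own choices, not from the slides.

import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c); classes with count 0 never appear in the
    Counter, which implements the convention 0 * log2(0) = 0."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

# Entropy of the full play-tennis set [9+, 5-]:
print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940
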
Information Gain
Information gain is the reduction in entropy:
gain(S,A) = H(S) − ∑_{v∈values(A)} (|S_v| / |S|) H(S_v)
where values(A) is the set of possible values of attribute A and S_v is the subset of S for which attribute A has value v
gain(S,A) is the number of bits saved when encoding an arbitrary member of S by knowing the value of attribute A

Information Gain Example
S = [9+,5−]
values(wind) = {weak, strong}
S_weak = [6+,2−]
S_strong = [3+,3−]
gain(S,wind) = H(S) − ∑_{v∈{weak,strong}} (|S_v| / |S|) H(S_v)
             = H(S) − (8/14) H(S_weak) − (6/14) H(S_strong)
             = 0.94 − (8/14) · 0.811 − (6/14) · 1.0
             = 0.048

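The same computation in Python, building on the entropy function above; the partitioning helper is an illustrative sketch rather than code from the slides.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(instances, attribute, target="play"):
    """gain(S,A) = H(S) - sum_v (|S_v| / |S|) * H(S_v)."""
    labels = [x[target] for x in instances]
    g = entropy(labels)
    for value in set(x[attribute] for x in instances):
        subset = [x[target] for x in instances if x[attribute] == value]
        g -= (len(subset) / len(instances)) * entropy(subset)
    return g

# Reproduces the example above (assuming DATA from the earlier sketch):
# gain(DATA, "wind") -> ~0.048
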
ID3 DT Learning
The ID3 algorithm computes the information gain for each node in the tree and each attribute, and chooses the attribute with the highest gain
For instance, at the root (S) the gains are:
• gain(S,outlook) = 0.246
• gain(S,humidity) = 0.151
• gain(S,wind) = 0.048
• gain(S,temperature) = 0.029
Hence outlook is chosen for the top node
ID3 then iteratively selects the attribute with the highest gain for each daughter of the previous node, . . .

ID3 Algorithm
ID3(S', S, node, attr):
  if (for all s in S: class(s) = c) return leaf node with class c
  else if (attr is empty) return leaf node with most frequent class in S
  else if (S is empty) return leaf node with most frequent class in S'
  else
    a = argmax_{a' ∈ attr} gain(S, a')
    attribute(node) = a
    for each v ∈ values(a):
      new(node_v); new edge(node, node_v); label(node, node_v) = v
      ID3(S, S_v, node_v, attr − {a})
Initial call: ID3(∅, S, root, A)

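A runnable Python sketch of the ID3 pseudocode above, reusing the entropy/gain helpers from the earlier sketches (repeated here so the snippet stands alone); the dict-based tree representation and the empty-set check being done first are my own choices.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(instances, attribute, target):
    labels = [x[target] for x in instances]
    g = entropy(labels)
    for value in set(x[attribute] for x in instances):
        subset = [x[target] for x in instances if x[attribute] == value]
        g -= (len(subset) / len(instances)) * entropy(subset)
    return g

def majority_class(instances, target):
    return Counter(x[target] for x in instances).most_common(1)[0][0]

def id3(parent, instances, attributes, target="play"):
    """Returns a nested dict {'attribute': a, 'branches': {value: subtree}} or a class label."""
    if not instances:                       # S is empty -> most frequent class in S'
        return majority_class(parent, target)
    classes = set(x[target] for x in instances)
    if len(classes) == 1:                   # all instances share one class
        return classes.pop()
    if not attributes:                      # no attributes left -> most frequent class in S
        return majority_class(instances, target)
    best = max(attributes, key=lambda a: gain(instances, a, target))
    node = {"attribute": best, "branches": {}}
    for value in set(x[best] for x in instances):
        subset = [x for x in instances if x[best] == value]
        node["branches"][value] = id3(instances, subset,
                                      [a for a in attributes if a != best], target)
    return node

# Example call (assuming DATA from the earlier sketch):
# tree = id3([], DATA, ["outlook", "temperature", "humidity", "wind"])
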
Hypothesis Search Space
The hypothesis space searched by ID3 is the set of possible decision trees
Hill-climbing (greedy) search guided purely by the information gain measure
• Only one hypothesis is considered for further extension
• No back-tracking to hypotheses dismissed earlier
All (relevant) training examples are used to guide the search
Due to greedy search, ID3 can get stuck in a local optimum

Inductive Bias of ID3
ID3 has a preference for small trees (in particular short trees)
ID3 has a preference for trees with high information gain attributes near the root
Note: a bias is a preference for some hypotheses, rather than a restriction of the hypothesis space
Some form of bias is required in order to generalize beyond the training data

Evaluation
How good is the learned decision tree?
Split the available data into a training set and a test set
Sometimes the data already comes with a pre-defined split
Rule of thumb: use 80% for training and 20% for testing
The test set should be big enough to draw stable conclusions

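A minimal sketch of such an 80/20 split; the shuffling and slicing choices (and the fixed seed) are illustrative, not prescribed by the slides.

import random

def train_test_split(instances, test_fraction=0.2, seed=42):
    """Shuffle the data and hold out the last test_fraction for evaluation."""
    items = list(instances)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]

# train, test = train_test_split(DATA)   # assuming DATA from the earlier sketch
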
Cross-Validation
What if the available data is rather small?
Re-run training and testing on n different portions of the data
Known as n-fold cross-validation
Compute the accuracy over the combined test portions from all folds
This also allows one to report the variation across the different folds
Stratified cross-validation makes sure that the different folds contain the same proportions of class labels

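A sketch of plain n-fold cross-validation for any pair of train/evaluate functions; the round-robin fold assignment and the train_fn/accuracy_fn callables are assumptions for illustration (a stratified variant would assign folds per class instead).

import random

def cross_validate(instances, n_folds, train_fn, accuracy_fn, seed=42):
    """Assign instances to n folds round-robin, then train on n-1 folds and
    evaluate on the held-out fold; returns the per-fold accuracies."""
    items = list(instances)
    random.Random(seed).shuffle(items)
    folds = [items[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)
        accuracies.append(accuracy_fn(model, test))
    return accuracies
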
Occam's Razor
Occam's Razor (OR): Prefer the simplest hypothesis that fits the data
Pro OR: A long hypothesis that fits the data merely describes the data; it does not model the underlying principle that generated the data
Pro OR: A short hypothesis that fits the data is unlikely to do so by coincidence
Con OR: There are numerous ways to define the size of hypotheses

Overfitting
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has a smaller error than h' over the training data, but h' has a smaller error than h on the entire distribution of data
Roughly speaking, a hypothesis h overfits if it does not generalize as well beyond the training data as another hypothesis h'

Overfitting
Reasons for overfitting
• The training set is too small
• The training data is not representative of the real distribution
• The training data contains errors (measurement errors, human annotation errors, . . . )
Overfitting is a significant issue in practical data mining applications
Two approaches to reduce overfitting
• Stop tree growth early (before the tree classifies the training data perfectly)
• Post-prune the tree (after the complete tree has been learned)

Reduced Error Pruning
Split the training set into two sets:
• training subset (approx. 80%)
• validation set (approx. 20%)
A decision tree T is learned from the training subset
Prune the node in T that leads to the highest improvement on the validation set (and repeat for all nodes until accuracy drops)
Pruning a node in a tree substitutes the node and its subtree by the most common class under the node

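A compact sketch of reduced-error pruning over the dict-shaped trees produced by the ID3 sketch above. It is a simplification of the procedure on the slide: a single bottom-up pass that keeps a pruned subtree whenever validation accuracy does not drop, rather than the greedy best-improvement loop; the predict/accuracy helpers and the default class are illustrative assumptions.

from collections import Counter

def predict(tree, instance, default="yes"):
    """Follow branches until a class label (a plain string) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(instance[tree["attribute"]], default)
    return tree

def accuracy(tree, instances, target="play"):
    return sum(predict(tree, x) == x[target] for x in instances) / len(instances)

def reduced_error_prune(tree, train, validation, target="play"):
    """Bottom-up: try replacing each subtree by the most common class of the training
    instances reaching it; keep the change if validation accuracy does not drop.
    (The root itself is never collapsed in this simplified sketch.)"""
    def walk(node, reaching):
        if not isinstance(node, dict):
            return
        for value in list(node["branches"]):
            subset = [x for x in reaching if x[node["attribute"]] == value]
            walk(node["branches"][value], subset)
            if subset and isinstance(node["branches"][value], dict):
                before = accuracy(tree, validation, target)
                original = node["branches"][value]
                node["branches"][value] = Counter(x[target] for x in subset).most_common(1)[0][0]
                if accuracy(tree, validation, target) < before:
                    node["branches"][value] = original   # pruning hurt: undo it
    walk(tree, train)
    return tree
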
Rule Post-Pruning
Instead of pruning entire subtrees, rule post-pruning affects only parts of the decision chain
Convert the tree into a set of rules
• Each path is represented as a rule of the form: if a_1 = v_1 ∧ ... ∧ a_n = v_n then class = c
• For example: if outlook = sunny ∧ humidity = high then play = no
Remove conjuncts in order of improvement on the validation set until there are no further improvements
All paths are pruned independently of each other

Continuous-Valued Attributes
Considering real-valued attributes (like temperature) as discrete values is clearly inappropriate
Use a threshold: if value(a) < c then . . . else . . .
The threshold c can be determined by computing the maximum information gain over different candidate thresholds
Note: numeric attributes can be repeated along the same path

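A sketch of choosing such a threshold for one numeric attribute by scanning candidate split points and keeping the one with the highest information gain; using midpoints between consecutive sorted values as candidates is a common choice not spelled out on the slides.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_threshold(instances, attribute, target):
    """Return (threshold, gain) maximizing information gain for the binary split value(a) < c."""
    values = sorted(set(x[attribute] for x in instances))
    base = entropy([x[target] for x in instances])
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2.0                      # candidate threshold: midpoint
        left = [x[target] for x in instances if x[attribute] < c]
        right = [x[target] for x in instances if x[attribute] >= c]
        g = base - (len(left) / len(instances)) * entropy(left) \
                 - (len(right) / len(instances)) * entropy(right)
        if g > best[1]:
            best = (c, g)
    return best
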
Information Gain Revisited
H(S) = − ∑_{c∈C} p_c log2 p_c
gain(S,A) = H(S) − ∑_{v∈values(A)} (|S_v| / |S|) H(S_v)
Information gain favors attributes with many values over those with few. Why? (An attribute with many values, e.g. a unique day id, splits S into many small, nearly pure subsets, yielding a high gain even though it generalizes poorly.)
Extension: measure how broadly and uniformly the attribute splits the data:
split(S,A) = − ∑_{v∈values(A)} (|S_v| / |S|) log2 (|S_v| / |S|)
split is the entropy of the attribute-value distribution in S

Information Gain Revisited
Information gain and split can be combined:
gain_ratio(S,A) = gain(S,A) / split(S,A)
If |values(A)| = n and A completely determines the class, then split(S,A) = log2 n
If |S_v| ≈ |S| for one v, then split(S,A) becomes small and boosts the gain ratio
Heuristic: compute information gain first, remove attributes with below-average gain, and then select the attribute with the highest gain ratio

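A sketch of split information and gain ratio matching the formulas above, reusing the entropy/gain helpers from the earlier sketches (redefined here so the snippet stands alone).

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(instances, attribute, target):
    labels = [x[target] for x in instances]
    g = entropy(labels)
    for value in set(x[attribute] for x in instances):
        subset = [x[target] for x in instances if x[attribute] == value]
        g -= (len(subset) / len(instances)) * entropy(subset)
    return g

def split_info(instances, attribute):
    """split(S,A): the entropy of the attribute-value distribution in S."""
    return entropy([x[attribute] for x in instances])

def gain_ratio(instances, attribute, target):
    s = split_info(instances, attribute)
    return gain(instances, attribute, target) / s if s > 0 else 0.0
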
Missing Attribute Values
In real-world data it is not unusual that some instances have missing attribute values
How do we compute gain(S,A) for instances with missing values?
Assume instance x with class(x) = c and a(x) = ?
• Take the most frequent value of a over all instances in S (with the same class)
• For numeric attributes, take the average value of a over all instances in S (with the same class)

Missing Attribute Values
Instead of choosing the single most frequent value, use fractional instances
E.g., if p(a(x) = 1 | S) = 0.6 and p(a(x) = 0 | S) = 0.4, then 0.6 (0.4) fractional instances with missing values for a are passed down the a = 1 (a = 0) branch
The entropy computation has to be adapted accordingly

Attributes with Different Costs
In real-world scenarios there can be costs associated with computing the values of attributes (medical tests, computing time, . . . )
Considering costs may favor the use of lower-cost attributes
Suggested measures include:
• gain(S,A) / cost(A)
• (2^gain(S,A) − 1) / (cost(A) + 1)^w
where w ∈ [0,1] is a weight determining the importance of cost

Predicting Continuous Values
So far we have focused on predicting discrete classes (i.e. nominal classification)
What has to change when predicting real values?
The splitting criterion is redefined
• Information gain: gain(S,A) = H(S) − ∑_{v∈values(A)} (|S_v| / |S|) H(S_v)
• Standard deviation reduction (SDR): sdr(S,A) = std_dev(S) − ∑_{v∈values(A)} (|S_v| / |S|) std_dev(S_v)
where std_dev(S) = √( (1/|S|) ∑_{s∈S} (val(s) − avg_val(S))² )

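A sketch of standard deviation reduction for a nominal attribute and a numeric target; using statistics.pstdev (population standard deviation) is one reasonable reading of the formula above, and the function and parameter names are illustrative.

from statistics import pstdev

def sdr(instances, attribute, target):
    """sdr(S,A) = std_dev(S) - sum_v (|S_v| / |S|) * std_dev(S_v) for a numeric target."""
    reduction = pstdev([x[target] for x in instances])
    for value in set(x[attribute] for x in instances):
        subset = [x[target] for x in instances if x[attribute] == value]
        reduction -= (len(subset) / len(instances)) * (pstdev(subset) if len(subset) > 1 else 0.0)
    return reduction
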
Predicting Continuous Values
Stopping criterion in nominal classification: stop when all instances in S have the same class
This is too fine-grained for real-value prediction
Instead, stop when the standard deviation at node n is less than some predefined ratio of the standard deviation of the original instance set:
stop if std_dev(S_n) / std_dev(S_all) < θ, where, e.g., θ = 0.05

Predicting Continuous Values
If we decide not to split further on node n, what should the predicted value be?
Simple solution: the average target value of the instances underneath node n: class(n) = avg_val(S_n)
This approach is used in regression trees
More sophisticated: associate linear regression models with the leaf nodes (model trees)

Model Tree Learning
Suppose we have a leaf node n; regression trees use the average target value of the instances under n
A more fine-grained approach is to apply linear regression to all instances under n:
class(n) = a + b_1 x_1 + b_2 x_2 + ··· + b_m x_m
where x_1, x_2, ..., x_m are the values of the attributes that lead to n in the tree
a and the b_i are estimated just like in linear regression (a fitting sketch follows below)
Problem: not all attributes are numerical!

Converting Nominal Attributes
Assume a nominal attribute such as outlook = {sunny, overcast, rain}
We can convert this into numerical values simply by choosing equi-distant values from a specific interval: outlook = {1, 0.5, 0}
This assumes an intuitive ordering of the values: sunny > overcast > rain
A direct ordering of values is not always possible: city = {london, new york, tokyo}
london > new york > tokyo ???

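A sketch of fitting the linear model at a leaf with a least-squares solve; numpy and the column layout are my own choices, and the attributes are assumed to be numeric already (see the conversion of nominal attributes discussed next).

import numpy as np

def fit_leaf_model(rows, targets):
    """Fit class(n) = a + b_1*x_1 + ... + b_m*x_m for the instances under a leaf.
    rows: list of m-dimensional numeric attribute vectors; targets: list of target values."""
    X = np.hstack([np.ones((len(rows), 1)), np.asarray(rows, dtype=float)])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(X, np.asarray(targets, dtype=float), rcond=None)
    return coef[0], coef[1:]          # a, (b_1, ..., b_m)

def predict_leaf(a, b, x):
    return a + float(np.dot(b, x))
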
Converting Nominal Attributes
Sort the nominal values of attribute A by their average target values
If nominal attribute A has k values, introduce k−1 synthetic binary attributes
The i-th binary attribute checks whether the value is among the first i nominal values in the ordering
For instance, if avg_trg_val(new york) < avg_trg_val(london) < avg_trg_val(tokyo), then the k−1 synthetic binary attributes are: is_new_york and is_new_york_OR_london (a sketch of this conversion follows after the recap)

Recap
Elements of a decision tree
Information gain
ID3 algorithm
Bias of ID3
Overfitting and pruning
Attributes with many values (gain ratio)
Attributes with continuous values
Attributes with missing values
Predicting continuous classes:
• Regression trees
• Model trees

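A sketch of the conversion described above: sort the nominal values by their average target value, then add k−1 binary indicator attributes of the form "value is among the first i in the ordering"; the function and key names are illustrative assumptions.

from collections import defaultdict

def nominal_to_binary(instances, attribute, target):
    """Replace one nominal attribute by k-1 synthetic binary attributes,
    ordered by the average target value of each nominal value."""
    sums = defaultdict(lambda: [0.0, 0])
    for x in instances:
        sums[x[attribute]][0] += x[target]
        sums[x[attribute]][1] += 1
    ordering = sorted(sums, key=lambda v: sums[v][0] / sums[v][1])  # ascending average target value
    converted = []
    for x in instances:
        position = ordering.index(x[attribute])
        new_x = dict(x)
        del new_x[attribute]
        for i in range(1, len(ordering)):            # the k-1 binary attributes
            new_x[f"{attribute}_in_first_{i}"] = int(position < i)
        converted.append(new_x)
    return converted, ordering
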
