
# Dm week02 decision-trees-handout



1. Christof Monz, Informatics Institute, University of Amsterdam. Data Mining, Week 2: Decision Tree Learning.

   Today's Class
   - Decision trees
   - Decision tree learning algorithms
   - Learning bias
   - Overfitting
   - Pruning
   - Extensions to learning with real values
2. Decision Tree Learning
   - The main algorithms were introduced by Quinlan in the 1980s
   - A decision tree is a set of hierarchically nested classification rules
   - Each rule is a node in the tree that investigates a specific attribute
   - Branches correspond to the values of the attributes

   Example Data: when to play tennis? (training data)

   | day | outlook | temperature | humidity | wind | play |
   |-----|----------|-------------|----------|--------|------|
   | 1 | sunny | hot | high | weak | no |
   | 2 | sunny | hot | high | strong | no |
   | 3 | overcast | hot | high | weak | yes |
   | 4 | rain | mild | high | weak | yes |
   | 5 | rain | cool | normal | weak | yes |
   | 6 | rain | cool | normal | strong | no |
   | 7 | overcast | cool | normal | strong | yes |
   | 8 | sunny | mild | high | weak | no |
   | 9 | sunny | cool | normal | weak | yes |
   | 10 | rain | mild | normal | weak | yes |
   | 11 | sunny | mild | normal | strong | yes |
   | 12 | overcast | mild | high | strong | yes |
   | 13 | overcast | hot | normal | weak | yes |
   | 14 | rain | mild | high | strong | no |
3. Decision Tree
   - Nodes check attribute values
   - Leaves are final classifications

   Decision Tree
   - Decision trees can be represented as logical expressions in disjunctive normal form
   - Each path from the root to a leaf labelled yes corresponds to a conjunction of attribute-value equations
   - All such paths of the tree are combined by disjunction

   $$(\mathrm{outlook}=\mathrm{sunny} \wedge \mathrm{humidity}=\mathrm{normal}) \vee (\mathrm{outlook}=\mathrm{overcast}) \vee (\mathrm{outlook}=\mathrm{rain} \wedge \mathrm{wind}=\mathrm{weak})$$
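
The disjunctive normal form above can be read directly as a boolean function. The following is a minimal sketch; the function name `play_tennis` and its string-valued parameters are illustrative assumptions, not part of the slides.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Disjunction of the three root-to-'yes' paths of the tree."""
    return (
        (outlook == "sunny" and humidity == "normal")
        or outlook == "overcast"
        or (outlook == "rain" and wind == "weak")
    )


print(play_tennis("sunny", "high", "weak"))    # day 1 of the training data
print(play_tennis("rain", "normal", "weak"))   # day 5 of the training data
```

Temperature does not appear in any conjunct, so the sketch does not take it as an argument.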
4. Appropriate Problems for DTs
   - Attributes have discrete values (real-value extension discussed later)
   - The class values are discrete (real-value extension discussed later)
   - Training data may contain errors
   - Training data may contain instances with missing/unknown attribute values

   Learning Decision Trees
   - Many different decision trees can be learned for a given training set
   - A number of criteria apply:
     - The tree should be as accurate as possible
     - The tree should be as simple as possible
     - The tree should generalize as well as possible
   - Basic questions:
     - Which attributes should be included in the tree?
     - In which order should they be used in the tree?
   - Standard decision tree learning algorithms: ID3 and C4.5
5. Entropy
   - The better an attribute discriminates the classes in the data, the higher it should be in the tree
   - How do we quantify the degree of discrimination?
   - One way to do this is to use entropy
   - Entropy measures the uncertainty/ambiguity in the data

   $$H(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$

   where $p_\oplus$ / $p_\ominus$ is the probability of a positive/negative class occurring in $S$.

   Entropy
   - In general, the entropy of a subset $S$ of the training examples with respect to the target class is defined as:

   $$H(S) = \sum_{c \in C} -p_c \log_2 p_c$$

   where $C$ is the set of possible classes and $p_c$ is the probability that an instance in $S$ belongs to class $c$.
   - Note: we define $0 \log_2 0 = 0$
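
As a small illustration of the general entropy definition, the sketch below estimates each class probability by its relative frequency in a list of labels. The function name `entropy` and the list-of-labels input are assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(S) = sum over classes c of -p_c * log2(p_c), with p_c estimated by frequency."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Counter only holds classes that actually occur, so 0 * log2(0) never arises.
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


tennis = ["yes"] * 9 + ["no"] * 5      # class distribution of the tennis data
print(round(entropy(tennis), 3))       # 0.94
```

A pure set has entropy 0, and a two-class set split evenly has entropy 1 bit.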
6. Entropy

   [Figure: Entropy]

   Information Gain
   - Information gain is the reduction in entropy

   $$\mathrm{gain}(S,A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

   where $\mathrm{values}(A)$ is the set of possible values of attribute $A$ and $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.
   - $\mathrm{gain}(S,A)$ is the number of bits saved when encoding the class of an arbitrary member of $S$ by knowing the value of attribute $A$
7. Information Gain Example
   - $S = [9+, 5-]$
   - $\mathrm{values}(\mathrm{wind}) = \{\mathrm{weak}, \mathrm{strong}\}$
   - $S_{\mathrm{weak}} = [6+, 2-]$
   - $S_{\mathrm{strong}} = [3+, 3-]$

   $$\begin{aligned}
   \mathrm{gain}(S,\mathrm{wind}) &= H(S) - \sum_{v \in \{\mathrm{weak},\mathrm{strong}\}} \frac{|S_v|}{|S|} H(S_v) \\
   &= H(S) - \tfrac{8}{14} H(S_{\mathrm{weak}}) - \tfrac{6}{14} H(S_{\mathrm{strong}}) \\
   &= 0.94 - \tfrac{8}{14} \cdot 0.811 - \tfrac{6}{14} \cdot 1.0 = 0.048
   \end{aligned}$$

   [Figure: Comparing Information Gains]
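
The worked example can be checked from the class counts alone. This is a minimal sketch; `entropy_from_counts` and `gain_from_counts` are hypothetical helper names, and each subset is given as a `[positives, negatives]` pair.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy of a set described by its per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def gain_from_counts(parent, children):
    """gain = H(parent) - sum over children of |S_v|/|S| * H(S_v)."""
    total = sum(parent)
    remainder = sum(sum(ch) / total * entropy_from_counts(ch) for ch in children)
    return entropy_from_counts(parent) - remainder


wind_gain = gain_from_counts([9, 5], [[6, 2], [3, 3]])
print(round(wind_gain, 3))   # 0.048
```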
8. ID3 DT Learning
   - For each node in the tree, the ID3 algorithm computes the information gain of each attribute and chooses the attribute with the highest gain
   - For instance, at the root ($S$) the gains are:
     - gain(S, outlook) = 0.246
     - gain(S, humidity) = 0.151
     - gain(S, wind) = 0.048
     - gain(S, temperature) = 0.029
   - Hence outlook is chosen for the top node
   - ID3 then iteratively selects the attribute with the highest gain for each daughter of the previous node, ...

   [Figure: ID3 DT Learning]
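
The root gains can be recomputed from the 14 training examples. The sketch below is one way to do so; the row encoding and the names `ROWS`, `entropy` and `gain` are assumptions for illustration.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "wind"]
RAW = """sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no"""
# One dict per training example, days 1 to 14 in order.
ROWS = [dict(zip(ATTRS + ["play"], line.split())) for line in RAW.splitlines()]


def entropy(rows):
    counts = Counter(r["play"] for r in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attr):
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


gains = {a: gain(ROWS, a) for a in ATTRS}
best = max(gains, key=gains.get)
for attr, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attr:12s}{g:.4f}")
```

Rounded to three decimals, the computed gains for outlook and humidity come out as 0.247 and 0.152; the slide values 0.246 and 0.151 appear to be truncated rather than rounded.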
9. ID3 Algorithm

       ID3(S', S, node, attr)
         if (S is empty)
           return leaf node with most frequent class in S'
         else if (for all s in S: class(s) = c)
           return leaf node with class c
         else if (attr is empty)
           return leaf node with most frequent class in S
         else
           a = argmax (a' ∈ attr) gain(S, a')
           attribute(node) = a
           for each v ∈ values(a)
             new(node_v)
             new edge(node, node_v); label(node, node_v) = v
             ID3(S, S_v, node_v, attr − {a})

   Initial call: ID3(∅, S, root, A)

   [Figure: Hypothesis Search Space]
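
The pseudocode can be turned into a short recursive function. The following is a sketch under several assumptions: the tree is returned as nested dictionaries instead of being attached to a `node` argument, the empty-set test is performed first, and the names `id3`, `majority` and `DOMAINS` are invented for this example.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "wind"]
RAW = """sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no"""
ROWS = [dict(zip(ATTRS + ["play"], line.split())) for line in RAW.splitlines()]


def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attr, target):
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder


def majority(rows, target):
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def id3(parent_rows, rows, attrs, target, domains):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    if not rows:                      # S is empty: fall back to the parent's majority class
        return majority(parent_rows, target)
    classes = {r[target] for r in rows}
    if len(classes) == 1:             # all instances share one class
        return classes.pop()
    if not attrs:                     # no attributes left to test
        return majority(rows, target)
    best = max(attrs, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    return {best: {v: id3(rows, [r for r in rows if r[best] == v],
                          remaining, target, domains)
                   for v in domains[best]}}


# values(a) is taken from the full training set, so empty branches are possible.
DOMAINS = {a: sorted({r[a] for r in ROWS}) for a in ATTRS}
TREE = id3([], ROWS, ATTRS, "play", DOMAINS)
print(TREE)
```

On the tennis data the resulting tree tests outlook at the root, humidity under sunny and wind under rain, which matches the disjunctive normal form given earlier in the handout.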
10. Hypothesis Search Space
    - The hypothesis space searched by ID3 is the set of possible decision trees
    - Hill-climbing (greedy) search guided purely by the information gain measure
      - Only one hypothesis is considered for further extension
      - No back-tracking to hypotheses dismissed earlier
    - All (relevant) training examples are used to guide the search
    - Due to greedy search, ID3 can get stuck in a local optimum

    Inductive Bias of ID3
    - ID3 has a preference for small trees (in particular short trees)
    - ID3 has a preference for trees with high information gain attributes near the root
    - Note: a bias is a preference for some hypotheses, rather than a restriction of the hypothesis space
    - Some form of bias is required in order to generalize beyond the training data
11. Evaluation
    - How good is the learned decision tree?
    - Split the available data into a training set and a test set
    - Sometimes data already comes with a pre-defined split
    - Rule of thumb: use 80% for training and 20% for testing
    - The test set should be big enough to draw stable conclusions

    [Figure: Evaluation]
12. Cross-Validation
    - What if the available data set is rather small?
    - Re-run training and testing on n different portions of the data
    - Known as n-fold cross-validation
    - Compute the accuracy over the combined test portions of all folds
    - This also allows one to report the variation across different folds
    - Stratified cross-validation makes sure that the different folds contain the same proportions of class labels

    [Figure: Cross-Validation]
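
Stratified n-fold cross-validation can be sketched in a few lines. Everything named here is an assumption for illustration: `stratified_folds` deals the instances of each class out round-robin, `cross_validate` takes any `train` and `predict` callables, and the majority-class learner is only a stand-in for a real decision tree learner. The sketch assumes there are at least as many instances as folds.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds):
    """Assign instance indices to folds so every class is spread evenly."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        for i in indices:
            folds[position % n_folds].append(i)
            position += 1
    return folds


def cross_validate(xs, ys, n_folds, train, predict):
    """Return (accuracy over all combined test portions, per-fold accuracies)."""
    correct, fold_accuracies = 0, []
    for test_idx in stratified_folds(ys, n_folds):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train(train_x, train_y)
        hits = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        correct += hits
        fold_accuracies.append(hits / len(test_idx))
    return correct / len(ys), fold_accuracies


def train_majority(train_x, train_y):
    """Stand-in learner: remember the most frequent class."""
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


xs = list(range(12))
ys = ["yes"] * 8 + ["no"] * 4
overall, per_fold = cross_validate(xs, ys, 4, train_majority, predict_majority)
print(overall, per_fold)
```

Reporting both numbers mirrors the slide: the combined accuracy summarizes all test portions, and the per-fold list shows the variation across folds.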
13. Occam's Razor
    - Occam's Razor (OR): Prefer the simplest hypothesis that fits the data
    - Pro OR: A long hypothesis that fits the data merely describes the data; it does not model the underlying principle that generated the data
    - Pro OR: A short hypothesis that fits the data is unlikely to be a coincidence
    - Con OR: There are numerous ways to define the size of hypotheses

    Overfitting
    - Definition: Given a hypothesis space $H$, a hypothesis $h \in H$ is said to overfit the training data if there exists some alternative hypothesis $h' \in H$, such that $h$ has a smaller error than $h'$ over the training data, but $h'$ has a smaller error than $h$ on the entire distribution of data.
    - Roughly speaking, a hypothesis $h$ overfits if it does not generalize as well beyond the training data as another hypothesis $h'$
14. Overfitting

    [Figure: Overfitting]

    Overfitting
    - Reasons for overfitting:
      - The training set is too small
      - The training data is not representative of the real distribution
      - The training data contains errors (measurement errors, human annotation errors, ...)
    - Overfitting is a significant issue in practical data mining applications
    - Two approaches to reduce overfitting:
      - Stopping tree growth early (before the tree classifies the data perfectly)
      - Post-pruning of the tree (after the complete tree has been learned)
15. Reduced Error Pruning
    - Split the training set into two sets:
      - training subset (approx. 80%)
      - validation set (approx. 20%)
    - A decision tree T is learned from the training subset
    - Prune the node in T that leads to the highest improvement on the validation set (and repeat for all nodes until accuracy drops)
    - Pruning a node in a tree substitutes the node and its subtree by the most common class under the node

    [Figure: Effect of Pruning]
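
The pruning loop can be sketched as follows. This is an illustration under stated assumptions, not the handout's implementation: each internal node is a dictionary with hypothetical keys `attr`, `children` and `majority` (the most common training class under the node, stored when the tree was learned), leaves are plain class labels, and a prune is accepted as long as validation accuracy does not drop.

```python
import copy


def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["majority"])
    return tree


def accuracy(tree, rows, target):
    return sum(classify(tree, r) == r[target] for r in rows) / len(rows)


def internal_paths(tree, path=()):
    """Yield the branch-value path leading to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from internal_paths(child, path + (value,))


def pruned_at(tree, path):
    """Copy of the tree with the node at `path` replaced by its majority class."""
    if not path:
        return tree["majority"]
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for value in path[:-1]:
        node = node["children"][value]
    node["children"][path[-1]] = node["children"][path[-1]]["majority"]
    return new_tree


def reduced_error_prune(tree, validation, target):
    best_acc = accuracy(tree, validation, target)
    while isinstance(tree, dict):
        candidates = [pruned_at(tree, p) for p in internal_paths(tree)]
        scored = [(accuracy(c, validation, target), c) for c in candidates]
        acc, candidate = max(scored, key=lambda pair: pair[0])
        if acc < best_acc:            # every possible prune would hurt: stop
            break
        tree, best_acc = candidate, acc
    return tree


tree = {"attr": "outlook", "majority": "yes", "children": {
    "sunny": {"attr": "humidity", "majority": "no",
              "children": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rain": {"attr": "wind", "majority": "yes",
             "children": {"weak": "yes", "strong": "no"}},
}}
validation = [  # made-up validation instances
    {"outlook": "rain", "humidity": "high", "wind": "strong", "play": "yes"},
    {"outlook": "rain", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "rain", "humidity": "high", "wind": "weak", "play": "yes"},
    {"outlook": "sunny", "humidity": "high", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "wind": "weak", "play": "yes"},
    {"outlook": "overcast", "humidity": "high", "wind": "weak", "play": "yes"},
]
pruned = reduced_error_prune(tree, validation, "play")
print(pruned)
```

In this made-up example only the wind test under rain is replaced by a leaf, because the validation instances contradict it; the humidity test survives since removing it would lower validation accuracy.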
16. Rule Post-Pruning
    - Instead of pruning entire subtrees, rule post-pruning affects only parts of the decision chain
    - Convert the tree into a set of rules
      - Each path is represented as a rule of the form: If $a_1 = v_1 \wedge \ldots \wedge a_n = v_n$ then class = c
      - For example: If outlook = sunny ∧ humidity = high then play = no
    - Remove conjuncts in order of improvement on the validation set until there are no further improvements
    - All paths are pruned independently of each other

    Continuous-Valued Attributes
    - Treating real-valued attributes (like temperature) as discrete values is clearly inappropriate
    - Use a threshold: if (value(a) < c) then ... else ...
    - The threshold c can be determined by computing the information gain for different candidate thresholds and choosing the one with the maximum gain
    - Note: Numeric attributes can be repeated along the same path
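
Threshold selection for a numeric attribute can be sketched as below. Two assumptions are made that the slides do not specify: candidate thresholds are the midpoints between consecutive distinct values, and the temperature readings are made-up illustration data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (c, gain) for the split value(a) < c with the highest information gain."""
    pairs = sorted(zip(values, labels))
    distinct = sorted({v for v, _ in pairs})
    base = entropy(labels)
    best_c, best_gain = None, -1.0
    for low, high in zip(distinct, distinct[1:]):
        c = (low + high) / 2                      # candidate threshold
        left = [y for v, y in pairs if v < c]
        right = [y for v, y in pairs if v >= c]
        g = (base
             - len(left) / len(pairs) * entropy(left)
             - len(right) / len(pairs) * entropy(right))
        if g > best_gain:
            best_c, best_gain = c, g
    return best_c, best_gain


temperatures = [40, 48, 60, 72, 80, 90]           # hypothetical readings
play = ["no", "no", "yes", "yes", "yes", "no"]
threshold, threshold_gain = best_threshold(temperatures, play)
print(threshold, round(threshold_gain, 3))        # 54.0 0.459
```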
17. Information Gain Revisited

    $$H(S) = \sum_{c \in C} -p_c \log_2 p_c$$

    $$\mathrm{gain}(S,A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

    - Information gain favors attributes with many values over those with few. Why?
    - Extension: Measure how broadly and uniformly the attribute splits the data:

    $$\mathrm{split}(S,A) = -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

    - split is the entropy of the attribute-value distribution in $S$

    Information Gain Revisited
    - Information gain and split can be combined:

    $$\mathrm{gain\ ratio}(S,A) = \frac{\mathrm{gain}(S,A)}{\mathrm{split}(S,A)}$$

    - If $|\mathrm{values}(A)| = n$ and $A$ splits $S$ into $n$ equally sized subsets (as when $A$ separates $n$ examples completely), then $\mathrm{split}(S,A) = \log_2 n$
    - If $|S_v| \approx |S|$ for one $v$, then $\mathrm{split}(S,A)$ becomes small and boosts the gain ratio
    - Heuristic: Compute information gain first, remove attributes with below-average gain and then select the attribute with the highest gain ratio
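
The gain ratio can be computed from class counts per attribute value. The sketch below uses the counts of the tennis data; the helper names and the extra `day` attribute (one value per example) are assumptions added for illustration. An attribute with a single value has split 0, which this sketch does not guard against.

```python
from math import log2


def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def gain(partition):
    """partition: one [positives, negatives] pair per attribute value."""
    parent = [sum(column) for column in zip(*partition)]
    total = sum(parent)
    return entropy(parent) - sum(sum(p) / total * entropy(p) for p in partition)


def split_info(partition):
    """Entropy of the attribute-value distribution itself."""
    return entropy([sum(p) for p in partition])


def gain_ratio(partition):
    return gain(partition) / split_info(partition)


PARTITIONS = {
    "outlook": [[2, 3], [4, 0], [3, 2]],          # sunny, overcast, rain
    "temperature": [[2, 2], [4, 2], [3, 1]],      # hot, mild, cool
    "humidity": [[3, 4], [6, 1]],                 # high, normal
    "wind": [[6, 2], [3, 3]],                     # weak, strong
    "day": [[1, 0]] * 9 + [[0, 1]] * 5,           # 14 values, one example each
}
for name, partition in PARTITIONS.items():
    print(f"{name:12s} gain={gain(partition):.3f} "
          f"split={split_info(partition):.3f} ratio={gain_ratio(partition):.3f}")
```

For `day`, split equals log2 14, so its gain is divided by roughly 3.81, whereas the gain of outlook is divided by roughly 1.58.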
18. Missing Attribute Values
    - In real-world data it is not unusual that some instances have missing attribute values
    - How to compute gain(S,A) for instances with missing values?
    - Assume an instance x with class(x) = c and a(x) = ?
      - Take the most frequent value of a among all instances in S (with the same class)
      - For numeric attributes, take the average value of a over all instances in S (with the same class)

    Missing Attribute Values
    - Instead of choosing the single most frequent value, use fractional instances
    - E.g., if p(a(x) = 1 | S) = 0.6 and p(a(x) = 0 | S) = 0.4, then an instance with a missing value for a is passed down the a = 1 branch with weight 0.6 and down the a = 0 branch with weight 0.4
    - The entropy computation has to be adapted accordingly
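
Fractional instances can be sketched by attaching a weight to every instance. The representation used here, a list of `(row, weight)` pairs with `None` marking a missing value, and all function names are assumptions made for this example.

```python
from collections import defaultdict
from math import log2


def split_with_missing(instances, attr):
    """Distribute (row, weight) pairs over branches; missing values are split fractionally."""
    value_weight = defaultdict(float)
    for row, w in instances:
        if row[attr] is not None:
            value_weight[row[attr]] += w
    total_known = sum(value_weight.values())
    branches = defaultdict(list)
    for row, w in instances:
        if row[attr] is not None:
            branches[row[attr]].append((row, w))
        else:
            for value, vw in value_weight.items():
                branches[value].append((row, w * vw / total_known))
    return dict(branches)


def weighted_entropy(branch, target):
    """Entropy where class probabilities are ratios of summed weights."""
    class_weight = defaultdict(float)
    for row, w in branch:
        class_weight[row[target]] += w
    total = sum(class_weight.values())
    return sum(-(w / total) * log2(w / total) for w in class_weight.values() if w > 0)


rows = [
    {"a": 1, "class": "yes"}, {"a": 1, "class": "yes"}, {"a": 1, "class": "no"},
    {"a": 0, "class": "no"}, {"a": 0, "class": "no"},
    {"a": None, "class": "yes"},                  # instance with a missing value
]
branches = split_with_missing([(r, 1.0) for r in rows], "a")
for value, branch in branches.items():
    weight = sum(w for _, w in branch)
    print(value, round(weight, 2), round(weighted_entropy(branch, "class"), 3))
```

With three known instances at a = 1 and two at a = 0, the estimated probabilities are 0.6 and 0.4, so the incomplete instance contributes weight 0.6 to one branch and 0.4 to the other, as in the slide's example.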
19. Attributes with Different Costs
    - In real-world scenarios there can be costs associated with computing the values of attributes (medical tests, computing time, ...)
    - Considering costs might favor the use of lower-cost attributes
    - Suggested measures include:
      - $\dfrac{\mathrm{gain}(S,A)}{\mathrm{cost}(A)}$
      - $\dfrac{2^{\mathrm{gain}(S,A)} - 1}{(\mathrm{cost}(A) + 1)^w}$, where $w \in [0,1]$ is a weight determining the importance of cost

    Predicting Continuous Values
    - So far we have focused on predicting discrete classes (i.e. nominal classification)
    - What has to change when predicting real values?
    - The splitting criterion is redefined
      - Information gain:

    $$\mathrm{gain}(S,A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

      - Standard deviation reduction (SDR):

    $$\mathrm{sdr}(S,A) = \mathrm{std\ dev}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, \mathrm{std\ dev}(S_v)$$

    where

    $$\mathrm{std\ dev}(S) = \sqrt{\sum_{s \in S} \frac{1}{|S|} \bigl(\mathrm{val}(s) - \mathrm{avg\ val}(S)\bigr)^2}$$
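
Standard deviation reduction follows the same pattern as information gain, with the standard deviation of the target values taking the place of entropy. The sketch below uses the population standard deviation (division by |S|); the function names and the tiny data set are made up for illustration.

```python
from math import sqrt


def std_dev(values):
    """Population standard deviation of a list of target values."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def sdr(rows, attr, target):
    """std_dev(S) minus the size-weighted standard deviations of the subsets S_v."""
    total = std_dev([r[target] for r in rows])
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        total -= len(subset) / len(rows) * std_dev(subset)
    return total


rows = [  # hypothetical data: minutes played per day
    {"outlook": "sunny", "minutes": 10.0},
    {"outlook": "sunny", "minutes": 12.0},
    {"outlook": "rain", "minutes": 30.0},
    {"outlook": "rain", "minutes": 32.0},
]
reduction = sdr(rows, "outlook", "minutes")
print(round(reduction, 3))
```

Here the overall standard deviation is the square root of 101 and each subset has standard deviation 1, so splitting on outlook removes almost all of the spread.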
20. Predicting Continuous Values
    - Stopping criterion in nominal classification: stop when all instances in S have the same class
    - This is too fine-grained for real-value prediction
    - Stop when the standard deviation at node n is less than some predefined ratio of the standard deviation of the original instance set:

    $$\text{Stop if } \frac{\mathrm{std\ dev}(S_n)}{\mathrm{std\ dev}(S_{all})} < \theta, \text{ where, e.g., } \theta = 0.05$$

    Predicting Continuous Values
    - If we decide not to split further on node n, what should the predicted value be?
    - Simple solution: the average target value of the instances underneath node n: $\mathrm{class}(n) = \mathrm{avg\ val}(S_n)$
    - This approach is used in regression trees
    - More sophisticated: associate linear regression models with all leaf nodes (model trees)
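
The stopping rule and the regression-tree leaf value are short enough to sketch directly. The names `should_stop` and `leaf_prediction` and the sample numbers are assumptions; a zero overall standard deviation is treated as "stop" to avoid dividing by zero.

```python
from math import sqrt


def std_dev(values):
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def should_stop(node_values, all_values, theta=0.05):
    """Stop splitting when std_dev(S_n) / std_dev(S_all) falls below theta."""
    overall = std_dev(all_values)
    if overall == 0:
        return True
    return std_dev(node_values) / overall < theta


def leaf_prediction(node_values):
    """Regression-tree leaf: the average target value under the node."""
    return sum(node_values) / len(node_values)


all_values = [10.0, 12.0, 30.0, 32.0]             # hypothetical target values
print(should_stop([10.0, 12.0], all_values))      # ratio is about 0.1
print(should_stop([11.0, 11.2], all_values))      # ratio is about 0.01
print(leaf_prediction([10.0, 12.0]))
```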
21. Model Tree Learning
    - Suppose we have a leaf node n; regression trees use the average target value of the instances under n
    - A more fine-grained approach is to apply linear regression to all instances under n:

    $$\mathrm{class}(n) = a + b_1 x_1 + b_2 x_2 + \cdots + b_m x_m$$

    where $x_1, x_2, \ldots, x_m$ are the values of the attributes that lead to n in the tree
    - $a$ and $b_i$ are estimated just like in linear regression
    - Problem: Not all attributes are numerical!

    Converting Nominal Attributes
    - Assume a nominal attribute such as outlook = {sunny, overcast, rain}
    - We can convert this into numerical values simply by choosing equidistant values from a specific interval: outlook = {1, 0.5, 0}
    - This assumes an intuitive ordering of the values: sunny > overcast > rain
    - A direct ordering of values is not always possible: city = {london, new york, tokyo}; london > new york > tokyo ???
22. Converting Nominal Attributes
    - Sort the nominal values of attribute A by their average target values
    - Introduce k − 1 synthetic binary attributes if the nominal attribute A has k values
    - The ith binary attribute checks whether one of the first i nominal values in the ordering holds
    - For instance, if avg_trg_val(new york) < avg_trg_val(london) < avg_trg_val(tokyo), then the k − 1 synthetic binary attributes are: is_new_york and is_new_york_OR_london

    Recap
    - Elements of a decision tree
    - Information gain
    - ID3 algorithm
    - Bias of ID3
    - Overfitting and pruning
    - Attributes with many values (gain ratio)
    - Attributes with continuous values
    - Attributes with missing values
    - Predicting continuous classes:
      - Regression trees
      - Model trees
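
The conversion of a nominal attribute into k − 1 binary attributes can be sketched as follows. The function name `make_binary_encoder`, the returned `encode` helper and the city data are hypothetical; the cumulative encoding follows the is_new_york / is_new_york_OR_london example from the slide.

```python
from collections import defaultdict


def make_binary_encoder(rows, attr, target):
    """Order the values of attr by average target value and return (ordering, encode)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row[attr]] += row[target]
        counts[row[attr]] += 1
    ordering = sorted(sums, key=lambda v: sums[v] / counts[v])

    def encode(value):
        # The i-th synthetic attribute is 1 if value is among the first i+1 values.
        position = ordering.index(value)
        return [1 if position <= i else 0 for i in range(len(ordering) - 1)]

    return ordering, encode


rows = [  # made-up target values per city
    {"city": "london", "target": 2.0},
    {"city": "new york", "target": 1.0},
    {"city": "tokyo", "target": 3.0},
    {"city": "london", "target": 2.2},
]
ordering, encode = make_binary_encoder(rows, "city", "target")
print(ordering)
for city in ordering:
    print(city, encode(city))
```

With three city values the encoder produces two binary attributes, so new york maps to [1, 1], london to [0, 1] and tokyo to [0, 0].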