Dm week02 decision-trees-handout — Document Transcript

Data Mining - Week 2: Decision Tree Learning
Christof Monz, Informatics Institute, University of Amsterdam

Today's Class
• Decision trees
• Decision tree learning algorithms
• Learning bias
• Overfitting
• Pruning
• Extensions to learning with real values
Decision Tree Learning
• Main algorithms introduced by Quinlan in the 1980s
• A decision tree is a set of hierarchically nested classification rules
• Each rule is a node in the tree that investigates a specific attribute
• Branches correspond to the values of the attribute

Example Data
When to play tennis? (training data)

day  outlook   temperature  humidity  wind    play
1    sunny     hot          high      weak    no
2    sunny     hot          high      strong  no
3    overcast  hot          high      weak    yes
4    rain      mild         high      weak    yes
5    rain      cool         normal    weak    yes
6    rain      cool         normal    strong  no
7    overcast  cool         normal    strong  yes
8    sunny     mild         high      weak    no
9    sunny     cool         normal    weak    yes
10   rain      mild         normal    weak    yes
11   sunny     mild         normal    strong  yes
12   overcast  mild         high      strong  yes
13   overcast  hot          normal    weak    yes
14   rain      mild         high      strong  no
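
For reference in the sketches further down, here is the same training data as a small Python structure; the field names mirror the column headers above, and the choice of a list of dicts is mine, not the handout's.

# "Play tennis?" training data from the table above.
# Each instance is a dict of attribute values plus the target class "play".
ATTRIBUTES = ["outlook", "temperature", "humidity", "wind"]

DATA = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "wind": "strong", "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "wind": "strong", "play": "no"},
]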
Decision Tree
• Nodes check attribute values
• Leaves are final classifications

Decision Tree
• Decision trees can be represented as logical expressions in disjunctive normal form
• Each path from the root corresponds to a conjunction of attribute-value equations
• All paths of the tree are combined by disjunction:
  (outlook = sunny ∧ humidity = normal) ∨ (outlook = overcast) ∨ (outlook = rain ∧ wind = weak)
Appropriate Problems for DTs
• Attributes have discrete values (real-value extension discussed later)
• The class values are discrete (real-value extension discussed later)
• Training data may contain errors
• Training data may contain instances with missing/unknown attribute values

Learning Decision Trees
• Many different decision trees can be learned for a given training set
• A number of criteria apply:
  - The tree should be as accurate as possible
  - The tree should be as simple as possible
  - The tree should generalize as well as possible
• Basic questions:
  - Which attributes should be included in the tree?
  - In which order should they be used in the tree?
• Standard decision tree learning algorithms: ID3 and C4.5
Entropy
• The better an attribute discriminates the classes in the data, the higher it should be in the tree
• How do we quantify the degree of discrimination?
• One way to do this is to use entropy
• Entropy measures the uncertainty/ambiguity in the data:
  H(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
  where p⊕ / p⊖ is the probability of a positive/negative class occurring in S

Entropy
• In general, the entropy of a subset S of the training examples with respect to the target class is defined as:
  H(S) = − ∑_{c ∈ C} p_c log2 p_c
  where C is the set of possible classes and p_c is the probability that an instance in S belongs to class c
• Note: we define 0 log2 0 = 0
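
A minimal Python sketch of the entropy definition above (it assumes the DATA list introduced after the example table):

from collections import Counter
from math import log2

def entropy(instances, target="play"):
    """H(S) = -sum_c p_c log2 p_c over the class distribution of S."""
    counts = Counter(inst[target] for inst in instances)
    total = sum(counts.values())
    h = 0.0
    for count in counts.values():
        p = count / total
        h -= p * log2(p)          # 0 log2 0 never occurs: observed counts are > 0
    return h

print(round(entropy(DATA), 3))    # 0.94 for the 9+/5- tennis data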
Entropy
[figure]

Information Gain
• Information gain is the reduction in entropy:
  gain(S, A) = H(S) − ∑_{v ∈ values(A)} (|S_v| / |S|) H(S_v)
  where values(A) is the set of possible values of attribute A and S_v is the subset of S for which attribute A has value v
• gain(S, A) is the number of bits saved when encoding an arbitrary member of S by knowing the value of attribute A
Information Gain Example
• S = [9+, 5−]
• values(wind) = {weak, strong}
• S_weak = [6+, 2−], S_strong = [3+, 3−]
  gain(S, wind) = H(S) − ∑_{v ∈ {weak, strong}} (|S_v| / |S|) H(S_v)
                = H(S) − (8/14) H(S_weak) − (6/14) H(S_strong)
                = 0.94 − (8/14) 0.811 − (6/14) 1.0
                = 0.048

Comparing Information Gains
[figure]
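
The same computation in Python, reusing the entropy helper above; the 0.048 and 0.246 values noted in the comments are the ones given on the slides.

def information_gain(instances, attribute, target="play"):
    """gain(S, A) = H(S) - sum_v (|S_v|/|S|) * H(S_v)."""
    total = len(instances)
    values = {inst[attribute] for inst in instances}
    remainder = 0.0
    for v in values:
        subset = [inst for inst in instances if inst[attribute] == v]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(instances, target) - remainder

print(round(information_gain(DATA, "wind"), 3))     # 0.048
print(round(information_gain(DATA, "outlook"), 3))  # 0.246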
ID3 DT Learning
• The ID3 algorithm computes the information gain for each node in the tree and each attribute, and chooses the attribute with the highest gain
• For instance, at the root (S) the gains are:
  gain(S, outlook) = 0.246
  gain(S, humidity) = 0.151
  gain(S, wind) = 0.048
  gain(S, temperature) = 0.029
• Hence outlook is chosen for the top node
• ID3 then iteratively selects the attribute with the highest gain for each daughter of the previous node, . . .

ID3 DT Learning
[figure]
ID3 Algorithm
  ID3(S', S, node, attr)
    if (for all s in S: class(s) = c)
      return leaf node with class c
    else if (attr is empty)
      return leaf node with most frequent class in S
    else if (S is empty)
      return leaf node with most frequent class in S'
    else
      a = argmax_{a' ∈ attr} gain(S, a')
      attribute(node) = a
      for each v ∈ values(a)
        new(node_v); new edge(node, node_v); label(node, node_v) = v
        ID3(S, S_v, node_v, attr − {a})
  Initial call: ID3(∅, S, root, A)

Hypothesis Search Space
[figure]
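
A compact, runnable Python sketch of the ID3 recursion above, reusing the information_gain helper; the nested-dict tree representation and the id3/classify/majority_class names are my own choices for illustration, not part of the handout.

from collections import Counter

def majority_class(instances, target="play"):
    return Counter(inst[target] for inst in instances).most_common(1)[0][0]

def id3(parent, instances, attributes, target="play"):
    """Returns either a class label (leaf) or a node {attribute: {value: subtree}}."""
    if not instances:                                  # S is empty: use the parent's majority class
        return majority_class(parent, target)
    classes = {inst[target] for inst in instances}
    if len(classes) == 1:                              # all instances share one class
        return classes.pop()
    if not attributes:                                 # no attributes left to split on
        return majority_class(instances, target)
    best = max(attributes, key=lambda a: information_gain(instances, a, target))
    node = {best: {}}
    for v in {inst[best] for inst in instances}:
        subset = [inst for inst in instances if inst[best] == v]
        node[best][v] = id3(instances, subset,
                            [a for a in attributes if a != best], target)
    return node

def classify(tree, instance):
    """Walk from the root to a leaf by following the instance's attribute values."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][instance[attribute]]
    return tree

tree = id3([], DATA, ATTRIBUTES)
print(tree)                         # outlook is chosen at the root, as on the slides
print(classify(tree, DATA[0]))      # "no"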
Hypothesis Search Space
• The hypothesis space searched by ID3 is the set of possible decision trees
• Hill-climbing (greedy) search guided purely by the information gain measure:
  - Only one hypothesis is considered for further extension
  - No back-tracking to hypotheses dismissed earlier
• All (relevant) training examples are used to guide the search
• Due to greedy search, ID3 can get stuck in a local optimum

Inductive Bias of ID3
• ID3 has a preference for small trees (in particular short trees)
• ID3 has a preference for trees with high information gain attributes near the root
• Note: a bias is a preference for some hypotheses, rather than a restriction of the hypothesis space
• Some form of bias is required in order to generalize beyond the training data
Evaluation
• How good is the learned decision tree?
• Split the available data into a training set and a test set
• Sometimes the data already comes with a pre-defined split
• Rule of thumb: use 80% for training and 20% for testing
• The test set should be big enough to draw stable conclusions

Evaluation
[figure]
Cross-Validation
• What if the available data is rather small?
• Re-run training and testing on n different portions of the data
• Known as n-fold cross-validation
• Compute the accuracies on the test portions of the individual folds and combine them
• This also allows one to report the variation across folds
• Stratified cross-validation ensures that the different folds contain the same proportions of class labels

Cross-Validation
[figure]
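
A hedged sketch of plain n-fold cross-validation around the ID3 sketch above; the fold count, the shuffling seed, and the fallback to the training majority class for unseen attribute values are assumptions of mine, not prescribed by the handout.

import random

def cross_validate(instances, attributes, n_folds=7, target="play", seed=0):
    """Plain n-fold cross-validation; returns the per-fold accuracies."""
    data = instances[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i, test in enumerate(folds):
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        tree = id3([], train, attributes, target)
        correct = 0
        for inst in test:
            try:
                prediction = classify(tree, inst)
            except KeyError:                 # attribute value unseen during training
                prediction = majority_class(train, target)
            correct += prediction == inst[target]
        accuracies.append(correct / len(test))
    return accuracies

print(cross_validate(DATA, ATTRIBUTES))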
Occam's Razor
• Occam's Razor (OR): prefer the simplest hypothesis that fits the data
• Pro OR: a long hypothesis that fits the data merely describes the data; it does not model the underlying principle that generated the data
• Pro OR: a short hypothesis that fits the data is unlikely to be a coincidence
• Con OR: there are numerous ways to define the size of hypotheses

Overfitting
• Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H, such that h has a smaller error than h′ over the training data, but h′ has a smaller error than h on the entire distribution of data.
• Roughly speaking, a hypothesis h overfits if it does not generalize as well beyond the training data as another hypothesis h′
Overfitting
[figure]

Overfitting
• Reasons for overfitting:
  - The training set is too small
  - The training data is not representative of the real distribution
  - The training data contains errors (measurement errors, human annotation errors, . . . )
• Overfitting is a significant issue in practical data mining applications
• Two approaches to reduce overfitting:
  - Stop tree growth early (before it classifies the training data perfectly)
  - Post-prune the tree (after the complete tree has been learned)
Reduced Error Pruning
• Split the training set into two sets:
  - training subset (approx. 80%)
  - validation set (approx. 20%)
• A decision tree T is learned from the training subset
• Prune the node in T that leads to the highest improvement on the validation set (and repeat for all nodes until accuracy drops)
• Pruning a node in a tree substitutes the node and its subtree by the most common class under the node

Effect of Pruning
[figure]
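
A sketch of reduced error pruning for the nested-dict trees used above. It simplifies the slide's procedure: instead of repeatedly pruning the globally best node, it works bottom-up and prunes a subtree whenever doing so does not hurt accuracy on the validation instances that reach it. All function names here are mine.

def accuracy(tree, instances, fallback, target="play"):
    """Fraction of instances classified correctly; unseen branch values fall back."""
    if not instances:
        return 1.0
    correct = 0
    for inst in instances:
        try:
            prediction = classify(tree, inst)
        except KeyError:
            prediction = fallback
        correct += prediction == inst[target]
    return correct / len(instances)

def prune(tree, train, validation, target="play"):
    """Bottom-up reduced-error pruning: replace a subtree by the most common
    class under it whenever validation accuracy at the node does not drop."""
    if not isinstance(tree, dict) or not train:
        return tree
    attribute = next(iter(tree))
    for v, subtree in tree[attribute].items():
        tree[attribute][v] = prune(subtree,
                                   [i for i in train if i[attribute] == v],
                                   [i for i in validation if i[attribute] == v],
                                   target)
    leaf = majority_class(train, target)
    if accuracy(leaf, validation, leaf, target) >= accuracy(tree, validation, leaf, target):
        return leaf
    return tree

# Toy usage: an 11/3 split of the tennis data just to exercise the code.
train, validation = DATA[:11], DATA[11:]
pruned = prune(id3([], train, ATTRIBUTES), train, validation)
print(pruned)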
Rule Post-Pruning
• Instead of pruning entire subtrees, rule post-pruning affects only parts of the decision chain
• Convert the tree into a set of rules:
  - Each path is represented as a rule of the form: if a1 = v1 ∧ . . . ∧ an = vn then class = c
  - For example: if outlook = sunny ∧ humidity = high then play = no
• Remove conjuncts in order of improvement on the validation set until there is no further improvement
• All paths are pruned independently of each other

Continuous-Valued Attributes
• Treating real-valued attributes (like temperature) as discrete values is clearly inappropriate
• Use a threshold: if value(a) < c then . . . else . . .
• The threshold c can be determined by computing the maximum information gain for different candidate thresholds
• Note: numeric attributes can be repeated along the same path
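
A sketch of threshold selection for a numeric attribute: candidate thresholds are midpoints between adjacent sorted values, scored by the information gain of the binary test value(a) < c. The numeric temp_f values attached to the tennis data in the demo are illustrative only; the handout's temperature attribute is nominal.

def best_threshold(instances, attribute, target="play"):
    """Pick the cut point c maximising the gain of the binary test value(a) < c."""
    values = sorted({inst[attribute] for inst in instances})
    best_c, best_gain = None, -1.0
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2
        below = [i for i in instances if i[attribute] < c]
        above = [i for i in instances if i[attribute] >= c]
        gain = entropy(instances, target) - (
            len(below) / len(instances) * entropy(below, target)
            + len(above) / len(instances) * entropy(above, target))
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Illustrative numeric temperatures attached to the 14 tennis instances (not from the handout):
numeric = [dict(inst, temp_f=t) for inst, t in
           zip(DATA, [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71])]
print(best_threshold(numeric, "temp_f"))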
Information Gain Revisited
  H(S) = − ∑_{c ∈ C} p_c log2 p_c
  gain(S, A) = H(S) − ∑_{v ∈ values(A)} (|S_v| / |S|) H(S_v)
• Information gain favors attributes with many values over those with few. Why?
• Extension: measure how broadly and uniformly the attribute splits the data:
  split(S, A) = − ∑_{v ∈ values(A)} (|S_v| / |S|) log2 (|S_v| / |S|)
• split is the entropy of the attribute-value distribution in S

Information Gain Revisited
• Information gain and split can be combined:
  gain_ratio(S, A) = gain(S, A) / split(S, A)
• If |values(A)| = n and A completely determines the class, then split(S, A) = log2 n
• If |S_v| ≈ |S| for one v, then split(S, A) becomes small and boosts the gain ratio
• Heuristic: compute the information gain first, remove attributes with below-average gain, and then select the attribute with the highest gain ratio
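
A sketch of split information and gain ratio as defined above, plus the slide's selection heuristic, reusing the earlier helpers; select_attribute is an assumed name.

from collections import Counter
from math import log2

def split_information(instances, attribute):
    """split(S, A): the entropy of the attribute-value distribution in S."""
    total = len(instances)
    counts = Counter(inst[attribute] for inst in instances)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(instances, attribute, target="play"):
    split = split_information(instances, attribute)
    if split == 0:                         # attribute takes a single value in S
        return 0.0
    return information_gain(instances, attribute, target) / split

def select_attribute(instances, attributes, target="play"):
    """Slide heuristic: drop below-average-gain attributes, then pick the
    highest gain ratio among the remaining candidates."""
    gains = {a: information_gain(instances, a, target) for a in attributes}
    average = sum(gains.values()) / len(gains)
    candidates = [a for a in attributes if gains[a] >= average]
    return max(candidates, key=lambda a: gain_ratio(instances, a, target))

print(select_attribute(DATA, ATTRIBUTES))   # "outlook"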
Missing Attribute Values
• In real-world data it is not unusual that some instances have missing attribute values
• How do we compute gain(S, A) for instances with missing values?
• Assume an instance x with class(x) = c and a(x) = ?
  - Take the most frequent value of a over all instances in S (with the same class)
  - Take the average value of a over all instances in S (with the same class) for numeric attributes

Missing Attribute Values
• Instead of choosing the single most frequent value, use fractional instances
• E.g., if p(a(x) = 1 | S) = 0.6 and p(a(x) = 0 | S) = 0.4, then 0.6 (0.4) fractional instances with missing values for a are passed down the a = 1 (a = 0) branch
• The entropy computation has to be adapted accordingly
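
One way to realise the fractional-instance idea is to give each instance a weight and use a weighted entropy; the weight key 'w' and the 0.6/0.4 split below follow the slide's illustration, everything else is an assumption of mine.

from collections import Counter
from math import log2

def weighted_entropy(instances, target="play"):
    """Entropy where each instance carries a fractional weight inst['w'] (default 1.0)."""
    weights = Counter()
    for inst in instances:
        weights[inst[target]] += inst.get("w", 1.0)
    total = sum(weights.values())
    return -sum((w / total) * log2(w / total) for w in weights.values() if w > 0)

# An instance with a missing 'wind' value is split into fractional copies and
# passed down both branches, weighted by how frequent each value is in S:
missing = {"outlook": "rain", "temperature": "mild", "humidity": "high", "play": "yes"}
weak_branch   = dict(missing, wind="weak",   w=0.6)
strong_branch = dict(missing, wind="strong", w=0.4)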
Attributes with Different Costs
• In real-world scenarios there can be costs associated with computing the values of attributes (medical tests, computing time, . . . )
• Considering costs might favor the use of lower-cost attributes
• Suggested measures include:
  - gain(S, A) / cost(A)
  - (2^gain(S, A) − 1) / (cost(A) + 1)^w
    where w ∈ [0, 1] is a weight determining the importance of cost

Predicting Continuous Values
• So far we have focused on predicting discrete classes (i.e. nominal classification)
• What has to change when predicting real values?
• The splitting criterion is redefined:
  - Information gain:
    gain(S, A) = H(S) − ∑_{v ∈ values(A)} (|S_v| / |S|) H(S_v)
  - Standard deviation reduction (SDR):
    sdr(S, A) = std_dev(S) − ∑_{v ∈ values(A)} (|S_v| / |S|) std_dev(S_v)
    where std_dev(S) = sqrt( (1/|S|) ∑_{s ∈ S} (val(s) − avg_val(S))² )
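
A sketch of standard deviation reduction; the target key name 'val' is an assumption, and the square root in std_dev follows the usual definition of standard deviation (the handout's rendering of the formula is hard to read on that point).

from math import sqrt

def std_dev(instances, target="val"):
    values = [inst[target] for inst in instances]
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(instances, attribute, target="val"):
    """sdr(S, A) = std_dev(S) - sum_v (|S_v|/|S|) * std_dev(S_v)."""
    total = len(instances)
    reduction = std_dev(instances, target)
    for v in {inst[attribute] for inst in instances}:
        subset = [inst for inst in instances if inst[attribute] == v]
        reduction -= (len(subset) / total) * std_dev(subset, target)
    return reduction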
Predicting Continuous Values
• Stopping criterion in nominal classification: stop when all instances at a leaf have the same class
• Too fine-grained for real-value prediction
• Stop when the standard deviation of node n is less than some predefined ratio of the standard deviation of the original instance set:
  stop if std_dev(S_n) / std_dev(S_all) < θ, where, e.g., θ = 0.05

Predicting Continuous Values
• If we decide not to split further on node n, what should the predicted value be?
• Simple solution: the average target value of the instances underneath node n: class(n) = avg_val(S_n)
• This approach is used in regression trees
• More sophisticated: associate linear regression models with the leaf nodes (model trees)
Model Tree Learning
• Suppose we have a leaf node n; regression trees use the average target value of the instances under n
• A more fine-grained approach is to apply linear regression to all instances under n:
  class(n) = a + b1 x1 + b2 x2 + · · · + bm xm
  where x1, x2, . . . , xm are the values of the attributes that lead to n in the tree
• a and the bi are estimated just as in linear regression
• Problem: not all attributes are numerical!

Converting Nominal Attributes
• Assume a nominal attribute such as outlook = {sunny, overcast, rain}
• We can convert this into numerical values simply by choosing equi-distant values from a specific interval: outlook = {1, 0.5, 0}
• This assumes an intuitive ordering of the values: sunny > overcast > rain
• A direct ordering of the values is not always possible: city = {london, new york, tokyo}
  london > new york > tokyo ???
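
A sketch of fitting the linear model at a leaf by ordinary least squares with numpy; the function name fit_leaf_model and the toy numeric attributes in the usage lines are assumptions, and it presumes nominal attributes have already been converted to numbers as discussed next.

import numpy as np

def fit_leaf_model(instances, attributes, target="val"):
    """Fit class(n) = a + b1*x1 + ... + bm*xm to the instances under a leaf
    by ordinary least squares (attributes must already be numeric)."""
    X = np.array([[1.0] + [inst[a] for a in attributes] for inst in instances])
    y = np.array([inst[target] for inst in instances])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b = coeffs[0], coeffs[1:]
    return lambda inst: a + sum(bi * inst[attr] for bi, attr in zip(b, attributes))

# Toy usage with two hypothetical numeric attributes:
leaf = [{"x1": 1.0, "x2": 2.0, "val": 5.0},
        {"x1": 2.0, "x2": 0.0, "val": 4.0},
        {"x1": 3.0, "x2": 1.0, "val": 7.0}]
predict = fit_leaf_model(leaf, ["x1", "x2"])
print(round(predict({"x1": 2.0, "x2": 1.0}), 2))   # 5.33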
Converting Nominal Attributes
• Sort the nominal values of attribute A by their average target values
• If nominal attribute A has k values, introduce k − 1 synthetic binary attributes
• The i-th binary attribute checks whether the i-th nominal value in the ordering holds
• For instance, if avg_trg_val(new york) < avg_trg_val(london) < avg_trg_val(tokyo), then the k − 1 synthetic binary attributes are: is_new_york and is_new_york_or_london

Recap
• Elements of a decision tree
• Information gain
• ID3 algorithm
• Bias of ID3
• Overfitting and pruning
• Attributes with many values (gain ratio)
• Attributes with continuous values
• Attributes with missing values
• Predicting continuous classes:
  - Regression trees
  - Model trees