3.
Tree Learning Task
A learning algorithm performs induction on a training set to learn a model (a decision tree); the model is then applied (deduction) to a test set.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
4.
Example of a Decision Tree
The training data has two categorical attributes (Refund, MarSt), one continuous attribute (TaxInc), and the class. Model (decision tree); internal nodes are the splitting attributes:

  Refund
    Yes -> NO
    No  -> MarSt
             Married           -> NO
             Single, Divorced  -> TaxInc
                                    < 80K -> NO
                                    > 80K -> YES
5.
Another Example of Decision Tree
Same categorical/continuous attributes and class:

  MarSt
    Married          -> NO
    Single, Divorced -> Refund
                          Yes -> NO
                          No  -> TaxInc
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same data!
6.
Decision Tree Classification Task
7.
Apply Model to Test Data
Start from the root of the tree and route the test record down the branch that matches each of its attribute values (Refund, then MarSt, then TaxInc).
12.
Apply Model to Test Data
The test record reaches a leaf labeled NO, so assign Cheat to "No".
13.
Decision Tree Classification Task
Tree-based classifiers for instances represented as feature vectors: internal nodes test features, there is one branch for each value of the feature, and leaves specify the category.
Can represent arbitrary conjunctions and disjunctions, and hence any classification function over discrete feature vectors.
Can be rewritten as a set of rules, i.e. disjunctive normal form (DNF).
red ∧ circle -> pos

The same instance space with multiple categories:
red ∧ circle -> A
blue -> B; red ∧ square -> B
green -> C; red ∧ triangle -> C

(Figures: two trees splitting on color {red, blue, green} and, under red, on shape {circle, square, triangle}; the first has pos/neg leaves, the second has leaves labeled A, B, C.)
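The path-to-rule translation above is mechanical. Below is a minimal sketch assuming trees are nested dicts; the function name and the dict layout are illustrative, not from the slides. Each root-to-leaf path becomes one conjunctive rule, and the rules for any one label, OR-ed together, form its DNF.

```python
def tree_to_rules(tree, path=()):
    """Collect each root-to-leaf path as a (conjunction, label) rule."""
    if not isinstance(tree, dict):            # leaf: the path so far implies this label
        return [(path, tree)]
    rules = []
    for value, subtree in tree["children"].items():
        rules.extend(tree_to_rules(subtree, path + ((tree["feature"], value),)))
    return rules

# The multi-category tree from the slide: split on color, then on shape under red.
tree = {"feature": "color", "children": {
    "red": {"feature": "shape",
            "children": {"circle": "A", "square": "B", "triangle": "C"}},
    "blue": "B",
    "green": "C",
}}

rules = tree_to_rules(tree)
# e.g. ((("color", "red"), ("shape", "circle")), "A") and ((("color", "blue"),), "B")
```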
Continuous (real-valued) features can be handled by allowing nodes to split a real-valued feature into two ranges based on a threshold (e.g. length < 3 and length >= 3).
Classification trees have discrete class labels at the leaves; regression trees allow real-valued outputs at the leaves.
Algorithms for finding consistent trees are efficient enough to process the large amounts of training data found in data mining tasks.
Methods developed for handling noisy training data (both class and feature noise).
Methods developed for handling missing feature values.
Recursively build a tree top-down by divide and conquer.
Example: <big, red, circle>: +    <small, red, circle>: +
         <small, red, square>: -  <big, blue, circle>: -

Split on color:
  red   -> {<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -}
           split on shape:
             circle   -> pos
             square   -> neg
             triangle -> (empty)
  blue  -> {<big, blue, circle>: -} -> neg
  green -> (empty)
21.
Decision Tree Induction Pseudocode

DTree(examples, features) returns a tree
  If all examples are in one category,
    return a leaf node with that category label.
  Else if the set of features is empty,
    return a leaf node with the category label that is most common in examples.
  Else pick a feature F and create a node R for it.
    For each possible value v_i of F:
      Let examples_i be the subset of examples that have value v_i for F.
      Add an outgoing edge E to node R labeled with the value v_i.
      If examples_i is empty,
        then attach a leaf node to edge E labeled with the category
        that is most common in examples.
      else call DTree(examples_i, features - {F}) and attach the resulting
        tree as the subtree under edge E.
    Return the subtree rooted at R.
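The pseudocode translates almost line-for-line into Python. A sketch, with two assumptions worth flagging: examples are (feature-dict, label) pairs, and information gain is used as the (otherwise unspecified) feature-picking heuristic. Unseen values at classification time fall back to the node's majority class, which plays the role of the empty-examples leaf.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def dtree(examples, features):
    """examples: list of (feature_dict, label) pairs; features: list of names."""
    labels = [y for _, y in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                 # all examples in one category
        return labels[0]
    if not features:                          # set of features is empty
        return majority

    def gain(f):                              # information gain of splitting on f
        by_value = {}
        for x, y in examples:
            by_value.setdefault(x[f], []).append(y)
        rest = sum(len(ys) / len(labels) * entropy(ys) for ys in by_value.values())
        return entropy(labels) - rest

    best = max(features, key=gain)            # "pick a feature F" (heuristic choice)
    node = {"feature": best, "default": majority, "children": {}}
    subsets = {}
    for x, y in examples:
        subsets.setdefault(x[best], []).append((x, y))
    for value, subset in subsets.items():
        node["children"][value] = dtree(subset, [f for f in features if f != best])
    return node

def classify(node, x):
    while isinstance(node, dict):             # descend until a leaf label is reached
        node = node["children"].get(x[node["feature"]], node["default"])
    return node

examples = [
    ({"size": "big",   "color": "red",  "shape": "circle"}, "+"),
    ({"size": "small", "color": "red",  "shape": "circle"}, "+"),
    ({"size": "small", "color": "red",  "shape": "square"}, "-"),
    ({"size": "big",   "color": "blue", "shape": "circle"}, "-"),
]
tree = dtree(examples, ["size", "color", "shape"])
```

On the slides' four shape/color examples this builds a consistent tree that splits on color, then on shape under red.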
22.
The Basic Decision Tree Learning Algorithm: Top-Down Induction of Decision Trees. This approach is exemplified by the ID3 algorithm and its successor, C4.5.
Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
In the late 1970s, Quinlan developed ID3 with the information-gain heuristic to learn expert systems from examples.
Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
In the 1980s, a variety of improvements were introduced to handle noise, continuous features, missing features, and improved splitting criteria. Various expert-system development tools resulted.
Quinlan's updated decision-tree package (C4.5) was released in 1993.
Let D_t be the set of training records that reach a node t.
General Procedure:
If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
If D_t is an empty set, then t is a leaf node labeled by the default class, y_d.
If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
31.
Hunt’s Algorithm (on the Tid/Refund training data):
Step 1: a single leaf, Cheat=No.
Step 2: split on Refund: Yes -> Cheat=No; No -> Cheat=No.
Step 3: under Refund=No, split on Marital Status: Single, Divorced -> Cheat; Married -> Cheat=No.
Step 4: under Single, Divorced, split on Taxable Income: < 80K -> Cheat=No; >= 80K -> Cheat.
Goal is to have the resulting tree be as small as possible , per Occam’s razor.
Finding a minimal decision tree (nodes, leaves, or depth) is an NP-hard optimization problem.
The top-down divide-and-conquer method does a greedy search for a simple tree but is not guaranteed to find the smallest.
General lesson in ML: “Greed is good.”
Want to pick a feature that creates subsets of examples that are relatively "pure" in a single class, so they are "closer" to being leaf nodes.
There are a variety of heuristics for picking a good test; a popular one is based on information gain, which originated with Quinlan's ID3 system (1979).
43.
Training Examples for PlayTennis

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Splitting on Outlook partitions the 14 examples into three subsets:
  sunny:    p1 = 2, n1 = 3, E(p1, n1) = 0.971
  overcast: p2 = 4, n2 = 0, E(p2, n2) = 0
  rain:     p3 = 3, n3 = 2, E(p3, n3) = 0.971
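These entropy values, and the resulting information gain for Outlook, can be checked in a few lines of Python. A sketch: E(p, n) is the binary entropy of a node with p positive and n negative examples, and the split dictionary transcribes the counts above.

```python
import math

def E(p, n):
    """Entropy (in bits) of a node with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

splits = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

root = E(9, 5)                                      # entropy at the root, ~0.940 bits
after = sum((p + n) / 14 * E(p, n) for p, n in splits.values())
gain = root - after                                 # information gain of Outlook, ~0.247 bits
```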
47.
ID3: the amount of information required to specify the class of an example, given that it reaches a node. At the root (9 play, 5 don't play): 0.94 bits. Expected information after each candidate split, and the resulting gain:

  outlook (sunny/overcast/rainy): 0.97*5/14 + 0.0*4/14 + 0.97*5/14 = 0.69 bits   gain: 0.25 bits  <- maximal information gain
  humidity (high/normal):         0.98*7/14 + 0.59*7/14            = 0.79 bits   gain: 0.15 bits
  temperature (hot/mild/cool):    1.0*4/14 + 0.92*6/14 + 0.81*4/14 = 0.91 bits   gain: 0.03 bits
  windy (false/true):             0.81*8/14 + 1.0*6/14             = 0.89 bits   gain: 0.05 bits

So outlook is chosen at the root.
48.
ID3: under outlook = sunny (2 play, 3 don't play: 0.97 bits):

  humidity (high/normal):      0.0*3/5 + 0.0*2/5           = 0.0 bits    gain: 0.97 bits  <- maximal information gain
  temperature (hot/mild/cool): 0.0*2/5 + 1.0*2/5 + 0.0*1/5 = 0.40 bits   gain: 0.57 bits
  windy (false/true):          0.92*3/5 + 1.0*2/5          = 0.95 bits   gain: 0.02 bits
49.
ID3: the sunny subtree is resolved by humidity (high / normal). Under outlook = rainy (3 play, 2 don't play: 0.97 bits):

  temperature (mild/cool): 0.92*3/5 + 1.0*2/5 = 0.95 bits   gain: 0.02 bits
  humidity (high/normal):  1.0*2/5 + 0.92*3/5 = 0.95 bits   gain: 0.02 bits
  windy (false/true):      0.0*3/5 + 0.0*2/5  = 0.0 bits    gain: 0.97 bits  <- maximal information gain
50.
ID3: the final tree:

  outlook
    sunny    -> humidity
                  high   -> No
                  normal -> Yes
    overcast -> Yes
    rainy    -> windy
                  false -> Yes
                  true  -> No
51.
Hypothesis Space Search in Decision Tree Learning
ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.
The hypothesis space searched by ID3 is the set of possible decision trees.
ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, yielding a locally optimal solution.
52.
Hypothesis Space Search in Decision Tree Learning
The hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes.
Occam’s Razor: Prefer the simplest hypothesis that fits the data
William of Ockham (AD 1285? – 1347?)
55.
Complex DT vs. Simple DT
(Figure: a complex tree rooted at temperature that re-tests outlook, windy, and humidity along many branches, versus the simple tree outlook -> sunny: humidity (high: N, normal: P); overcast: P; rain: windy (true: N, false: P). Both fit the same data.)
56.
Inductive Bias in Decision Tree Learning Occam’s Razor
Entropy of the split = 0 (since each leaf node is “pure”, having only one case).
Information gain is maximal for ID code
ID code: one branch per training example of the PlayTennis table, each a pure leaf:
  D1 -> No, D2 -> No, D3 -> Yes, ..., D13 -> Yes, D14 -> No
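The effect is easy to verify: splitting on a unique ID code drives the split entropy to zero, so its information gain equals the full entropy of the class labels. A sketch, with the PlayTennis labels transcribed from the table:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]   # D1..D14

# Each ID-code branch holds exactly one (pure) example:
split_entropy = sum(1 / len(play) * entropy([y]) for y in play)
gain = entropy(play) - split_entropy   # maximal: the full class entropy (~0.94 bits)
```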
Sort instances by the values of the numeric attribute
Time complexity for sorting: O ( n log n )
Q. Does this have to be repeated at each node of the tree?
A: No! Sort order for children can be derived from sort order for parent
Time complexity of derivation: O ( n )
Drawback: need to create and store an array of sorted indices for each numeric attribute
witten & eibe
68.
Weather data – nominal values

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
…         …            …         …      …

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
Entropy only needs to be evaluated between points of different classes (Fayyad & Irani, 1992)
Potential optimal breakpoints: breakpoints between values of the same class cannot be optimal.

value: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
class: Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
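A sketch of the Fayyad & Irani observation applied to the sorted values above: only midpoints between adjacent examples of different classes need to be evaluated. For simplicity this version also skips adjacent pairs with equal values (the tied 72s and 75s), which a full implementation would handle more carefully.

```python
def candidate_breakpoints(values, classes):
    """Midpoints between adjacent sorted examples whose classes differ.

    Assumes `values` is already sorted; pairs with equal values are skipped."""
    pairs = zip(zip(values, classes), zip(values[1:], classes[1:]))
    return [(v1 + v2) / 2
            for (v1, c1), (v2, c2) in pairs
            if c1 != c2 and v1 != v2]

values  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
classes = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
           "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]

print(candidate_breakpoints(values, classes))
# [64.5, 66.5, 70.5, 77.5, 80.5, 84.0]
```

Only 6 of the 13 adjacent gaps need an entropy evaluation.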
Missing values affect decision tree construction in three different ways:
Affects how impurity measures are computed
Affects how to distribute instance with missing value to child nodes
Affects how a test instance with missing value is classified
77.
Distribute Instances
A training record with a missing Refund value must still be sent down the split on Refund. The probability that Refund = Yes is 3/9 and that Refund = No is 6/9, so assign the record to the Yes child with weight 3/9 and to the No child with weight 6/9.
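The weighting scheme can be sketched in a few lines: split the instance's weight across the children in proportion to the counts of each attribute value among the training records at the node (the function name is illustrative):

```python
def distribute(weight, value_counts):
    """Split an instance's weight across child branches in proportion
    to how often each attribute value occurs in the training records."""
    total = sum(value_counts.values())
    return {value: weight * count / total for value, count in value_counts.items()}

# Refund is missing; 3 of 9 training records have Refund=Yes, 6 have Refund=No.
weights = distribute(1.0, {"Yes": 3, "No": 6})
# {'Yes': 0.333..., 'No': 0.666...}
```

Recursive splits lower in the tree then multiply these fractional weights further.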
79.
Classify Instances (using the Refund / MarSt / TaxInc tree from before)
A new record with a missing Marital Status reaches the MarSt node, where the training class counts are:

             Married  Single  Divorced  Total
  Class=No   3        1       0         4
  Class=Yes  6/9      1       1         2.67
  Total      3.67     2       1         6.67

Probability that Marital Status = Married is 3.67/6.67.
Probability that Marital Status = {Single, Divorced} is 3/6.67.
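Classification with the missing attribute then sums the class counts over all branches the record could take, weighted by branch probability. A sketch using the table's numbers (the 6/9 entry is the fractional Yes-count that Married inherited from the earlier weighted split):

```python
# Class counts at the MarSt node, per branch (from the table above).
counts = {
    "Married":  {"No": 3, "Yes": 6 / 9},
    "Single":   {"No": 1, "Yes": 1},
    "Divorced": {"No": 0, "Yes": 1},
}

total = sum(sum(c.values()) for c in counts.values())   # 6.67 weighted records
p_no = sum(c["No"] for c in counts.values()) / total    # P(Class=No) = 0.6
p_yes = 1 - p_no                                        # P(Class=Yes) = 0.4
```

The record is therefore assigned the majority class "No" with probability 0.6.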
Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
There may be noise in the training data that the tree is erroneously fitting.
The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends.
(Figure: accuracy on training data and on test data, plotted as a function of hypothesis complexity.)
94.
Overfitting Example
Testing Ohm's Law: V = IR (I = (1/R)V). Experimentally measure 10 points (voltage vs. current) and fit a curve to the resulting data. A 9th-degree polynomial gives a perfect fit to the training data (n points can be fit exactly with a degree-(n-1) polynomial). Ohm was wrong, we have found a more accurate function!
95.
Overfitting Example
Testing Ohm's Law: V = IR (I = (1/R)V). Better generalization is obtained with a linear function that fits the training data less accurately.
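The contrast can be reproduced in pure Python: interpolate 10 noisy measurements exactly with a 9th-degree (Lagrange) polynomial, and fit the same data with a least-squares line. The resistance value, noise level, and helper names here are illustrative assumptions, not from the slides.

```python
import random

random.seed(0)
R = 2.0
volts = [float(v) for v in range(1, 11)]                 # 10 measured voltages
amps = [v / R + random.gauss(0, 0.05) for v in volts]    # noisy currents, I = V/R

def lagrange(xs, ys, x):
    """Degree-(n-1) interpolating polynomial: passes through every training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def linear_fit(xs, ys):
    """Least-squares line y = a*x + b: fits the training data only approximately."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

a, b = linear_fit(volts, amps)
# The polynomial reproduces every noisy measurement exactly, noise included;
# the line fits them only approximately but recovers the true slope 1/R ≈ 0.5.
```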
(Figure: a tree grown with extra splits on shape and color in order to fit the noisy examples <big, blue, circle>: - and <medium, blue, circle>: +.)
Noise can also cause different instances of the same feature vector to have different classes. Such data is impossible to fit exactly; the leaf must be labeled with the majority class.
<big, red, circle>: neg (but really pos )
Conflicting examples can also arise if the features are incomplete and inadequate to determine the class or if the target concept is non-deterministic.
Prepruning: Stop growing the tree at some point during top-down construction when there is no longer sufficient data to make reliable decisions.
Postpruning : Grow the full tree, then remove subtrees that do not have sufficient evidence.
Label leaf resulting from pruning with the majority class of the remaining data, or a class probability distribution .
Method for determining which subtrees to prune:
Cross-validation : Reserve some training data as a hold-out set ( validation set ) to evaluate utility of subtrees.
Statistical test: Use a statistical test on the training data to determine if any observed regularity can be dismissed as likely due to random chance.
Minimum description length (MDL): Determine whether the additional complexity of keeping a subtree costs more to encode than simply remembering the exceptions that would result from pruning it.