2. Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
– The topmost node in the tree is the root node.
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
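As a quick, concrete illustration of the two phases above, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the tiny toy dataset and the integer encoding of the attributes are assumptions made only for demonstration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [Outlook, Humidity], integer-encoded (an assumption)
# Outlook: 0=Sunny, 1=Overcast, 2=Rain ; Humidity: 0=High, 1=Normal
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["No", "Yes", "Yes", "Yes", "No", "Yes"]

# Tree construction: recursively partition the training examples
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Use of the tree: classify an unknown sample by testing its attribute values
print(clf.predict([[0, 1]]))                      # Sunny, Normal humidity
print(export_text(clf, feature_names=["Outlook", "Humidity"]))
```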
3. Decision Tree for PlayTennis
[Figure: a partial PlayTennis tree with root node Outlook (branches Sunny, Overcast, Rain); under Sunny, a Humidity node with branches High → No and Normal → Yes]
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
(1) Which attribute to start with? (root)
(2) Which node to proceed to next?
(3) When to stop / come to a conclusion?
Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.
4. Decision Tree for Conjunction
Outlook=Sunny ∧ Wind=Weak
[Figure: root node Outlook; Sunny → Wind node (Strong → No, Weak → Yes); Overcast → No; Rain → No]
5. Decision Tree for Disjunction
Outlook=Sunny ∨ Wind=Weak
[Figure: root node Outlook; Sunny → Yes; Overcast → Wind node (Strong → No, Weak → Yes); Rain → Wind node (Strong → No, Weak → Yes)]
6. Decision Tree for XOR
Outlook=Sunny XOR Wind=Weak
[Figure: root node Outlook; Sunny → Wind node (Strong → Yes, Weak → No); Overcast → Wind node (Strong → No, Weak → Yes); Rain → Wind node (Strong → No, Weak → Yes)]
7. Decision Tree
[Figure: the full PlayTennis tree; root node Outlook; Sunny → Humidity node (High → No, Normal → Yes); Overcast → Yes; Rain → Wind node (Strong → No, Weak → Yes)]
• Decision trees represent disjunctions of conjunctions:
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
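The disjunction-of-conjunctions reading of this tree can be written out directly; the small sketch below (not part of the original slides) encodes it as a Python predicate.

```python
# Each conjunction is one root-to-"Yes"-leaf path of the tree above.
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Return True ("Yes") exactly when one of the three conjunctions holds."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or (outlook == "Overcast")
            or (outlook == "Rain" and wind == "Weak"))

# Example: Sunny, High humidity, Weak wind -> False ("No"), as in the tree.
print(play_tennis("Sunny", "High", "Weak"))
```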
8. When to consider Decision Trees
• Instances describable by attribute-value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Missing attribute values
• Examples:
– Medical diagnosis
– Credit risk analysis
– Object classification for robot manipulator (Tan 1993)
9. A simple example
• You want to guess the outcome of next week's game
between the MallRats and the Chinooks.
• Available knowledge / attributes
– Was the game at Home or Away?
– Was the starting time 5pm, 7pm, or 9pm?
– Did Joe play center or forward?
– Was the opponent's center tall or not?
– …
11. What do we know?
• The game will be away, at 9pm, and Joe will play center on offense…
• A classification problem
• Generalizing the learned rule to new examples
• What you don't know, of course, is who will win this game.
• It is reasonable to assume that this future game will resemble the past games. Note, however, that there are no previous games matching these specific values, i.e., no previous game was exactly [Where=Away, When=9pm, FredStarts=No, JoeOffense=Center, JoeDefends=Forward, OppC=Tall].
We therefore need to generalize by using the known examples to infer the likely outcome of this new situation. But how?
12. Use a Decision Tree to determine who should win the game
Since we did not indicate the outcome of this game, we call it an "unlabeled instance"; the goal of a classifier is to find the class label for such unlabeled instances.
An instance that also includes the outcome is called a "labeled instance", e.g., the first row of the table corresponds to a labeled instance.
14. Example of a Decision Tree
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree
[Figure: root node Refund (Yes → NO; No → MarSt); MarSt (Married → NO; Single, Divorced → TaxInc); TaxInc (< 80K → NO; ≥ 80K → YES)]
15. Apply Model to Test Data
[Same decision tree as in slide 14]
Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
Start at the root of the tree.
16.–19. Apply Model to Test Data
[The same tree and test record are repeated on each slide; the traversal advances one node at a time along the path Refund = No → MarSt = Married]
20. Apply Model to Test Data
[Same tree and test record; the path Refund = No → MarSt = Married ends at a leaf labelled NO]
Assign Cheat to “No”
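The traversal on slides 15–20 can be written as a few nested tests; the sketch below (an illustration, not the slides' code) hard-codes the slide-14 tree and classifies the test record.

```python
def classify(refund: str, marital_status: str, taxable_income_k: float) -> str:
    """Walk the tree from the root to a leaf and return the Cheat label."""
    if refund == "Yes":
        return "No"
    # Refund = No: test Marital Status next
    if marital_status == "Married":
        return "No"
    # Single or Divorced: test Taxable Income
    return "No" if taxable_income_k < 80 else "Yes"

# Test record: Refund=No, Marital Status=Married, Taxable Income=80K
print(classify("No", "Married", 80))   # -> "No", as on slide 20
```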
21. Decision Tree Algorithm
Principle
‒ Basic algorithm (adopted by ID3, C4.5 and CART): a greedy algorithm
‒ Tree is constructed in a top-down recursive divide-and-conquer manner
‒ Attributes are categorical (if continuous-valued, they are discretized in advance)
‒ Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
Iterations
‒ At start, all the training tuples are at the root
‒ Tuples are partitioned recursively based on selected attributes
‒ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Stopping conditions
‒ All samples for a given node belong to the same class
‒ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
‒ There are no samples left
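A minimal recursive sketch of this greedy, top-down procedure follows; the nested-dict tree representation and the choose_best parameter (standing in for a heuristic such as information gain) are assumptions for illustration, not the slides' exact pseudocode.

```python
from collections import Counter

def build_tree(rows, attributes, target, choose_best):
    """rows: list of dicts; choose_best(rows, attributes, target) -> attribute name."""
    if not rows:                                  # no samples left
        return None
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                     # all samples in the same class
        return labels[0]
    if not attributes:                            # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = choose_best(rows, attributes, target)  # greedy attribute selection
    node = {best: {}}
    for value in {r[best] for r in rows}:         # partition on the chosen attribute
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, remaining, target, choose_best)
    return node
```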
26. How to Choose an Attribute? (Attribute Selection Measures)
• An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes.
Ideally
‒ Each resulting partition would be pure
‒ A pure partition is a partition containing tuples that all belong to the
same class
• Attribute selection measures (splitting rules)
‒ Determine how the tuples at a given node are to be split
‒ Provide a ranking for each attribute describing the tuples
‒ The attribute with the highest score is chosen
‒ Determine a split point or a splitting subset
• Methods
– Information gain (ID3 (Iterative Dichotomiser 3) /C4.5)
– Gain ratio
– Gini Index (IBM IntelligentMiner)
27. Before Describing Information Gain
Entropy is a measure of the average information content one
is missing when one does not know the value of the random
variable.
– Shannon's metric of "Entropy" of information is a foundational
concept of information theory.
– The entropy of a variable is the "amount of information"
contained in the variable.
High Entropy
– X is from a uniform-like distribution
– Flat histogram
– Values sampled from it are less predictable
Low Entropy
– X is from a varied (peaks and valleys) distribution
– Histogram has many lows and highs
– Values sampled from it are more predictable
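To make the high/low entropy contrast above concrete, here is a short sketch (not from the slides) that computes Shannon entropy for a flat and a peaked distribution.

```python
import math

def entropy(probabilities):
    """H(X) = -sum(p * log2(p)) over outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: 2.0 bits (high entropy)
print(entropy([0.97, 0.01, 0.01, 0.01]))  # peaked: ~0.24 bits (low entropy)
```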
30. Information Gain Approach
• Assume there are two classes, P and N
• Let the set of examples D contain p elements of class P and n elements of class N
• The amount of information needed to decide if an arbitrary example in D belongs to P or N is defined as

Info(D) = I(p, n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))

(where log2 x = log10 x / log10 2)
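A quick numeric check of I(p, n) for a hypothetical two-class set (the counts 9 and 5 are just an example, not from the slides):

```python
import math

def info(p, n):
    """I(p, n) = -(p/(p+n))·log2(p/(p+n)) - (n/(p+n))·log2(n/(p+n))."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

print(round(info(9, 5), 3))   # 9 positive and 5 negative examples -> 0.94
```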
33. Information Gain in Attribute
• Assume that using attribute A, a set D will be partitioned into sets {D1, D2, …, Dv}
– If Di contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Di, is

E(A) = Σ (i = 1 to v) [(pi + ni) / (p + n)] · I(pi, ni)

• The encoding information that would be gained by branching on A is

Gain(A) = I(p, n) − E(A)
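Putting the two formulas together, the sketch below (an illustration, not the slides' code) computes E(A) and Gain(A) for a hypothetical three-way split of a set with p = 9 and n = 5.

```python
import math

def info(p, n):
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

def gain(partitions, p, n):
    """partitions: list of (pi, ni) pairs produced by splitting D on attribute A."""
    expected = sum(((pi + ni) / (p + n)) * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - expected          # Gain(A) = I(p, n) - E(A)

# Hypothetical split of a set with p=9, n=5 into three subsets:
print(round(gain([(2, 3), (4, 0), (3, 2)], 9, 5), 3))   # -> 0.247
```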
37. Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
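One rule per root-to-leaf path can be produced mechanically; the sketch below (an assumption, not the slides' algorithm) walks a nested-dict tree shaped like the buys_computer example above and prints the corresponding IF-THEN rules.

```python
def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                      # leaf: emit one rule
        tests = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        print(f'IF {tests} THEN buys_computer = "{node}"')
        return
    (attribute, branches), = node.items()               # single test per node
    for value, child in branches.items():
        extract_rules(child, conditions + ((attribute, value),))

# Tree matching the example rules above (structure assumed for illustration):
tree = {"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                "31...40": "yes",
                ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}}}}
extract_rules(tree)
```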
38. Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Results in poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not
split a node if this would result in the goodness
measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data
to decide which is the “best pruned tree”
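Both ideas map onto standard library options; a minimal scikit-learn sketch is shown below (the dataset and parameter values are illustrative assumptions, not from the slides).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning: do not split a node if the goodness measure falls below a threshold
pre = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X, y)

# Postpruning: grow fully, then prune branches via cost-complexity pruning
# (larger ccp_alpha -> more aggressive pruning)
post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre.get_depth(), post.get_depth())
```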
39. Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross-validation, e.g., 10-fold cross-validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
• Use the minimum description length (MDL) principle:
– halt growth of the tree when the encoding is minimized
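For the cross-validation option, a short sketch (not from the slides; the dataset and max_depth grid are assumptions) might look like:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Compare candidate tree sizes by 10-fold cross-validated accuracy
for depth in (1, 2, 3, 5, None):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=10)
    print(depth, round(scores.mean(), 3))
```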
40. Enhancements to basic decision tree induction
• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are
sparsely represented
– This reduces fragmentation, repetition, and replication
41. Exercise: For the following medical diagnosis data, create a decision tree.
Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
YES          YES    YES             YES         YES       Strep Throat
NO           NO     NO              YES         YES       Allergy
YES          YES    NO              YES         NO        Cold
YES          NO     YES             NO          NO        Strep Throat
NO           YES    NO              YES         NO        Cold
NO           NO     NO              YES         NO        Allergy
NO           NO     YES             NO          NO        Strep Throat
YES          NO     NO              YES         YES       Allergy
NO           YES    NO              YES         YES       Cold
YES          YES    NO              YES         YES       Cold
43. Finding Splitting Attribute
• Select the attribute with the highest Gain
Distribution of diagnoses for Sore Throat:
Sore Throat  Strep Throat  Allergy  Cold
YES          2             1        2
NO           1             2        2

Info(YES) = −(2/5)log2(2/5) − (1/5)log2(1/5) − (2/5)log2(2/5) ≈ 1.52
Info(NO)  = −(1/5)log2(1/5) − (2/5)log2(2/5) − (2/5)log2(2/5) ≈ 1.52

E(Sore Throat) = P(YES)·Info(YES) + P(NO)·Info(NO)
              = (5/10)·1.52 + (5/10)·1.52 = 1.52

Gain(Sore Throat) = Info(S) − E(Sore Throat) ≈ 1.57 − 1.52 ≈ 0.05
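The numbers above can be reproduced with a short script (not from the slides):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

info_s   = entropy([3, 3, 4])          # whole set: 3 Strep, 3 Allergy, 4 Cold
info_yes = entropy([2, 1, 2])          # Sore Throat = YES
info_no  = entropy([1, 2, 2])          # Sore Throat = NO
expected = (5 / 10) * info_yes + (5 / 10) * info_no
print(round(info_s, 3), round(expected, 3), round(info_s - expected, 3))
# -> 1.571 1.522 0.049  (i.e., Gain(Sore Throat) ≈ 0.05)
```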
44. Decision Tree
• Gain for each attribute:
Attribute       Gain
Sore Throat     0.05
Fever           0.72
Swollen Glands  0.88
Congestion      0.45
Headache        0.05

[Figure: root node Swollen Glands; Yes → Diagnosis = Strep Throat; No → Fever node (Yes → Diagnosis = Cold; No → Diagnosis = Allergy)]

IF Swollen Glands = “YES”, THEN Diagnosis = Strep Throat
IF Swollen Glands = “NO” AND Fever = “YES”, THEN Diagnosis = Cold
IF Swollen Glands = “NO” AND Fever = “NO”, THEN Diagnosis = Allergy
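As a closing check (not from the slides), the sketch below recomputes the full gain table from the exercise data and confirms that Swollen Glands gives the highest gain.

```python
import math
from collections import Counter

rows = [  # (Sore Throat, Fever, Swollen Glands, Congestion, Headache, Diagnosis)
    ("YES", "YES", "YES", "YES", "YES", "Strep Throat"),
    ("NO",  "NO",  "NO",  "YES", "YES", "Allergy"),
    ("YES", "YES", "NO",  "YES", "NO",  "Cold"),
    ("YES", "NO",  "YES", "NO",  "NO",  "Strep Throat"),
    ("NO",  "YES", "NO",  "YES", "NO",  "Cold"),
    ("NO",  "NO",  "NO",  "YES", "NO",  "Allergy"),
    ("NO",  "NO",  "YES", "NO",  "NO",  "Strep Throat"),
    ("YES", "NO",  "NO",  "YES", "YES", "Allergy"),
    ("NO",  "YES", "NO",  "YES", "YES", "Cold"),
    ("YES", "YES", "NO",  "YES", "YES", "Cold"),
]
attributes = ["Sore Throat", "Fever", "Swollen Glands", "Congestion", "Headache"]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain(col):
    labels = [r[-1] for r in rows]
    expected = 0.0
    for value in set(r[col] for r in rows):       # partition on attribute `col`
        subset = [r[-1] for r in rows if r[col] == value]
        expected += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - expected

for i, name in enumerate(attributes):
    print(name, round(gain(i), 2))   # Swollen Glands has the highest gain (0.88)
```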