DECISION TREES

Introduction

Decision tree learning is a method that induces concepts from examples (inductive learning).
It is among the most widely used and practical learning methods.
The learning is supervised: the classes or categories of the data instances are known.
It represents concepts as decision trees, which can be rewritten as if-then rules.
The target function can be Boolean or discrete valued.

Example

[Figure: A decision tree for the concept PlayTennis. The root tests Outlook, with branches Sunny, Overcast, and Rain; the Sunny branch leads to a test on Humidity (High / Normal) and the Rain branch to a test on Wind (Strong / Weak).]

An unknown observation is classified by testing its attributes and reaching a leaf node.
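
As a minimal sketch (not part of the slides), the PlayTennis tree can be encoded as nested dictionaries and an observation classified by walking from the root to a leaf. The leaf labels for the Sunny branch follow the rules listed later in these slides; the Overcast and Rain labels are assumed from the standard PlayTennis example:

    # The PlayTennis tree as nested dicts: each inner node maps an attribute
    # name to {value: subtree}; leaves are class labels.
    tree = {
        "Outlook": {
            "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
            "Overcast": "Yes",
            "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
        }
    }

    def classify(node, observation):
        """Walk from the root to a leaf by testing one attribute per node."""
        while isinstance(node, dict):
            attribute = next(iter(node))  # the attribute tested at this node
            node = node[attribute][observation[attribute]]  # follow the branch
        return node

    print(classify(tree, {"Outlook": "Sunny", "Humidity": "High"}))  # -> No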

Decision Tree Representation

1. Each internal node corresponds to an attribute (a test)
2. Each branch corresponds to an attribute value
3. Each leaf node assigns a classification

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances.
Each path from the tree root to a leaf corresponds to a conjunction of attribute tests (one classification rule).
The tree itself corresponds to a disjunction of these conjunctions (a set of classification rules).
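
For instance, the PlayTennis tree above corresponds to the expression

    (Outlook = Sunny AND Humidity = Normal)
    OR (Outlook = Overcast)
    OR (Outlook = Rain AND Wind = Weak)

for the classification Yes (again assuming the standard PlayTennis leaf labels).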

Basic Decision Tree Learning Algorithm

Most algorithms for growing decision trees are variants of a basic algorithm.
An example of this core algorithm is the ID3 algorithm developed by Quinlan (1986).
It employs a top-down, greedy search through the space of possible decision trees.

First, we select the best attribute to be tested at the root of the tree.
To make this selection, each attribute is evaluated using a statistical test to determine how well it alone classifies the training examples.

The selection process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.

This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.

Which Attribute is the Best Classifier?

The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree.
We would like to select the attribute that is most useful for classifying examples, so we need a good quantitative measure.
For this purpose a statistical property called information gain is used.

Which Attribute is the Best Classifier?: Definition of Entropy

In order to define information gain precisely, we begin by defining entropy, a measure commonly used in information theory.
Entropy characterizes the impurity of an arbitrary collection of examples.

Suppose there are 14 samples, of which 9 are positive (class A) and 5 are negative (class B). The class probabilities are:
p(A) = 9/14
p(B) = 5/14
The entropy of this collection X is
H(X) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940

This measure is called the entropy H. In general, for a collection X whose classes i occur with probabilities p_i,
H(X) = - Σ_i p_i log2(p_i)
High entropy means that the examples occur with more nearly equal probabilities (and are therefore not easily predictable).
Low entropy means easy predictability.
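
A minimal sketch of this computation in Python (the function name and interface are my own, not from the slides):

    import math

    def entropy(class_counts):
        """Entropy of a collection, given the number of examples per class."""
        total = sum(class_counts)
        return -sum((n / total) * math.log2(n / total)
                    for n in class_counts if n > 0)  # 0 * log2(0) taken as 0

    print(entropy([9, 5]))   # ~0.940, the 9-positive / 5-negative collection above
    print(entropy([7, 7]))   # 1.0, maximal impurity for two classes
    print(entropy([14, 0]))  # 0.0, a pure collection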

Some properties of entropy:
• Entropy is 0 if all members of S belong to the same class.
For example, if all members are positive (p+ = 1), then p- is 0, and
Entropy(S) = -1·log2(1) - 0·log2(0) = -1·0 - 0 = 0
(taking 0·log2(0) to be 0).
• Entropy is 1 when the collection contains an equal number of positive and negative examples.
• If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1.

Which Attribute is the Best Classifier?: Information Gain

Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute's value:
Gain(Y | X) = H(Y) - H(Y | X)
For example, if H(Y) = 1.0 and H(Y | X) = 0.5, the gain is 1.0 - 0.5 = 0.5.
Interpretation: when transmitting Y, how many bits would be saved if both sides of the line knew X?
In general we write Gain(S, A), where S is the collection of examples and A is an attribute:
Gain(S, A) = Entropy(S) - Σ_{v in Values(A)} (|S_v| / |S|) · Entropy(S_v)
where S_v is the subset of S for which attribute A has value v.
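
A minimal sketch of this measure, building on the entropy function above (names are my own):

    def information_gain(parent_counts, subset_counts):
        """Gain(S, A): entropy of S minus the weighted entropy of the
        subsets S_v produced by splitting S on attribute A."""
        total = sum(parent_counts)
        weighted = sum(sum(counts) / total * entropy(counts)
                       for counts in subset_counts)
        return entropy(parent_counts) - weighted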

Consider again the collection of examples with 9 positive and 5 negative values.
Eight of these examples (6 positive and 2 negative) have the attribute value Wind = Weak.
Six of these examples (3 positive and 3 negative) have the attribute value Wind = Strong.

The information gain obtained by partitioning the examples according to the attribute Wind is calculated as:
Gain(S, Wind) = Entropy(S) - (8/14)·Entropy(S_Weak) - (6/14)·Entropy(S_Strong)
             = 0.940 - (8/14)·0.811 - (6/14)·1.000
             ≈ 0.048
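
The same number falls out of the sketch functions above:

    # S has 9 positive / 5 negative examples; Wind splits it into
    # Weak (6+, 2-) and Strong (3+, 3-).
    print(information_gain([9, 5], [[6, 2], [3, 3]]))  # ~0.048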

We calculate the information gain for each attribute and select the attribute having the highest information gain.

Select Attributes which Minimize Disorder

We build the decision tree by selecting tests which minimize disorder (i.e., maximize information gain).

If only base-10 logarithms are available, the formula can be converted from log2 to log10 by the change-of-base rule:
log_x(M) = log10(M) · log_x(10) = log10(M) / log10(x)
Hence log2(Y) = log10(Y) / log10(2)
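
A quick check of the identity in Python:

    import math

    y = 0.75
    print(math.log2(y))                   # direct base-2 logarithm
    print(math.log10(y) / math.log10(2))  # same value via change of base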

Example

The process of selecting a new attribute is now repeated for each (non-terminal) descendant node, this time using only the training examples associated with that node.
Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree.

This process continues for each new leaf node until either:
1. Every attribute has already been included along this path through the tree, or
2. The training examples associated with the leaf node have zero entropy.
A sketch of the resulting recursive procedure is given below.
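
A minimal sketch of an ID3-style recursive learner, under simplifying assumptions (discrete attributes, examples given as (attributes-dict, label) pairs) and building on the entropy function above; all names are my own:

    from collections import Counter

    def id3(examples, attributes):
        """Grow a tree top-down; returns a nested-dict tree as used earlier."""
        labels = [label for _, label in examples]
        # Stop when the node is pure (zero entropy) or no attributes remain;
        # in the latter case return the majority label.
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        # Greedily pick the attribute with the highest information gain.
        best = max(attributes, key=lambda a: gain_on(examples, a))
        remaining = [a for a in attributes if a != best]  # once per path
        tree = {best: {}}
        for value in {attrs[best] for attrs, _ in examples}:
            subset = [(attrs, label) for attrs, label in examples
                      if attrs[best] == value]
            tree[best][value] = id3(subset, remaining)
        return tree

    def gain_on(examples, attribute):
        """Gain(S, A) computed directly from labeled examples."""
        def counts(exs):
            return list(Counter(label for _, label in exs).values())
        by_value = {}
        for attrs, label in examples:
            by_value.setdefault(attrs[attribute], []).append((attrs, label))
        weighted = sum(len(sub) / len(examples) * entropy(counts(sub))
                       for sub in by_value.values())
        return entropy(counts(examples)) - weighted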

From Decision Trees to Rules

Next step: make rules from the decision tree.
After building the tree, we trace each path from the root node to a leaf node, recording the test outcomes as antecedents and the leaf node's classification as the consequent.
For our example we have:
If the Outlook is Sunny and the Humidity is High then No
If the Outlook is Sunny and the Humidity is Normal then Yes
...
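
A minimal sketch of this path tracing over the nested-dict tree encoding used above (names are my own):

    def tree_to_rules(node, antecedents=()):
        """Trace every root-to-leaf path, yielding (antecedents, consequent)."""
        if not isinstance(node, dict):  # leaf: one finished rule
            yield antecedents, node
            return
        attribute = next(iter(node))
        for value, subtree in node[attribute].items():
            yield from tree_to_rules(subtree, antecedents + ((attribute, value),))

    for conditions, label in tree_to_rules(tree):
        tests = " and ".join(f"the {a} is {v}" for a, v in conditions)
        print(f"If {tests} then {label}")
    # e.g. "If the Outlook is Sunny and the Humidity is High then No"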

Hypothesis Space Search

ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.
The space searched is the set of possible decision trees.
ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space.

It begins with an empty tree, then considers progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data.
The evaluation function that guides this hill-climbing search is the information gain measure.

Some points to note:
• The hypothesis space of all decision trees is a complete space; hence the target function is guaranteed to be present in it.
• ID3 maintains only a single current hypothesis as it searches through the space of decision trees. By keeping only a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses. For example, it cannot determine how many alternative decision trees are consistent with the training data, or pose new instance queries that would optimally resolve among these competing hypotheses.
• ID3 performs no backtracking, and is therefore susceptible to converging to locally optimal solutions.
• ID3 uses all training examples at each step to refine its current hypothesis. This makes it less sensitive to errors in individual training examples. However, it requires that all the training examples be available from the beginning, so the learning cannot be done incrementally over time.