Dm week02 decision-trees-handout

Christof Monz
Informatics Institute
University of Amsterdam
Data Mining
Week 2: Decision Tree Learning
Today’s Class
Decision Trees
Decision tree learning algorithms
Learning bias
Overfitting
Pruning
Extensions to learning with real values

Decision Tree Learning
Main algorithms introduced by Quinlan in the
1980s
A decision tree is a set of hierarchically nested
classification rules
Each rule is a node in the tree that investigates a
specific attribute
Branches correspond to the values of the
attributes
Example Data
When to play tennis? (training data)
day  outlook   temperature  humidity  wind    play
1    sunny     hot          high      weak    no
2    sunny     hot          high      strong  no
3    overcast  hot          high      weak    yes
4    rain      mild         high      weak    yes
5    rain      cool         normal    weak    yes
6    rain      cool         normal    strong  no
7    overcast  cool         normal    strong  yes
8    sunny     mild         high      weak    no
9    sunny     cool         normal    weak    yes
10   rain      mild         normal    weak    yes
11   sunny     mild         normal    strong  yes
12   overcast  mild         high      strong  yes
13   overcast  hot          normal    weak    yes
14   rain      mild         high      strong  no

Decision Tree
Nodes check attribute values
Leaves are final classifications
Decision Tree
Decision trees can be represented as logical
expressions in disjunctive normal form
Each path from the root corresponds to a
conjunction of attribute-value equations
All paths of the tree are combined by disjunction
(outlook=sunny ∧ humidity=normal)
∨ (outlook=overcast)
∨ (outlook=rain ∧ wind=weak)
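
As a small illustration, the disjunctive normal form above can be written directly as a boolean function. This is only a sketch; the function name `play_tennis` and its string arguments are assumptions made for the example.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Evaluate the DNF expression: one conjunction per 'yes' path of the tree."""
    return (
        (outlook == "sunny" and humidity == "normal")
        or outlook == "overcast"
        or (outlook == "rain" and wind == "weak")
    )

# Day 1 of the training data (sunny, high humidity, weak wind) is a 'no' day.
print(play_tennis("sunny", "high", "weak"))
# Day 9 (sunny, normal humidity, weak wind) is a 'yes' day.
print(play_tennis("sunny", "normal", "weak"))
```

Temperature does not appear because no path of this tree tests it.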

Appropriate Problems for DTs
Attributes have discrete values (real-value
extension discussed later)
The class values are discrete (real-value
extension discussed later)
Training data may contain errors
Training data may contain instances with
missing/unknown attribute values
Learning Decision Trees
Many different decision trees can be learned for
a given training set
A number of criteria apply
• The tree should be as accurate as possible
• The tree should be as simple as possible
• The tree should generalize as well as possible
Basic questions
• Which attributes should be included in the tree?
• In which order should they be used in the tree?
Standard decision tree learning algorithms: ID3
and C4.5

Entropy
The better an attribute discriminates the classes
in the data, the higher it should be in the tree
How do we quantify the degree of
discrimination?
One way to do this is to use entropy
Entropy measures the uncertainty/ambiguity in
the data
$H(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$
where $p_{\oplus}$ / $p_{\ominus}$ is the probability of a
positive/negative class occurring in S
Entropy
In general, the entropy of a subset S of the
training examples with respect to the target
class is defined as:
$H(S) = \sum_{c \in C} -p_c \log_2 p_c$
where C is the set of possible classes and $p_c$ is
the probability that an instance in S belongs to
class c
Note: we define $0 \log_2 0 = 0$
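
The entropy definition can be sketched in a few lines of Python. The function name `entropy` and the choice to pass class counts (rather than instances) are assumptions made for this example.

```python
from math import log2

def entropy(class_counts):
    """Entropy in bits of a set described by its per-class instance counts."""
    total = sum(class_counts)
    h = 0.0
    for count in class_counts:
        if count > 0:  # convention: 0 * log2(0) = 0
            p = count / total
            h -= p * log2(p)
    return h

# The tennis training data has 9 'yes' and 5 'no' instances.
print(round(entropy([9, 5]), 3))
# A perfectly balanced two-class set has the maximum entropy of 1 bit.
print(entropy([7, 7]))
```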

Information Gain
Information gain is the reduction in entropy
$\mathrm{gain}(S,A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)$
where values(A) is the set of possible values of
attribute A and $S_v$ is the subset of S for which
attribute A has value v
gain(S,A) is the number of bits saved when
encoding an arbitrary member of S by knowing
the value of attribute A

Information Gain Example
S = [9+,5−]
values(wind) = {weak,strong}
$S_{weak} = [6+,2-]$
$S_{strong} = [3+,3-]$
$\mathrm{gain}(S,\mathrm{wind}) = H(S) - \sum_{v \in \{weak,strong\}} \frac{|S_v|}{|S|} H(S_v)$
$= H(S) - (8/14) H(S_{weak}) - (6/14) H(S_{strong})$
$= 0.94 - (8/14) \cdot 0.811 - (6/14) \cdot 1.0$
$= 0.048$
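
The worked example can be checked numerically. The sketch below assumes each set is given as a pair of counts `[positives, negatives]`; the function names are chosen for illustration only.

```python
from math import log2

def entropy(counts):
    """Entropy in bits from per-class counts (0 * log2(0) is treated as 0)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent_counts, subset_counts):
    """Information gain of a split: parent entropy minus weighted subset entropy."""
    total = sum(parent_counts)
    remainder = sum(sum(sub) / total * entropy(sub) for sub in subset_counts)
    return entropy(parent_counts) - remainder

# S = [9+,5-], split by wind into weak = [6+,2-] and strong = [3+,3-]
print(round(gain([9, 5], [[6, 2], [3, 3]]), 3))
```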
[Figure: Comparing Information Gains]

ID3 DT Learning
The ID3 algorithm computes the information
gain for each node in the tree and each
attribute, and chooses the attribute with the
highest gain
For instance, at the root (S) the gains are:
• gain(S,outlook) = 0.246
• gain(S,humidity) = 0.151
• gain(S,wind) = 0.048
• gain(S,temperature) = 0.029
• Hence outlook is chosen for the top node
ID3 then iteratively selects the attribute with
the highest gain for each daughter of the
previous node, . . .

ID3 Algorithm
ID3(S', S, node, attr)
  if (for all s in S: class(s) = c)
    return leaf node with class c
  else if (attr is empty)
    return leaf node with most frequent class in S
  else if (S is empty)
    return leaf node with most frequent class in S'
  else
    a = argmax (a' ∈ attr) gain(S, a')
    attribute(node) = a
    for each v ∈ values(a)
      new(node_v); new edge(node, node_v); label(node, node_v) = v
      ID3(S, S_v, node_v, attr − {a})
Initial call: ID3(∅, S, root, A)
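
A minimal Python sketch of the procedure above, run on the tennis training data. It is not the original implementation: the tree is represented as nested dictionaries (an assumption), the node argument is replaced by return values, and the empty-set case is checked first so that the "all same class" test is never applied to an empty set.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "wind"]
RAW = """sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no"""
DATA = [dict(zip(ATTRS + ["play"], line.split())) for line in RAW.splitlines()]
VALUES = {a: sorted({r[a] for r in DATA}) for a in ATTRS}

def entropy(rows):
    counts = Counter(r["play"] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def majority(rows):
    return Counter(r["play"] for r in rows).most_common(1)[0][0]

def id3(parent_rows, rows, attrs, values):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    if not rows:                       # S is empty: fall back to S'
        return majority(parent_rows)
    classes = {r["play"] for r in rows}
    if len(classes) == 1:              # all instances share one class
        return classes.pop()
    if not attrs:                      # no attributes left to test
        return majority(rows)
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3(rows, [r for r in rows if r[best] == v], rest, values)
                   for v in values[best]}}

tree = id3([], DATA, ATTRS, VALUES)
print(tree)
```

On this data the sketch selects outlook at the root, humidity under sunny, and wind under rain. The gains it computes at the root agree with the values listed earlier up to rounding in the last digit.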

Hypothesis Search Space
The hypothesis space searched by ID3 is the set
of possible decision trees
Hill-climbing (greedy) search guided purely by
information gain measure
• Only one hypothesis is considered for further extension
• No back-tracking to hypotheses dismissed earlier
All (relevant) training examples are used to
guide search
Due to greedy search, ID3 can get stuck in a
local optimum
Inductive Bias of ID3
ID3 has a preference for small trees (in
particular short trees)
ID3 has a preference for trees with high
information gain attributes near the root
Note, a bias is a preference for some hypotheses,
rather than a restriction of the hypothesis space
Some form of bias is required in order to
generalize beyond the training data

Evaluation
How good is the learned decision tree?
Split the available data into a training set and
a test set
Sometimes data comes already with a
pre-deﬁned split
Rule of thumb: use 80% for training and 20%
for testing
Test set should be big enough to draw stable
conclusions

Cross-Validation
What if the available data is rather small?
Re-run training and testing on n different
portions of the data
Known as n-fold cross-validation
Compute the accuracy over the combined
test portions of all folds
Also allows one to report variation across
different folds
Stratified cross-validation makes sure that the
different folds contain the same proportions of
class labels
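
The procedure can be sketched as follows. The function names, the round-robin way of stratifying, and the `train`/`predict` callables are all assumptions of this example; the learner used in the usage lines is a trivial majority-class baseline, not a decision tree.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Deal the indices of each class out round-robin over n_folds folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0  # keeps counting across classes so fold sizes stay balanced
    for indices in by_class.values():
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds

def cross_validate(instances, labels, n_folds, train, predict):
    """Accuracy over the combined test portions of all folds."""
    correct = 0
    for test_idx in stratified_folds(labels, n_folds):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(instances) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train(train_x, train_y)
        correct += sum(predict(model, instances[i]) == labels[i] for i in test_idx)
    return correct / len(instances)

# Usage with the 14 tennis labels and a majority-class baseline.
labels = "no no yes yes yes no yes no yes yes yes yes yes no".split()
instances = list(range(len(labels)))  # placeholders: the baseline ignores attributes
majority = lambda xs, ys: max(set(ys), key=ys.count)
acc = cross_validate(instances, labels, 2, majority, lambda model, x: model)
print(acc)
```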

Occam’s Razor
Occam’s Razor (OR): Prefer the simplest
hypothesis that fits the data
Pro OR: A long hypothesis that fits the data
merely describes the data; it does not model
the underlying principle that generated the data
Pro OR: A short hypothesis that fits the data is
unlikely to be a coincidence
Con OR: There are numerous ways to define the
size of hypotheses
Overfitting
Definition: Given a hypothesis space H, a
hypothesis h ∈ H is said to overfit the training
data if there exists some alternative hypothesis
h′ ∈ H, such that h has a smaller error than h′
over the training data, but h′ has a smaller error
than h on the entire distribution of data.
Roughly speaking, a hypothesis h overfits if it
does not generalize as well beyond the training
data as another hypothesis h′

Overfitting
Reasons for overfitting
• The training set is too small
• The training data is not representative of the real
distribution
• The training data contains errors (measurement errors,
human annotation errors, . . . )
Overfitting is a significant issue in practical
data mining applications
Two approaches to reduce overfitting
• Stop tree growth early (before the tree classifies the
data perfectly)
• Post-pruning of tree (after the complete tree has been
learned)

Reduced Error Pruning
Split the training set into two sets:
• training subset (approx. 80%)
• validation set (approx. 20%)
A decision tree T is learned from the training
subset
Prune the node in T that leads to the highest
improvement on the validation set (and repeat
for all nodes until accuracy drops)
Pruning a node in a tree substitutes the node
and its subtree by the most common class under
the node
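
A minimal sketch of reduced error pruning, under several assumptions that are not in the slides: internal nodes are dictionaries with the keys `attr`, `children` and `majority` (the most common training class under the node), leaves are class labels, and pruning continues as long as validation accuracy does not drop. The example tree and validation instances are made up for illustration.

```python
import copy

def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree["children"].get(x[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_paths(tree, path=()):
    """Yield the sequence of branch values leading to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from internal_paths(child, path + (value,))

def pruned_at(tree, path):
    """Copy of tree with the node at path replaced by its majority class."""
    if not path:
        return tree["majority"]
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for value in path[:-1]:
        node = node["children"][value]
    node["children"][path[-1]] = node["children"][path[-1]]["majority"]
    return new_tree

def reduced_error_prune(tree, validation):
    best_acc = accuracy(tree, validation)
    while isinstance(tree, dict):
        candidates = [pruned_at(tree, p) for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) < best_acc:
            break  # every possible pruning step would lower validation accuracy
        tree, best_acc = best, accuracy(best, validation)
    return tree

# Hypothetical tree whose 'rain' subtree was fitted to noise.
tree = {"attr": "outlook", "majority": "yes", "children": {
    "overcast": "yes",
    "sunny": {"attr": "humidity", "majority": "no",
              "children": {"high": "no", "normal": "yes"}},
    "rain": {"attr": "temperature", "majority": "yes",
             "children": {"mild": "yes", "cool": "no"}}}}
validation = [
    ({"outlook": "rain", "temperature": "cool", "humidity": "normal"}, "yes"),
    ({"outlook": "rain", "temperature": "mild", "humidity": "high"}, "yes"),
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high"}, "no"),
    ({"outlook": "sunny", "temperature": "cool", "humidity": "normal"}, "yes"),
    ({"outlook": "overcast", "temperature": "hot", "humidity": "high"}, "yes"),
]
pruned = reduced_error_prune(tree, validation)
print(pruned)
```

In this made-up example only the rain subtree is replaced by a leaf; pruning the humidity test or the root would lower validation accuracy, so both are kept.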
[Figure: Effect of Pruning]

Rule Post-Pruning
Instead of pruning entire subtrees, rule
post-pruning affects only parts of the decision
chain
Convert tree into set of rules
• Each path is represented as a rule of the form:
If $a_1 = v_1 \wedge \ldots \wedge a_n = v_n$ then class = c
• For example:
If outlook = sunny ∧ humidity = high then play = no
Remove conjuncts in order of improvements on
the validation set until no further improvements
All paths are pruned independently of each other
Continuous-Valued Attributes
Considering real-valued attributes (like
temperature) as discrete values is clearly
inappropriate
Use threshold: if(value(a)<c) then . . . else . . .
Threshold c can be determined by computing
the maximum information gain for different
candidate thresholds
Note: Numeric attributes can be repeated along
the same path
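
Threshold selection can be sketched as below. The candidate thresholds are taken to be the midpoints between adjacent distinct values, which is one common choice and an assumption here; the temperature readings are made-up illustrative values, not part of the training data above.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold c, gain) maximizing information gain for value < c."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_c = 0.0, None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold fits between equal values
        c = (v1 + v2) / 2
        left = [y for v, y in pairs if v < c]
        right = [y for v, y in pairs if v >= c]
        g = (base - len(left) / len(pairs) * entropy(left)
                  - len(right) / len(pairs) * entropy(right))
        if g > best_gain:
            best_gain, best_c = g, c
    return best_c, best_gain

# Made-up numeric temperatures with their play labels.
temps = [40, 48, 60, 72, 80, 90]
play = ["no", "no", "yes", "yes", "yes", "no"]
c, g = best_threshold(temps, play)
print(c, round(g, 3))
```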

Information Gain Revisited
$H(S) = \sum_{c \in C} -p_c \log_2 p_c$
$\mathrm{gain}(S,A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)$
Information gain favors attributes with many
values over those with few. Why?
Extension: Measure how broadly and uniformly
the attribute splits the data:
$\mathrm{split}(S,A) = -\sum_{v \in values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$
split is the entropy of the attribute-value
distribution in S
Information Gain Revisited
Information gain and split can be combined:
$\mathrm{gain\ ratio}(S,A) = \frac{\mathrm{gain}(S,A)}{\mathrm{split}(S,A)}$
If |values(A)| = n and A splits S into n equally
sized subsets, then $\mathrm{split}(S,A) = \log_2 n$
If $|S_v| \approx |S|$ for one v then split(S,A) becomes
small and boosts the gain ratio
Heuristic: Compute information gain first,
remove attributes with below-average gain and
then select the attribute with the highest
gain-ratio
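
A small sketch of split information and gain ratio, again with sets given as `[positives, negatives]` counts. The function names and the guard that returns 0 when the split information is 0 are assumptions of this example.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_info(subset_sizes):
    """Entropy of the attribute-value distribution."""
    total = sum(subset_sizes)
    return -sum(s / total * log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(parent_counts, subset_counts):
    total = sum(parent_counts)
    g = entropy(parent_counts) - sum(sum(s) / total * entropy(s) for s in subset_counts)
    si = split_info([sum(s) for s in subset_counts])
    return g / si if si > 0 else 0.0  # a single-valued attribute carries no split

# outlook on the tennis data: sunny [2+,3-], overcast [4+,0-], rain [3+,2-]
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))
# An attribute like 'day' puts each of the 14 instances in its own subset,
# so its gain is divided by log2(14), about 3.81, instead of about 1.58.
day_subsets = [[1, 0]] * 9 + [[0, 1]] * 5
print(round(split_info([1] * 14), 3), round(gain_ratio([9, 5], day_subsets), 3))
```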

Missing Attribute Values
In real-world data it is not unusual that some
instances have missing attribute values
How to compute gain(S,A) for instances with
missing values?
Assume instance x with class(x) = c and
a(x) =?
• Take the most frequent value of a of all instances in S
(with the same class)
• Take the average value of a of all instances in S (with
the same class) for numeric attributes
Missing Attribute Values
Instead of choosing the single most frequent
value, use fractional instances
E.g., if p(a(x) = 1|S) = 0.6 and
p(a(x) = 0|S) = 0.4 then 0.6 (0.4) fractional
instances with missing values for a are passed
down the a = 1 (a = 0) branch
Entropy computation has to be adapted
accordingly
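
One way to sketch fractional instances is to give every instance a weight and to compute entropy from summed weights instead of counts. The triple format `(attributes, class, weight)`, the use of `None` for a missing value, and both function names are assumptions of this example.

```python
from collections import defaultdict
from math import log2

def weighted_entropy(weighted_labels):
    """Entropy from (class, weight) pairs, using weight sums instead of counts."""
    totals = defaultdict(float)
    for label, w in weighted_labels:
        totals[label] += w
    total = sum(totals.values())
    return -sum(w / total * log2(w / total) for w in totals.values() if w > 0)

def distribute(instances, attr):
    """Split on attr; instances with a missing value go down every branch
    with a weight proportional to the share of known values."""
    branches = defaultdict(list)
    for x, y, w in instances:
        if x.get(attr) is not None:
            branches[x[attr]].append((x, y, w))
    known = sum(w for rows in branches.values() for _, _, w in rows)
    shares = {v: sum(w for _, _, w in rows) / known for v, rows in branches.items()}
    for x, y, w in instances:
        if x.get(attr) is None:
            for v, share in shares.items():
                branches[v].append((x, y, w * share))
    return dict(branches)

# Three instances with a = 1, two with a = 0, one with a missing:
# p(a = 1) = 0.6 and p(a = 0) = 0.4, as in the example above.
rows = [({"a": 1}, "yes", 1.0), ({"a": 1}, "yes", 1.0), ({"a": 1}, "no", 1.0),
        ({"a": 0}, "no", 1.0), ({"a": 0}, "no", 1.0), ({"a": None}, "yes", 1.0)]
branches = distribute(rows, "a")
for value, members in branches.items():
    print(value, [w for _, _, w in members],
          round(weighted_entropy([(y, w) for _, y, w in members]), 3))
```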

Attributes with Different Costs
In real-world scenarios there can be costs
associated with computing the values of
attributes (medical tests, computing time, . . . )
Considering costs might favor the use of
lower-cost attributes
Suggested measures include:
• $\frac{\mathrm{gain}(S,A)}{\mathrm{cost}(A)}$
• $\frac{2^{\mathrm{gain}(S,A)} - 1}{(\mathrm{cost}(A)+1)^w}$
where w ∈ [0,1] is a weight determining the importance
of cost
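
Both measures are simple to compute once the gain and the cost of an attribute are known. The function names and the numbers in the usage lines are made up for illustration.

```python
def gain_per_cost(gain, cost):
    """First measure: information gain divided by attribute cost."""
    return gain / cost

def weighted_cost_measure(gain, cost, w):
    """Second measure: (2^gain - 1) / (cost + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w

# With w = 0 the cost is ignored; with w = 1 it has full influence.
print(weighted_cost_measure(1.0, 3.0, 0))
print(weighted_cost_measure(1.0, 3.0, 1))
print(gain_per_cost(0.5, 2.0))
```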
Predicting Continuous Values
So far we have focused on predicting discrete
classes (i.e. nominal classification)
What has to change when predicting real
values?
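Splitting criterion redefined
• Information gain:
$\mathrm{gain}(S,A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)$
• Standard deviation reduction (SDR):
$\mathrm{sdr}(S,A) = \mathrm{std\_dev}(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} \mathrm{std\_dev}(S_v)$
where $\mathrm{std\_dev}(S) = \sqrt{\frac{1}{|S|} \sum_{s \in S} (\mathrm{val}(s) - \mathrm{avg\_val}(S))^2}$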
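
A short sketch of standard deviation reduction. It assumes the population standard deviation (the square root of the mean squared deviation); the target values in the usage lines are made up.

```python
from math import sqrt

def std_dev(values):
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(targets, attr_values):
    """Standard deviation of all targets minus the weighted std dev per attribute value."""
    groups = {}
    for v, t in zip(attr_values, targets):
        groups.setdefault(v, []).append(t)
    weighted = sum(len(g) / len(targets) * std_dev(g) for g in groups.values())
    return std_dev(targets) - weighted

# Made-up real-valued targets, split by a nominal attribute.
targets = [10, 12, 30, 32]
outlook = ["rain", "rain", "sunny", "sunny"]
print(round(sdr(targets, outlook), 3))
```

An attribute that separates low from high target values, as here, leaves little spread inside each subset and therefore yields a large reduction.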

Stopping criterion in nominal classification:
Stop when all instances in S have the same class
Too fine-grained for real-value prediction
Stop when the standard deviation of node n is less
than some predefined ratio of the standard deviation
of the original instance set:
Stop if $\frac{\mathrm{std\_dev}(S_n)}{\mathrm{std\_dev}(S_{all})} < \theta$, where, e.g., $\theta = 0.05$
If we decide not to split further on node n, what
should the predicted value be?
Simple solution: the average target value of the
instances underneath node n:
$\mathrm{class}(n) = \mathrm{avg\_val}(S_n)$
This approach is used in regression trees
More sophisticated: associate linear regression
models with all leaf nodes (model trees)

Model Tree Learning
Suppose we have a leaf node n; regression trees
use the average target value of the instances
under n
A more fine-grained approach is to apply linear
regression to all instances under n:
$\mathrm{class}(n) = a + b_1 x_1 + b_2 x_2 + \cdots + b_m x_m$
where $x_1, x_2, \ldots, x_m$ are the values of the
attributes that lead to n in the tree
a and $b_i$ are estimated just like in linear
regression
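
To illustrate the difference between the two kinds of leaves, the sketch below fits a least-squares line at a leaf. It is deliberately simplified to a single numeric attribute (m = 1), which is an assumption of this example; the data points are made up.

```python
def leaf_average(ys):
    """Regression-tree leaf: predict the average target value."""
    return sum(ys) / len(ys)

def fit_leaf(xs, ys):
    """Model-tree leaf with one attribute: least-squares fit of y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return mean_y, 0.0  # constant attribute: fall back to the average
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    a = mean_y - b * mean_x
    return a, b

# Made-up instances under one leaf.
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
a, b = fit_leaf(xs, ys)
print(leaf_average(ys), a, b)
```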
Problem: Not all attributes are numerical!
Converting Nominal Attributes
Assume a nominal attribute such as
outlook = {sunny,overcast,rain}
We can convert this into numerical values
simply by choosing equi-distant values from a
specific interval: outlook = {1,0.5,0}
This assumes an intuitive ordering of the values:
sunny > overcast > rain
Direct ordering of values not always possible:
city = {london,new york,tokyo}
london > new york > tokyo ???

Converting Nominal Attributes
Sort nominal values of attribute A by their
average target values
Introduce k −1 synthetic binary attributes, if
nominal attribute A has k values
The ith binary attribute checks whether the
nominal value is among the first i values in the ordering
For instance, if avg_trg_val(new york) <
avg_trg_val(london) < avg_trg_val(tokyo) then
the k − 1 synthetic binary attributes are:
is_new_york and is_new_york_OR_london
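
The conversion can be sketched as follows. The function names and the target values are made up for illustration; each synthetic attribute is represented as the set of nominal values for which it is true.

```python
def binary_attributes(values, targets):
    """Order nominal values by average target value and build k-1 value sets."""
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[v] = sums.get(v, 0.0) + t
        counts[v] = counts.get(v, 0) + 1
    ordered = sorted(sums, key=lambda v: sums[v] / counts[v])
    # The i-th attribute is true for the first i values of the ordering.
    return ordered, [set(ordered[:i]) for i in range(1, len(ordered))]

def encode(value, groups):
    return [1 if value in g else 0 for g in groups]

# Made-up targets: new york has the lowest average, tokyo the highest.
cities = ["london", "new york", "tokyo", "new york", "london", "tokyo"]
targets = [20.0, 9.0, 31.0, 11.0, 20.0, 29.0]
ordered, groups = binary_attributes(cities, targets)
print(ordered)
for city in ordered:
    print(city, encode(city, groups))
```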
Recap
Elements of a decision tree
Information gain
ID3 algorithm
Bias of ID3
Overfitting and Pruning
Attributes with many values (gain ratio)
Attributes with continuous values
Attributes with missing values
Predicting continuous classes:
• Regression trees
• Model trees
