Christof Monz
Informatics Institute
University of Amsterdam
Data Mining
Week 2: Decision Tree Learning
Today’s Class
Decision Trees
Decision tree learning algorithms
Learning bias
Overfitting
Pruning
Extensions to learning with real values
Decision Tree Learning
Main algorithms introduced by Quinlan in the
1980s
A decision tree is a set of hierarchically nested
classification rules
Each rule is a node in the tree that tests a
specific attribute
Branches correspond to the values of the
attributes
Example Data
When to play tennis? (training data)
day outlook temperature humidity wind play
1 sunny hot high weak no
2 sunny hot high strong no
3 overcast hot high weak yes
4 rain mild high weak yes
5 rain cool normal weak yes
6 rain cool normal strong no
7 overcast cool normal strong yes
8 sunny mild high weak no
9 sunny cool normal weak yes
10 rain mild normal weak yes
11 sunny mild normal strong yes
12 overcast mild high strong yes
13 overcast hot normal weak yes
14 rain mild high strong no
Decision Tree
Nodes check attribute values
Leaves are final classifications
Decision trees can be represented as logical
expressions in disjunctive normal form
Each path from the root corresponds to a
conjunction of attribute-value equations
All paths of the tree are combined by disjunction
(outlook=sunny ∧ humidity=normal)
∨ (outlook=overcast)
∨ (outlook=rain ∧ wind=weak)
Appropriate Problems for DTs
Attributes have discrete values (real-value
extension discussed later)
The class values are discrete (real-value
extension discussed later)
Training data may contain errors
Training data may contain instances with
missing/unknown attribute values
Learning Decision Trees
Many different decision trees can be learned for
a given training set
A number of criteria apply
• The tree should be as accurate as possible
• The tree should be as simple as possible
• The tree should generalize as well as possible
Basic questions
• Which attributes should be included in the tree?
• In which order should they be used in the tree?
Standard decision tree learning algorithms: ID3
and C4.5
Entropy
The better an attribute discriminates the classes
in the data, the higher it should be in the tree
How do we quantify the degree of
discrimination?
One way to do this is to use entropy
Entropy measures the uncertainty/ambiguity in
the data
H(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
where p⊕ (p⊖) is the probability of a
positive (negative) class occurring in S
In general, the entropy of a subset S of the
training examples with respect to the target
class is defined as:
H(S) = ∑_{c∈C} −p_c log2 p_c
where C is the set of possible classes and p_c is
the probability that an instance in S belongs to
class c
Note: we define 0 log2 0 = 0
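As an illustration, the general formula translates directly into a few lines of Python (a minimal sketch; the function name and the use of collections.Counter are choices made here, not part of the slides):

  import math
  from collections import Counter

  def entropy(labels):
      """H(S) for a list of class labels; 0*log2(0) is treated as 0 because
      classes with zero count simply do not appear in the Counter."""
      n = len(labels)
      return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

  # Entropy of the tennis training data: 9 "yes" and 5 "no" instances.
  print(round(entropy(["yes"] * 9 + ["no"] * 5), 2))   # 0.94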
Information Gain
Information gain is the reduction in entropy
gain(S,A) = H(S) − ∑_{v∈values(A)} (|S_v|/|S|) H(S_v)
where values(A) is the set of possible values of
attribute A and S_v is the subset of S for which
attribute A has value v
gain(S,A) is the number of bits saved when
encoding an arbitrary member of S by knowing
the value of attribute A
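Building on the entropy sketch above, information gain can be computed per attribute as follows (again an illustrative sketch, with the instances represented as parallel lists of class labels and attribute values):

  def gain(labels, attr_values):
      """gain(S, A): entropy of S minus the weighted entropies of the
      subsets S_v induced by the values of attribute A."""
      n = len(labels)
      total = entropy(labels)
      for v in set(attr_values):
          subset = [c for c, a in zip(labels, attr_values) if a == v]
          total -= (len(subset) / n) * entropy(subset)
      return total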
Information Gain Example
S = [9+, 5−]
values(wind) = {weak, strong}
S_weak = [6+, 2−]
S_strong = [3+, 3−]
gain(S, wind) = H(S) − ∑_{v∈{weak,strong}} (|S_v|/|S|) H(S_v)
             = H(S) − (8/14) H(S_weak) − (6/14) H(S_strong)
             = 0.94 − (8/14)·0.811 − (6/14)·1.0
             = 0.048
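Feeding the wind and play columns of the training table into the sketches above reproduces this number:

  wind = ["weak", "strong", "weak", "weak", "weak", "strong", "strong",
          "weak", "weak", "weak", "strong", "strong", "weak", "strong"]
  play = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
  print(round(gain(play, wind), 3))   # 0.048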
ID3 DT Learning
The ID3 algorithm computes the information
gain for each node in the tree and each
attribute, and chooses the attribute with the
highest gain
For instance, at the root (S) the gains are:
• gain(S,outlook) = 0.246
gain(S,humidity) = 0.151
gain(S,wind) = 0.048
gain(S,temperature) = 0.029
• Hence outlook is chosen for the top node
ID3 then iteratively selects the attribute with
the highest gain for each daughter of the
previous node, and so on
ID3 Algorithm
ID3(S', S, node, attr)
  if (for all s in S: class(s) = c)
    return leaf node with class c
  else if (attr is empty)
    return leaf node with most frequent class in S
  else if (S is empty)
    return leaf node with most frequent class in S'
  else
    a = argmax_{a' ∈ attr} gain(S, a')
    attribute(node) = a
    for each v ∈ values(a)
      new(node_v); new edge(node, node_v); label(node, node_v) = v
      ID3(S, S_v, node_v, attr − {a})
Initial call: ID3(∅, S, root, A)
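For illustration, the pseudocode can be rendered as a compact Python sketch that reuses the entropy and gain helpers above; representing examples as dicts with a "class" key, the attribute set as a Python set, and internal nodes as (attribute, branches) tuples is an assumption made here, not something prescribed by the slides:

  from collections import Counter

  def most_frequent_class(examples):
      return Counter(e["class"] for e in examples).most_common(1)[0][0]

  def id3(parent_examples, examples, attributes):
      """ID3 sketch: returns either a class label (leaf) or an
      (attribute, branches) tuple, where branches maps each observed
      attribute value to a subtree."""
      if not examples:                                  # S is empty
          return most_frequent_class(parent_examples)   # fall back to S'
      classes = [e["class"] for e in examples]
      if len(set(classes)) == 1:                        # all instances agree
          return classes[0]
      if not attributes:                                # no attributes left
          return most_frequent_class(examples)
      best = max(attributes,                            # highest information gain
                 key=lambda a: gain(classes, [e[a] for e in examples]))
      branches = {}
      for v in set(e[best] for e in examples):
          subset = [e for e in examples if e[best] == v]
          branches[v] = id3(examples, subset, attributes - {best})
      return (best, branches)

  def classify(tree, example, default):
      """Follow attribute tests until a leaf (class label) is reached."""
      while isinstance(tree, tuple):
          attr, branches = tree
          tree = branches.get(example[attr], default)
      return tree

Calling id3([], training_examples, attribute_set) mirrors the initial call ID3(∅, S, root, A), except that the tree is returned rather than built through side effects on node.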
Hypothesis Search Space
The hypothesis space searched by ID3 is the set
of possible decision trees
Hill-climbing (greedy) search guided purely by
the information gain measure
• Only one hypothesis is considered for further extension
• No back-tracking to hypotheses dismissed earlier
All (relevant) training examples are used to
guide search
Due to greedy search, ID3 can get stuck in a
local optimum
Inductive Bias of ID3
ID3 has a preference for small trees (in
particular short trees)
ID3 has a preference for trees with high
information gain attributes near the root
Note, a bias is a preference for some hypotheses,
rather than a restriction of the hypothesis space
Some form of bias is required in order to
generalize beyond the training data
Evaluation
How good is the learned decision tree?
Split the available data into a training set and
a test set
Sometimes the data already comes with a
pre-defined split
Rule of thumb: use 80% for training and 20%
for testing
The test set should be large enough to draw
stable conclusions
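A minimal sketch of the 80/20 rule of thumb (the function name and the fixed seed for reproducibility are choices made here):

  import random

  def train_test_split(examples, test_fraction=0.2, seed=0):
      """Shuffle the data and split it into a training and a test set."""
      data = list(examples)
      random.Random(seed).shuffle(data)
      cut = int(len(data) * (1 - test_fraction))
      return data[:cut], data[cut:]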
Cross-Validation
What if the available data is rather small?
Re-run training and testing on n different
portions of the data
Known as n-fold cross-validation
Compute the accuracy over the combined test
portions from all folds
This also allows one to report the variation
across folds
Stratified cross-validation makes sure that the
different folds contain the same proportions of
class labels
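A sketch of plain (unstratified) n-fold cross-validation; train_and_test is a placeholder callback, assumed here to train on one portion and return the accuracy on the other:

  def cross_validate(examples, n_folds, train_and_test):
      """Run n-fold cross-validation and return the per-fold accuracies,
      so that the variation across folds can be reported as well."""
      folds = [examples[i::n_folds] for i in range(n_folds)]
      accuracies = []
      for i, test in enumerate(folds):
          train = [e for j, fold in enumerate(folds) if j != i for e in fold]
          accuracies.append(train_and_test(train, test))
      return accuracies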
Occam’s Razor
Occam’s Razor (OR): Prefer the simplest
hypothesis that fits the data
Pro OR: A long hypothesis that fits the data
merely describes the data; it does not model
the underlying principle that generated the data
Pro OR: A short hypothesis that fits the data is
unlikely to do so by coincidence
Con OR: There are numerous ways to define the
size of hypotheses
Overfitting
Definition: Given a hypothesis space H, a
hypothesis h ∈ H is said to overfit the training
data if there exists some alternative hypothesis
h′ ∈ H such that h has a smaller error than h′
over the training data, but h′ has a smaller error
than h over the entire distribution of data.
Roughly speaking, a hypothesis h overfits if it
does not generalize beyond the training data
as well as some other hypothesis h′
Reasons for overfitting
• The training set is too small
• The training data is not representative of the real
distribution
• The training data contains errors (measurement errors,
human annotation errors, . . . )
Overfitting is a significant issue in practical
data mining applications
Two approaches to reduce overfitting
• Stop tree growth early (before it classifies the training
data perfectly)
• Post-pruning of the tree (after the complete tree has been
learned)
Reduced Error Pruning
Split the training set into two sets:
• training subset (approx. 80%)
• validation set (approx. 20%)
A decision tree T is learned from the training
subset
Prune the node in T that leads to the highest
improvement on the validation set (and repeat
for all nodes until accuracy drops)
Pruning a node in a tree replaces the node
and its subtree with the most common class under
the node
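A sketch of reduced error pruning on the (attribute, branches) trees and classify helper from the ID3 sketch earlier; greedily accepting prunes as long as validation accuracy does not drop is one reasonable reading of the procedure above, not the only possible one:

  from collections import Counter

  def accuracy(tree, examples, default):
      return sum(classify(tree, e, default) == e["class"]
                 for e in examples) / len(examples)

  def candidate_prunes(tree, train_examples):
      """Yield every tree obtained by replacing exactly one internal node
      with the most common training class under that node."""
      if not isinstance(tree, tuple):
          return
      attr, branches = tree
      yield Counter(e["class"] for e in train_examples).most_common(1)[0][0]
      for v, sub in branches.items():
          sub_train = [e for e in train_examples if e[attr] == v]
          for pruned in candidate_prunes(sub, sub_train):
              yield (attr, {**branches, v: pruned})

  def reduced_error_prune(tree, train_examples, validation, default):
      best_acc = accuracy(tree, validation, default)
      improved = True
      while improved:
          improved = False
          for cand in candidate_prunes(tree, train_examples):
              acc = accuracy(cand, validation, default)
              if acc >= best_acc:     # prune while accuracy does not drop
                  tree, best_acc, improved = cand, acc, True
      return tree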
Rule Post-Pruning
Instead of pruning entire subtrees, rule
post-pruning affects only parts of the decision
chain
Convert tree into set of rules
• Each path is represented as a rule of the form:
If a1 = v1 ∧...∧an = vn then class = c
• For example:
If outlook = sunny∧humidity = high then play = no
Remove conjuncts in the order of their improvement on
the validation set, until no further improvement is obtained
All paths are pruned independently of each other
Continuous-Valued Attributes
Treating real-valued attributes (like
temperature) as discrete values is clearly
inappropriate
Use a threshold: if value(a) < c then . . . else . . .
Threshold c can be determined by computing
the maximum information gain for different
candidate thresholds
Note: Numeric attributes can be repeated along
the same path
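A sketch of threshold selection for a numeric attribute, reusing the gain helper above; taking midpoints between consecutive sorted values as candidate thresholds is a common choice assumed here:

  def best_threshold(values, labels):
      """Return the threshold c (and its gain) that maximises information
      gain for the binary split value < c versus value >= c."""
      pairs = sorted(zip(values, labels))
      sorted_labels = [l for _, l in pairs]
      best_c, best_gain = None, -1.0
      for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
          if v1 == v2:
              continue
          c = (v1 + v2) / 2
          split = ["below" if v < c else "above" for v, _ in pairs]
          g = gain(sorted_labels, split)
          if g > best_gain:
              best_c, best_gain = c, g
      return best_c, best_gain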
Information Gain Revisited
H(S) = ∑_{c∈C} −p_c log2 p_c
gain(S,A) = H(S) − ∑_{v∈values(A)} (|S_v|/|S|) H(S_v)
Information gain favors attributes with many
values over those with few. Why?
Extension: Measure how broadly and uniformly
the attribute splits the data:
split(S,A) = − ∑_{v∈values(A)} (|S_v|/|S|) log2 (|S_v|/|S|)
split is the entropy of the attribute-value
distribution in S
Information gain and split can be combined:
gain_ratio(S,A) = gain(S,A) / split(S,A)
If |values(A)| = n and A completely determines
the class, then split(S,A) = log2 n
If |S_v| ≈ |S| for one value v, then split(S,A) becomes
small and boosts the gain ratio
Heuristic: Compute the information gain first,
remove attributes with below-average gain, and
then select the attribute with the highest
gain ratio
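In code, split(S,A) is just the entropy sketch applied to the attribute values themselves, and the gain ratio follows directly (guarding against a zero split is an implementation choice made here):

  def split_info(attr_values):
      """split(S, A): entropy of the attribute-value distribution in S."""
      return entropy(attr_values)

  def gain_ratio(labels, attr_values):
      s = split_info(attr_values)
      return gain(labels, attr_values) / s if s > 0 else 0.0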
Missing Attribute Values
In real-world data it is not unusual that some
instances have missing attribute values
How to compute gain(S,A) for instances with
missing values?
Assume instance x with class(x) = c and
a(x) =?
• Take the most frequent value of a of all instances in S
(with the same class)
• Take the average value of a of all instances in S (with
the same class) for numeric attributes
Instead of choosing the single most frequent
value, use fractional instances
E.g., if p(a(x) = 1|S) = 0.6 and
p(a(x) = 0|S) = 0.4 then 0.6 (0.4) fractional
instances with missing values for a are passed
down the a = 1 (a = 0) branch
Entropy computation has to be adapted
accordingly
Attributes with Different Costs
In real-world scenarios there can be costs
associated with computing the values of
attributes (medical tests, computing time, . . . )
Considering costs may favor the use of
lower-cost attributes
Suggested measures include:
• gain(S,A) / cost(A)
• (2^gain(S,A) − 1) / (cost(A) + 1)^w
where w ∈ [0,1] is a weight determining the importance
of cost
Predicting Continuous Values
So far we have focused on predicting discrete
classes (i.e. nominal classification)
What has to change when predicting real
values?
Splitting criterion redefined
• Information gain:
  gain(S,A) = H(S) − ∑_{v∈values(A)} (|S_v|/|S|) H(S_v)
• Standard deviation reduction (SDR):
  sdr(S,A) = std_dev(S) − ∑_{v∈val(A)} (|S_v|/|S|) std_dev(S_v)
  where std_dev(S) = √( (1/|S|) ∑_{s∈S} (val(s) − avg_val(S))² )
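A sketch of the standard deviation reduction criterion for numeric targets, mirroring the gain sketch above (function names are illustrative):

  import math

  def std_dev(values):
      mean = sum(values) / len(values)
      return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

  def sdr(targets, attr_values):
      """sdr(S, A): standard deviation of the targets in S minus the weighted
      standard deviations of the subsets induced by attribute A."""
      n = len(targets)
      total = std_dev(targets)
      for v in set(attr_values):
          subset = [t for t, a in zip(targets, attr_values) if a == v]
          total -= (len(subset) / n) * std_dev(subset)
      return total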
Stopping criterion in nominal classification:
stop when all instances in S have the same class
Too fine-grained for real-value prediction
Stop when the standard deviation at node n is less
than some predefined ratio of the standard deviation
of the original instance set:
Stop if std_dev(S_n) / std_dev(S_all) < θ, where, e.g., θ = 0.05
If we decide not to split further on node n, what
should the predicted value be?
Simple solution: predict the average target value of the
instances underneath node n:
class(n) = avg_val(S_n)
This approach is used in regression trees
More sophisticated: associate linear regression
models with all leaf nodes (model trees)
Model Tree Learning
Suppose we have a leaf node n; regression trees
use the average target value of the instances
under n
A more fine-grained approach is to apply linear
regression to all instances under n:
class(n) = a + b1·x1 + b2·x2 + ··· + bm·xm
where x1,x2,...,xm are the values of the
attributes that lead to n in the tree
a and bi are estimated just like in linear
regression
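As a sketch, the leaf models can be fitted with an ordinary least-squares solve (using numpy here; this assumes the attribute values reaching the leaf are already numeric, which is exactly the problem raised just below):

  import numpy as np

  def fit_leaf_model(X, y):
      """Fit class(n) = a + b1*x1 + ... + bm*xm to the instances under a leaf;
      X has shape (n_instances, m), y holds the target values."""
      A = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend an intercept column
      coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
      return coeffs                                   # [a, b1, ..., bm]

  def predict_leaf(coeffs, x):
      return coeffs[0] + float(np.dot(coeffs[1:], x))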
Problem: Not all attributes are numerical!
Converting Nominal Attributes
Assume a nominal attribute such as
outlook = {sunny,overcast,rain}
We can convert this into numerical values
simply by choosing equi-distant values from a
specific interval: outlook = {1,0.5,0}
This assumes an intuitive ordering of the values:
sunny > overcast > rain
Direct ordering of values not always possible:
city = {london,new york,tokyo}
london > new york > tokyo ???
Sort nominal values of attribute A by their
average target values
If nominal attribute A has k values, introduce
k−1 synthetic binary attributes
The ith binary attribute checks whether the instance's
value is among the first i values in the ordering
For instance, if avg_trg_val(new york) <
avg_trg_val(london) < avg_trg_val(tokyo), then
the k−1 synthetic binary attributes are:
is_new_york and is_new_york_OR_london
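A sketch of this conversion (the helper name and the list-based representation of the attribute column and targets are choices made here):

  def synthetic_binary_attributes(values, targets):
      """Sort the nominal values by average target value and build k-1 binary
      attributes: the i-th is true iff an instance's value is among the first
      i values of that ordering (e.g. is_new_york, is_new_york_or_london)."""
      avg = {v: sum(t for v2, t in zip(values, targets) if v2 == v) / values.count(v)
             for v in set(values)}
      order = sorted(avg, key=avg.get)
      binary_attrs = []
      for i in range(1, len(order)):
          prefix = set(order[:i])
          binary_attrs.append([v in prefix for v in values])
      return order, binary_attrs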
Recap
Elements of a decision tree
Information gain
ID3 algorithm
Bias of ID3
Overfitting and Pruning
Attributes with many values (gain ratio)
Attributes with continuous values
Attributes with missing values
Predicting continuous classes:
• Regression trees
• Model trees
