C4.5 enhances ID3 by making it more robust to noise, handling continuous attributes, dealing with missing data, and supporting the conversion of decision trees to rules. It avoids overfitting through pre-pruning and post-pruning techniques. For a continuous attribute it evaluates the candidate split points and chooses the one with the lowest entropy. Missing data can be treated as a separate attribute value, though this is not always appropriate; the alternative is to split such instances across the branches. Rules are generated from the tree greedily, pruning conditions from each rule to reduce its estimated error. The next topic will be instance-based classifiers.
Introduced the ID3 decision tree and its drawbacks: it handles only nominal attributes and copes poorly with noisy data.
Discussed moving from ID3 to C4.5, enhancing robustness against noise, managing continuous attributes, and handling missing data.
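To make the ID3-to-C4.5 step concrete, here is a minimal pure-Python sketch of both split criteria: ID3's information gain and C4.5's gain ratio, which divides the gain by the split's own entropy to penalize many-valued attributes. The toy weather-style data and function names are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(pairs):
    """ID3's criterion: entropy reduction after splitting on the attribute."""
    labels = [label for _, label in pairs]
    by_value = {}
    for value, label in pairs:
        by_value.setdefault(value, []).append(label)
    remainder = sum(len(subset) / len(pairs) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder

def gain_ratio(pairs):
    """C4.5's criterion: gain normalized by the split's own entropy,
    which penalizes attributes with many distinct values."""
    split_info = entropy([value for value, _ in pairs])
    return information_gain(pairs) / split_info if split_info else 0.0

# e.g. the attribute 'outlook' against invented play/don't-play labels
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
        ("overcast", "yes"), ("sunny", "no")]
print(information_gain(data), gain_ratio(data))
```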
Defined overfitting and the reasons systems may overfit in noisy domains; introduced Occam's razor as a preference for simpler hypotheses.
Techniques like pre-pruning and post-pruning to prevent overfitting. Discussed reliance on statistical tests for pruning decisions.
Explained post-pruning and subtree replacement/raising methods, considering their strengths and weaknesses.
Outlined methods for estimating error rates for pruning decisions, emphasizing validation sets over training data.
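The two previous points fit together in reduced-error pruning: replace a subtree with a leaf predicting its majority class whenever that does not increase error on a held-out validation set. A minimal sketch, assuming a dictionary-based tree representation that is not from the slides:

```python
def classify(tree, instance):
    """A leaf is a class label; an internal node tests one attribute."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(instance[tree["attr"]], tree["majority"])
    return tree

def errors(tree, validation):
    return sum(classify(tree, x) != y for x, y in validation)

def prune(tree, validation):
    """Bottom-up subtree replacement: swap a subtree for a leaf carrying
    its majority class when that does not increase validation error.
    (For brevity every node is scored on the full validation set; a
    faithful implementation would route each instance to its subtree.)"""
    if not isinstance(tree, dict):
        return tree
    tree["branches"] = {v: prune(sub, validation)
                        for v, sub in tree["branches"].items()}
    if errors(tree["majority"], validation) <= errors(tree, validation):
        return tree["majority"]        # replace the subtree with a leaf
    return tree

# Invented example: the 'windy' subtree collapses to the leaf "no".
tree = {"attr": "outlook", "majority": "no", "branches": {
        "sunny": {"attr": "windy", "majority": "no",
                  "branches": {True: "no", False: "yes"}},
        "overcast": "yes", "rain": "yes"}}
validation = [({"outlook": "sunny", "windy": True}, "no"),
              ({"outlook": "sunny", "windy": False}, "no"),
              ({"outlook": "rain", "windy": True}, "yes")]
print(prune(tree, validation))
```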
Methods to deal with continuous data, including finding optimal split points and using entropy for evaluation.
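A sketch of the split-point search for a continuous attribute: sort the values, consider a threshold between each pair of adjacent distinct values, and keep the one minimizing the weighted entropy of the resulting binary split. The data below is a made-up fragment in the spirit of the classic weather example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try a cut between each pair of adjacent distinct sorted values and
    return the threshold whose binary split has the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lbl for v, lbl in pairs if v <= t]
        right = [lbl for v, lbl in pairs if v > t]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        best = min(best, (weighted, t))
    return best[1]

# Invented temperature/play data.
temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
play  = ["yes", "no", "yes", "yes", "yes", "no",
         "no", "yes", "no", "yes", "yes", "no"]
print(best_threshold(temps, play))
```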
Methods for treating missing data values and splitting instances accordingly for effective classification.
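One common C4.5-style treatment of a missing value at a test node is to send the instance down every branch with a weight proportional to how many training instances took that branch, then merge the resulting class votes. A sketch under that assumption, using a dictionary tree similar to the pruning example but carrying per-branch training counts (all numbers invented):

```python
def class_distribution(tree, instance, weight=1.0):
    """Route one (possibly incomplete) instance through the tree and
    return {class: accumulated weight}."""
    if not isinstance(tree, dict):
        return {tree: weight}
    value = instance.get(tree["attr"])           # None means missing
    if value is not None:
        return class_distribution(tree["branches"][value], instance, weight)
    dist = {}
    total = sum(tree["counts"].values())         # training counts per branch
    for branch_value, subtree in tree["branches"].items():
        share = weight * tree["counts"][branch_value] / total
        for cls, w in class_distribution(subtree, instance, share).items():
            dist[cls] = dist.get(cls, 0.0) + w
    return dist

# With 'outlook' missing the votes merge to {'no': 5/14, 'yes': 9/14}.
tree = {"attr": "outlook",
        "counts": {"sunny": 5, "overcast": 4, "rain": 5},
        "branches": {"sunny": "no", "overcast": "yes", "rain": "yes"}}
print(class_distribution(tree, {}))
```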
Described the conversion of decision trees to rules, emphasizing benefits like simplicity and independence from context.
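As an illustration of why rules can be simplified independently of their position in the tree, here is a sketch that turns each root-to-leaf path into a conjunctive rule and then greedily drops any condition whose removal does not increase the rule's estimated error on held-out data. Tree format, data, and helper names are illustrative:

```python
def paths_to_rules(tree, conditions=()):
    """Turn every root-to-leaf path into (list of (attr, value), class)."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += paths_to_rules(subtree, conditions + ((tree["attr"], value),))
    return rules

def rule_error(conditions, prediction, data):
    covered = [(x, y) for x, y in data
               if all(x.get(a) == v for a, v in conditions)]
    return sum(y != prediction for _, y in covered) / max(len(covered), 1)

def simplify(rule, data):
    """Greedily drop any condition whose removal does not raise the
    rule's error on held-out data."""
    conditions, prediction = rule
    base = rule_error(conditions, prediction, data)
    for cond in list(conditions):
        trimmed = [c for c in conditions if c != cond]
        err = rule_error(trimmed, prediction, data)
        if err <= base:
            conditions, base = trimmed, err
    return conditions, prediction

# Invented tree and held-out data: the 'outlook = sunny' condition is
# dropped from the (humidity = normal -> yes) rule as redundant.
tree = {"attr": "outlook", "branches": {
        "overcast": "yes",
        "sunny": {"attr": "humidity",
                  "branches": {"high": "no", "normal": "yes"}}}}
held_out = [({"outlook": "sunny", "humidity": "high"}, "no"),
            ({"outlook": "sunny", "humidity": "normal"}, "yes"),
            ({"outlook": "overcast", "humidity": "high"}, "yes")]
for rule in paths_to_rules(tree):
    print(simplify(rule, held_out))
```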
Introduced the topic of instance-based classifiers for the next class, reiterating the presentation's theme.