Machine Learning for Language Technology
Lecture 8: Decision Trees and k-Nearest Neighbors

Marina Santini
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2014

Acknowledgement: Thanks to Prof. Joakim Nivre for course design and materials

Supervised Classification

• Divide instances into (two or more) classes
  – Instance (feature vector): $x = (x_1, \ldots, x_m)$
    • Features may be categorical or numerical
  – Class (label): $y$
  – Training data: $X = \{x^t, y^t\}_{t=1}^{N}$
• Classification in Language Technology
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. no error)
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)

Models for Classification

• Generative probabilistic models:
  – Model of P(x, y)
  – Naive Bayes
• Conditional probabilistic models:
  – Model of P(y | x)
  – Logistic regression
• Discriminative models:
  – No explicit probability model
  – Decision trees, nearest neighbor classification
  – Perceptron, support vector machines, MIRA

Repeating…

• Noise
  – Data cleaning is expensive and time consuming
• Margin
• Inductive bias

Types of inductive biases:
• Minimum cross-validation error
• Maximum margin
• Minimum description length
• [...]

DECISION TREES

Decision Trees

• Hierarchical tree structure for classification
  – Each internal node specifies a test of some feature
  – Each branch corresponds to a value for the tested feature
  – Each leaf node provides a classification for the instance
• Represents a disjunction of conjunctions of constraints
  – Each path from root to leaf specifies a conjunction of tests
  – The tree itself represents the disjunction of all paths

Decision Tree

[Figure: example decision tree]

Divide and Conquer

• Internal decision nodes
  – Univariate: uses a single attribute, $x_i$
    • Numeric $x_i$: binary split: $x_i > w_m$
    • Discrete $x_i$: n-way split for n possible values
  – Multivariate: uses all attributes, $x$
• Leaves
  – Classification: class labels (or proportions)
  – Regression: average of the target values r (or a local fit)
• Learning (see the sketch below):
  – Greedy recursive algorithm
  – Find the best split $X = (X_1, \ldots, X_p)$, then induce a tree for each $X_i$

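To make the greedy recursive procedure concrete, here is a minimal sketch in Python, assuming purely categorical features and entropy as the impurity measure (as defined on the next slides); all function and variable names are illustrative, not from the lecture.

import math
from collections import Counter

def entropy(labels):
    # I_m = -sum_i p_i log2 p_i over the class distribution at a node
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature(rows, labels, features):
    # Pick the feature whose n-way split minimizes weighted impurity
    def split_impurity(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(features, key=split_impurity)

def grow_tree(rows, labels, features):
    # Stop on a pure node (or no features left): emit a majority-class leaf
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, labels, features)
    node = {"feature": f, "branches": {}}
    for v in set(row[f] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[f] == v]
        node["branches"][v] = grow_tree([rows[i] for i in idx],
                                        [labels[i] for i in idx],
                                        [g for g in features if g != f])
    return node

# e.g. grow_tree([("a","b"), ("a","c"), ("b","b")], ["A","A","B"], [0, 1])
# splits on feature 0 and yields pure leaves "A" and "B"
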
Classification Trees (ID3, CART, C4.5)

• For node m, $N_m$ instances reach m, and $N_m^i$ of them belong to class $C_i$:
  $\hat{P}(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}$
• Node m is pure if $p_m^i$ is 0 or 1
• The measure of impurity is entropy:
  $I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$

Example: Entropy

• Assume two classes (C1, C2) and four instances (x1, x2, x3, x4); all logarithms are base 2, with the convention 0 log 0 = 0
• Case 1:
  – C1 = {x1, x2, x3, x4}, C2 = { }
  – Im = −(1 log 1 + 0 log 0) = 0
• Case 2:
  – C1 = {x1, x2, x3}, C2 = {x4}
  – Im = −(0.75 log 0.75 + 0.25 log 0.25) = 0.81
• Case 3:
  – C1 = {x1, x2}, C2 = {x3, x4}
  – Im = −(0.5 log 0.5 + 0.5 log 0.5) = 1

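A quick check of the three cases, as a sketch in Python (the 0 log 0 = 0 convention is handled by skipping zero probabilities):

import math

def entropy(probs):
    # I_m = -sum_i p_i log2 p_i, with the convention 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))    # Case 1: 0.0
print(entropy([0.75, 0.25]))  # Case 2: 0.8112... ≈ 0.81
print(entropy([0.5, 0.5]))    # Case 3: 1.0
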
Best Split

• If node m is pure, generate a leaf and stop; otherwise split with test t and continue recursively
• Find the test that minimizes impurity
• Impurity after a split with test t:
  – $N_{mj}$ of the $N_m$ instances take branch j
  – $N_{mj}^i$ of those belong to class $C_i$:
    $\hat{P}(C_i \mid x, m, j) \equiv p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}$
  – $I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$
• For comparison, the impurity at node m before the split (as on the previous slide):
  $I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$, with $\hat{P}(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}$

Information Gain

• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.

Information Gain and Gain Ratio

• Choosing the test that minimizes impurity maximizes the information gain (IG):
  $IG_m^t = I_m - I_m^t$
  where $I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$ and $I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$
• Information gain prefers features with many values
• The normalized version is called gain ratio (GR):
  $GR_m^t = \frac{IG_m^t}{V_m^t}$, where $V_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \log_2 \frac{N_{mj}}{N_m}$

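A minimal sketch in Python of both quantities for one categorical split; the function name and the toy data are illustrative, not from the lecture.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain_and_ratio(values, labels):
    # Group the node's labels by the feature value taking branch j
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    split_impurity = sum(len(g) / n * entropy(g) for g in groups.values())   # I_m^t
    ig = entropy(labels) - split_impurity                                    # IG_m^t
    v_t = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())  # V_m^t
    return ig, (ig / v_t if v_t > 0 else 0.0)                                # GR_m^t

# A feature with one distinct value per instance gets maximal IG
# but is penalized by GR:
print(information_gain_and_ratio(["v1", "v2", "v3", "v4"], ["A", "A", "B", "B"]))
# -> (1.0, 0.5)
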
Pruning Trees

• Decision trees are susceptible to overfitting
• Remove subtrees for better generalization:
  – Prepruning: early stopping (e.g., with an entropy threshold)
  – Postpruning: grow the whole tree, then prune subtrees
• Prepruning is faster; postpruning is more accurate (requires a separate validation set)

Rule Extraction from Trees

[Figure: C4.5Rules (Quinlan, 1993)]

Learning Rules

• Rule induction is similar to tree induction, but
  – tree induction is breadth-first
  – rule induction is depth-first (one rule at a time)
• Rule learning:
  – A rule is a conjunction of terms (cf. a tree path)
  – A rule covers an example if all terms of the rule evaluate to true for the example (cf. a sequence of tests)
  – Sequential covering: generate rules one at a time until all positive examples are covered
  – IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)

Properties of Decision Trees

• Decision trees are appropriate for classification when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Interpretation of the learned model is important (rules)
• Inductive bias of (most) decision tree learners:
  1. Prefers trees with informative attributes close to the root
  2. Prefers smaller trees over bigger ones (with pruning)
  3. Preference bias (incomplete search of a complete space)

K-NEAREST NEIGHBORS

Nearest Neighbor Classification

• An old idea
• Key components:
  – Storage of old instances
  – Similarity-based reasoning to new instances

"This 'rule of nearest neighbor' has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient." (Fix and Hodges, 1952)

k-Nearest Neighbour

• Learning:
  – Store the training instances in memory
• Classification (see the sketch below):
  – Given a new test instance x,
    • Compare it to all stored instances
    • Compute a distance between x and each stored instance $x^t$
    • Keep track of the k closest instances (the nearest neighbors)
  – Assign to x the majority class of the k nearest neighbours
• A geometric view of learning
  – Proximity in (feature) space → same class
  – The smoothness assumption

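As a sketch, the whole classifier fits in a few lines of Python; here the training set is a list of (feature vector, label) pairs and the distance function is a parameter (all names are illustrative):

from collections import Counter

def knn_classify(x, training_set, k, distance):
    # Compare x to all stored instances and keep the k closest
    neighbors = sorted(training_set, key=lambda inst: distance(x, inst[0]))[:k]
    # Assign the majority class of the k nearest neighbors
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# e.g. with the overlap distance (count of mismatching features):
overlap = lambda x, z: sum(a != b for a, b in zip(x, z))
train = [(("a", "b", "a", "c"), "A"), (("a", "b", "c", "a"), "B")]
print(knn_classify(("a", "b", "b", "a"), train, k=1, distance=overlap))  # -> B
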
Eager and Lazy Learning

• Eager learning (e.g., decision trees)
  – Learning – induce an abstract model from data
  – Classification – apply the model to new data
• Lazy learning (a.k.a. memory-based learning)
  – Learning – store data in memory
  – Classification – compare new data to the data in memory
  – Properties:
    • Retains all the information in the training set – no abstraction
    • Complex hypothesis space – suitable for natural language?
    • Main drawback – classification can be very inefficient

Dimensions of a k-NN Classifier

• Distance metric
  – How do we measure distance between instances?
  – Determines the layout of the instance space
• The k parameter
  – How large a neighborhood should we consider?
  – Determines the complexity of the hypothesis space

Distance Metric 1

• Overlap = count of mismatching features

  $\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$

  $\delta(x_i, z_i) = \begin{cases} \frac{|x_i - z_i|}{\max_i - \min_i} & \text{if numeric} \\ 0 & \text{else, if } x_i = z_i \\ 1 & \text{if } x_i \neq z_i \end{cases}$

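A sketch of this metric in Python; for the numeric case the feature range (max_i, min_i) must be supplied, and here it defaults to [0, 1] purely for illustration:

def delta(xi, zi, lo=0.0, hi=1.0):
    # Numeric feature: scaled absolute difference |x_i - z_i| / (max_i - min_i)
    if isinstance(xi, (int, float)):
        return abs(xi - zi) / (hi - lo)
    # Categorical feature: 0 on a match, 1 on a mismatch
    return 0.0 if xi == zi else 1.0

def overlap_distance(x, z):
    # Delta(x, z) = sum_{i=1}^m delta(x_i, z_i)
    return sum(delta(a, b) for a, b in zip(x, z))

print(overlap_distance(("a", "b"), ("a", "c")))  # -> 1.0
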
Distance Metric 2

• MVDM = Modified Value Difference Metric

  $\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$

  $\delta(x_i, z_i) = \sum_{j=1}^{K} \left| P(C_j \mid x_i) - P(C_j \mid z_i) \right|$

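The class-conditional probabilities $P(C_j \mid v)$ are typically estimated from value/class co-occurrence counts in the training data; a minimal sketch under that assumption (function names are illustrative):

from collections import Counter, defaultdict

def mvdm_table(feature_values, labels):
    # Estimate P(C_j | v) for each value v of one feature from counts
    counts = defaultdict(Counter)
    for v, y in zip(feature_values, labels):
        counts[v][y] += 1
    classes = sorted(set(labels))
    return {v: [c[y] / sum(c.values()) for y in classes]
            for v, c in counts.items()}

def mvdm_delta(xi, zi, table):
    # delta(x_i, z_i) = sum_j |P(C_j | x_i) - P(C_j | z_i)|
    return sum(abs(p - q) for p, q in zip(table[xi], table[zi]))

table = mvdm_table(["a", "a", "b", "c"], ["A", "B", "B", "B"])
print(mvdm_delta("a", "b", table))  # -> 1.0: P(·|a)=[0.5, 0.5] vs P(·|b)=[0.0, 1.0]
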
The k parameter

• Tunes the complexity of the hypothesis space
  – If k = 1, every instance has its own neighborhood
  – If k = N, the whole feature space is one neighborhood
• Error of hypothesis h on a validation set V of M instances:
  $\hat{E} = E(h \mid V) = \sum_{t=1}^{M} 1\big(h(x^t) \neq r^t\big)$

[Figure: decision boundaries for k = 1 and k = 15]

A Simple Example

Overlap distance, $\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$, with $\delta$ as defined on the Distance Metric 1 slide.

Training set:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A

New instance:
5. (a, b, b, a)

Distances (overlap):
Δ(1, 5) = 2
Δ(2, 5) = 1
Δ(3, 5) = 4
Δ(4, 5) = 3

k-NN classification:
1-NN(5) = B
2-NN(5) = A/B (tie)
3-NN(5) = A
4-NN(5) = A

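The distance table above can be reproduced with a few lines of Python, using the categorical overlap metric (a sketch; note that the 2-NN tie between A and B would need a tie-breaking rule):

overlap = lambda x, z: sum(a != b for a, b in zip(x, z))

train = [(("a", "b", "a", "c"), "A"),
         (("a", "b", "c", "a"), "B"),
         (("b", "a", "c", "c"), "C"),
         (("c", "a", "b", "c"), "A")]
query = ("a", "b", "b", "a")

for i, (x, _) in enumerate(train, start=1):
    print(f"Delta({i}, 5) = {overlap(x, query)}")  # -> 2, 1, 4, 3

# Sorted by distance: instances 2 (B), 1 (A), 4 (A), 3 (C),
# so 1-NN = B and 3-NN = majority of {B, A, A} = A
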
Further Variations on k-NN

• Feature weights:
  – The overlap metric gives all features equal weight
  – Features can be weighted by IG or GR
• Weighted voting (see the sketch below):
  – The normal decision rule gives all neighbors equal weight
  – Instances can be weighted by (inverse) distance

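A sketch of inverse-distance weighted voting, assuming the neighbors arrive as (distance, label) pairs; the epsilon guard is an illustrative choice to handle neighbors at distance zero:

from collections import defaultdict

def weighted_vote(neighbors):
    # Each neighbor votes with weight 1 / (distance + eps) instead of weight 1
    eps = 1e-9  # avoids division by zero when a neighbor is at distance 0
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d + eps)
    return max(scores, key=scores.get)

# With k = 3, plain majority voting would pick A (two votes to one),
# but the single very close B neighbor outweighs them:
print(weighted_vote([(1, "B"), (2, "A"), (3, "A")]))  # -> B
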
Properties of k-NN

• Nearest neighbor classification is appropriate when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Fast classification is not crucial
• Inductive bias of k-NN:
  1. Nearby instances should have the same label (the smoothness assumption)
  2. All features are equally important (without feature weights)
  3. Complexity is tuned by the k parameter

End of Lecture 8