Machine learning workshop
guodong@hulu.com

Machine learning introduction
Logistic regression
Feature selection
Boosting, tree boosting

See more machine learning posts: http://dongguo.me

Outline
•  Introduction
•  Typical feature selection methods
•  Feature selection in logistic regression
•  Tips and conclusion

What's/why feature selection
•  A procedure in machine learning to find a subset of features that produces a ‘better’ model for a given dataset
  –  Avoid overfitting and achieve better generalization ability
  –  Reduce the storage requirement and training time
  –  Interpretability

When feature selection is important
•  Noisy data
•  Lots of low-frequency features
•  Use of multi-type features
•  Too many features compared to the number of samples
•  Complex model
•  Samples in the real scenario are inhomogeneous with the training & test samples

When No.(samples)/No.(features) is large
•  Feature selection with Gini indexing
•  Algorithm: logistic regression
•  Training samples: 640K; test samples: 49K
•  Features: watch behavior of audiences; show level (11,327 features)
[Chart: AUC of L1-LR and L2-LR (0.80-0.83) vs. ratio of features used (all, 80%, 70%, ..., 10%)]

When No.(samples) equals No.(features)
•  L1 logistic regression
•  Training samples: 50K; test samples: 49K
•  Features: watch behavior of audiences; video level (49,166 features)
[Chart: how AUC changes with the number of features selected; AUC 0.728-0.736 vs. ratio of features used (all, 90%, 80%, ..., 10%)]

Typical methods for feature selection
•  Categories

             Single feature evaluation         Subset selection
   filter    MI, IG, KL-D, GI, CHI             Category distance, ...
   wrapper   Ranking accuracy using a          For LR (SFO, Grafting)
             single feature

•  Single feature evaluation
  –  Frequency based, mutual information, KL divergence, Gini indexing, information gain, Chi-square statistic
•  Subset selection methods
  –  Sequential forward selection
  –  Sequential backward selection

Single feature evaluation
•  Measure quality of features by all kinds of metrics
  –  Frequency based
  –  Dependence of feature and label (co-occurrence)
     •  mutual information, Chi-square statistic
  –  Information theory
     •  KL divergence, information gain
  –  Gini indexing

Frequency based
•  Remove features according to the frequency of the feature, or the number of instances containing the feature
•  Typical scenario
  –  Text mining

Mutual information
•  Measure the dependence of two random variables
•  Definition
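
For reference, the standard definition is I(X;Y) = sum over (x, y) of p(x,y) ln[ p(x,y) / (p(x) p(y)) ]. Below is a minimal sketch for scoring a discrete feature against the class label; the function name and inputs are illustrative, not from the deck.

```python
import numpy as np

def mutual_information(x, y):
    """I(X; Y) between two discrete variables, estimated from empirical counts:
    sum over (x, y) of p(x, y) * ln(p(x, y) / (p(x) * p(y)))."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Example: mutual_information([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]) scores how strongly
# a binary feature co-varies with a binary label; higher = stronger dependence.
```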
  
Chi-Square Statistic
•  Measure the dependence of two variables
  –  A: number of times feature t and category c co-occur
  –  B: number of times t occurs without c
  –  C: number of times c occurs without t
  –  D: number of times neither c nor t occurs
  –  N: total number of instances
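
From these counts, the usual 2x2 form is chi^2(t, c) = N (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D)). A minimal sketch; the function name is illustrative.

```python
def chi_square(A, B, C, D):
    """Chi-square statistic of feature t and category c from the 2x2 co-occurrence
    counts defined on the slide; larger values mean stronger dependence."""
    N = A + B + C + D
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator if denominator else 0.0
```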
  
Entropy
•  Characterize the (im)purity of a collection of examples

$Entropy(S) = -\sum_i P_i \ln P_i$

Information Gain
•  Reduction in entropy caused by partitioning the examples according to the attribute
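
For reference, the usual form is Gain(S, A) = Entropy(S) - sum over v in Values(A) of |S_v|/|S| * Entropy(S_v). A minimal sketch of both quantities; names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i P_i * ln(P_i), over the class distribution of S."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in entropy from partitioning the examples by a discrete feature."""
    n = len(labels)
    partitions = {}
    for v, label in zip(feature_values, labels):
        partitions.setdefault(v, []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder
```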
  
KL divergence
•  Measure the difference between two probability distributions
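
For reference, D_KL(P || Q) = sum_i P(i) ln(P(i) / Q(i)); as the editor's notes point out, it is not symmetric and is non-negative by Gibbs' inequality. A minimal sketch, assuming Q(i) > 0 wherever P(i) > 0.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i)) for two discrete distributions
    given as equal-length sequences of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```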
  
Gini indexing
•  Calculate conditional probability of f given class label
•  Normalize across all classes
•  Calculate Gini coefficient
•  For the two-category case
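
The formulas behind these steps are not reproduced here; the sketch below is one common formulation that follows the three steps for a binary document-feature matrix, so treat it as an illustration rather than the deck's exact definition.

```python
import numpy as np

def gini_index_scores(X, y):
    """Gini-index feature scores following the slide's three steps.
    X: binary (n_samples, n_features) matrix of feature occurrences; y: class labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    # Step 1: conditional probability of each feature given each class, P(f | c)
    p_f_given_c = np.vstack([X[y == c].mean(axis=0) for c in classes])
    # Step 2: normalize across classes so each feature's column sums to one
    normalized = p_f_given_c / (p_f_given_c.sum(axis=0, keepdims=True) + 1e-12)
    # Step 3: Gini coefficient of the normalized distribution; higher = more class-specific
    return (normalized ** 2).sum(axis=0)
```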
  
Comparison in text categorization (1)
•  A comparative study on feature selection in text categorization (ICML’97)

Comparison in text categorization (2)
•  Feature selection for text classification based on Gini Coefficient of Inequality (JMLR’03)

Shortcomings of single feature evaluation
•  Relevance between features is ignored
  –  Features could be redundant
  –  A feature that is completely useless by itself can provide a significant performance improvement when taken with others
  –  Two features that are useless by themselves can be useful together

Shortcomings of single feature evaluation (2)
•  A feature that is completely useless by itself can provide a significant performance improvement when taken with others

Shortcomings of single feature evaluation (3)
•  Two features that are useless by themselves can be useful together

Subset selection methods
•  Select subsets of features that together have good predictive power, as opposed to ranking features individually
•  Always proceed by adding new features to the existing set or removing features from the existing set (a forward-selection sketch follows this slide)
  –  Sequential forward selection
  –  Sequential backward selection
•  Evaluation
  –  Category distance measurement
  –  Classification error
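
A minimal sketch of sequential forward selection; `evaluate` is a caller-supplied scorer (e.g., validation accuracy or a category-distance measure) and is illustrative.

```python
def sequential_forward_selection(candidate_features, evaluate, k):
    """Greedily grow a feature set: at each step, add the candidate whose inclusion
    gives the best evaluate(subset) score, until k features have been selected."""
    selected = []
    remaining = list(candidate_features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Sequential backward selection is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts the score least.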
  
Category distance measurement
•  Select the feature subset with a large category distance

Wrapper methods for logistic regression
•  Forward feature selection
  –  Naïve method
     •  needs to build a number of models quadratic in the number of features
  –  Grafting
  –  Single feature optimization (SFO)

SFO (Singh et al., 2009)
•  Only optimize the coefficient of the new feature
•  Only need to iterate over instances that contain the new feature
•  Also fully relearn one new model with the selected feature included
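
A minimal sketch of the single-feature step, not the authors' implementation: holding the existing model fixed, fit only the candidate feature's coefficient with Newton updates over the instances that contain it; the resulting log-likelihood gain can then rank candidates.

```python
import numpy as np

def sfo_candidate_coefficient(base_scores, f_values, y, n_iter=10):
    """base_scores: existing model's scores w.x on the instances containing the
    candidate feature; f_values: the feature's values there; y: labels in {0, 1}.
    Returns the coefficient fitted for the candidate feature alone."""
    base_scores = np.asarray(base_scores, dtype=float)
    f_values = np.asarray(f_values, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(base_scores + beta * f_values)))
        grad = np.dot(f_values, y - p)               # d loglik / d beta
        hess = -np.dot(f_values ** 2, p * (1 - p))   # d^2 loglik / d beta^2
        if hess == 0:
            break
        beta -= grad / hess                          # Newton step
    return beta
```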
  
Grafting (Perkins 2003)
•  Use the loss function's gradient with respect to the new feature to decide whether to add the feature
•  At each step, the feature with the largest gradient is added
•  The model is fully relearned after each feature is added
  –  Need to build only D models overall
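
A minimal sketch of the gradient test, assuming a logistic loss: every unused feature is scored by the magnitude of the loss gradient with respect to its (still-zero) weight at the current model, and the top-scoring feature is the one to add before fully relearning.

```python
import numpy as np

def grafting_gradient_scores(X, y, scores, unused):
    """X: (n_samples, n_features) matrix; y: labels in {0, 1};
    scores: current model's scores w.x; unused: indices of features not yet added.
    Returns {feature index: |d loglik / d w_j|} evaluated at w_j = 0."""
    X = np.asarray(X, dtype=float)
    p = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))
    residual = np.asarray(y, dtype=float) - p   # gradient of the log-likelihood w.r.t. the scores
    return {j: abs(np.dot(X[:, j], residual)) for j in unused}
```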
  
Experimentation
•  Percent improvement of log-likelihood on the test set
•  Both SFO and Grafting are easily parallelized

Summary
•  Categories

             Single feature evaluation         Subset selection
   filter    MI, IG, KL-D, GI, CHI             Category distance, ...
   wrapper   Ranking accuracy using a          For LR (SFO, Grafting)
             single feature

•  Filter + single feature evaluation
  –  Less time consuming, usually works well
•  Wrapper + subset selection
  –  Higher accuracy, but prone to overfitting

Tips about feature selection
•  Remove features that could not occur in the real scenario
•  If a feature makes no contribution, the fewer features the better
•  Use L1 regularization for logistic regression (see the sketch below)
•  Use the random subspace method
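
A minimal sketch of the L1-regularization tip using scikit-learn; the data below is a random placeholder standing in for a real training set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: replace with the real design matrix and labels.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 200))
y_train = rng.integers(0, 2, size=1000)

# L1 regularization drives uninformative coefficients exactly to zero, so feature
# selection falls out of the fitted model; smaller C means stronger regularization
# and fewer surviving features.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X_train, y_train)

selected = np.flatnonzero(model.coef_[0])
print(f"kept {selected.size} of {X_train.shape[1]} features")
```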
  
References
•  Feature Selection for Classification (IDA’97)
•  An Introduction to Variable and Feature Selection (JMLR’03)
•  Feature Selection for Text Classification Based on Gini Coefficient of Inequality (JMLR’03)
•  A Comparative Study on Feature Selection in Text Categorization (ICML’97)
•  Scaling Up Machine Learning


Editor's Notes

  • #4 Why can samples of different categories be separated? Separated well -> smaller classification error. Different features make different contributions.
  • #5 Noisy data. Lots of low-frequency features: e.g., using ad-id as a feature overfits easily. Multi-type features. Too many features compared to samples: feature number > sample number; feature combinations. Complex model: e.g., ANN. Samples to be predicted are inhomogeneous with the training & test samples: demographic targeting; time-series related.
  • #8 Key points: “how to measure the quality of features” and “whether and how to use the underlying algorithms”. 1. The optimal feature set could only be selected through an exhaustive method; 2. In all existing feature selection methods, the feature set is generated by adding or removing some features from the set of the last step.
  • #14 Decision tree
  • #15 Not a true metric for distance measurement, because it is not symmetric. Cannot be negative (Gibbs’ inequality). Used in topic models.
  • #19 Features could be redundant: videoId, contentId
  • #24 With 1,000 features, at a cost of about 1 second to build one model on average, this would take about 1 week.
  • #27 Both