Outliers and Inconsistency

Inconsistency and Outliers
Active Learning by Outlier Detection

Inconsistency Robustness Symposium 2011

Neil Rubens
Assistant Professor
University of Electro-Communications
Tokyo, Japan

Outline

Inconsistency Robustness is a multi-disciplinary issue.

We discuss some aspects of Inconsistency Robustness from the perspective of Machine Learning:

•  What is Inconsistency
•  Can Inconsistency be Useful
•  Measuring Inconsistency

Outlier Types

•  Spatial Outlier
–  unlabeled data

Our Focus

•  Model Outlier
–  labeled data
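The distinction between the two outlier types can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the straight-line model, the two-standard-deviation cutoff, and the toy data are all hypothetical and do not come from the slides. A spatial outlier is found from the inputs alone (no labels), while a model outlier needs labels because it is defined by disagreement with a fitted model.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (assumed model for illustration)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def spatial_outliers(xs, z=2.0):
    """Indices whose input lies far from the other inputs; labels are not used."""
    mx, sx = mean(xs), pstdev(xs)
    return [i for i, x in enumerate(xs) if abs(x - mx) > z * sx]

def model_outliers(xs, ys, z=2.0):
    """Indices whose label disagrees with the fitted model; labels are required."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    s = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > z * s]

# Toy data: y = 2x everywhere except index 3; index 8 is far away in x but on the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 30]
ys = [2, 4, 6, 20, 10, 12, 14, 16, 60]

print(spatial_outliers(xs))      # the point that is unusual in input space
print(model_outliers(xs, ys))    # the point that is unusual relative to the model
```

In this toy data the two notions pick out different points: the far-away input agrees with the model, and the point that disagrees with the model sits in the middle of the inputs.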

Causes of Outliers

•  Faulty data
–  Entry error, malfunction, etc.
•  Chance/Deviation
•  Incorrect Model

Our Focus

http://www.dkimages.com/discover/previews/852/20223083.JPG

Typical Treatment of Outliers

•  Assume that the learned model is correct and discard points that don’t agree with the model

Our Focus

Atypical Treatment of Outliers

•  Assume that the data is right, and that the model is wrong
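A minimal sketch contrasting the typical and the atypical treatment, assuming a toy setting that is not from the slides: the data follows y = x^2, the current model is a straight line, and "changing the model" means swapping the feature x for x^2. All names (`fit_line`, `mse`, `typical_error`, `atypical_error`) are hypothetical.

```python
from statistics import mean

def fit_line(us, ys):
    """Least-squares fit of y = a*u + b for a single feature u."""
    mu, my = mean(us), mean(ys)
    a = sum((u - mu) * (y - my) for u, y in zip(us, ys)) / sum((u - mu) ** 2 for u in us)
    return a, my - a * mu

def mse(us, ys, a, b):
    """Mean squared error of the line y = a*u + b on the given data."""
    return mean([(y - (a * u + b)) ** 2 for u, y in zip(us, ys)])

xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x * x for x in xs]  # assumption: the data is right and follows y = x^2

# Typical treatment: trust the linear model, drop the two worst-fitting points, refit.
a, b = fit_line(xs, ys)
by_residual = sorted(range(len(xs)), key=lambda i: abs(ys[i] - (a * xs[i] + b)))
discarded = by_residual[-2:]
kept = [i for i in range(len(xs)) if i not in discarded]
a2, b2 = fit_line([xs[i] for i in kept], [ys[i] for i in kept])
typical_error = mse(xs, ys, a2, b2)  # evaluated on all of the data

# Atypical treatment: trust the data and change the model (feature x^2 instead of x).
us = [x * x for x in xs]
a3, b3 = fit_line(us, ys)
atypical_error = mse(us, ys, a3, b3)

print(sorted(discarded), typical_error, atypical_error)
```

In this sketch the discarded points are the two extremes, which carry the strongest evidence of curvature, and the refitted line still misfits the data. Changing the model removes the disagreement altogether.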

Obtaining Data could be “COSTLY”

•  Medicine
–  diagnosis: pain, time, $
–  drug discovery: $$$, time
•  User Interaction
–  effort, time
•  Expertise Elicitation
–  $, time

Practicality:

•  Due to the abundance of data, one may mistakenly dismiss this problem as impractical
•  While unlabeled data is abundant, labeled data is rather scarce
•  Even if the overall amount of labeled data is large enough, there may still be a need for additional labeled data to enable personalization
•  Moreover, obtaining labeled data could be expensive

[Figure: Unlabeled Data, Sampling, Multiple Hypothesis, Hypothesis/Model Selection, f(x, θ)]

•  Consistent Sample
–  Little is learned
–  Does not allow reducing the # of hypotheses
•  Inconsistent Sample
–  Will not agree with some of the hypotheses (regardless of the output values)
–  Number of hypotheses is reduced

Not all of the outliers are bad

•  The goal of machine learning is to learn an accurate predictive model from the data
•  Data that is inconsistent with the learned model and/or existing data is referred to as an outlier
•  The learned model is often assumed to be approximately correct and to require just some tweaking
•  However, if the current model is inaccurate, it should be changed significantly, instead of ignoring the incompatibility and continuing to make minor tweaks
•  This phenomenon occurs frequently during the early stages of the learning process [7], [6], or in a non-stationary environment in which changes may occur in the underlying model [2]
•  If we discard outliers, we might be discarding the most informative data points

Rubens et al., AJS 2011
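The idea that a consistent sample teaches little while an inconsistent sample reduces the number of hypotheses can be sketched with a toy hypothesis set. Everything here is an assumption made for illustration: the hypotheses are simple threshold classifiers, and the names `predict`, `is_inconsistent`, and `update` are hypothetical, not the method of the cited paper.

```python
def predict(t, x):
    """Threshold hypothesis: label 1 if x >= t, else 0."""
    return 1 if x >= t else 0

def is_inconsistent(hypotheses, x):
    """True if the current hypotheses disagree about the label of x."""
    return len({predict(t, x) for t in hypotheses}) > 1

def update(hypotheses, x, y):
    """Keep only the hypotheses that agree with the labeled sample (x, y)."""
    return [t for t in hypotheses if predict(t, x) == y]

hypotheses = list(range(1, 10))  # candidate thresholds 1..9

# Consistent sample: every hypothesis predicts 1 for x = 20, so its label removes nothing.
after_consistent = update(hypotheses, 20, 1)

# Inconsistent sample: hypotheses disagree on x = 5, so either label removes some of them.
after_label_0 = update(hypotheses, 5, 0)
after_label_1 = update(hypotheses, 5, 1)

print(len(hypotheses), len(after_consistent), len(after_label_0), len(after_label_1))
```

Whatever label the inconsistent sample receives, at least one hypothesis is eliminated, which is why such samples are attractive candidates for labeling.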

Model Selection

(a) under-fit (b) over-fit (c) appropriate fit

Figure 8: Dependence between model complexity and accuracy.

If there is no inconsistency between the training and testing data, then the most complex model would tend to be selected.
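A small sketch of this effect, under assumptions that are not from the slides: the models are piecewise-constant fits with an increasing number of bins (a hypothetical stand-in for model complexity), the data is a noisy sine curve, and the names `fit_bins` and `mse` are illustrative. Scoring on the training data itself means there is no inconsistency between training and testing data, so the most complex model wins.

```python
import math
import random

def fit_bins(xs, ys, k):
    """Piecewise-constant model: mean label in each of k equal-width bins on [0, 1)."""
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        j = int(x * k)
        sums[j] += y
        counts[j] += 1
    return [s / c for s, c in zip(sums, counts)]

def mse(model, xs, ys):
    """Mean squared error of a binned model on the given data."""
    k = len(model)
    return sum((y - model[int(x * k)]) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def noisy(x):
    """Assumed ground truth: a sine curve plus Gaussian noise."""
    return math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)

x_train = [(i + 0.5) / 32 for i in range(32)]
y_train = [noisy(x) for x in x_train]
x_test = [(i + 0.25) / 32 for i in range(32)]
y_test = [noisy(x) for x in x_test]

complexities = [1, 2, 4, 8, 16, 32]  # number of bins
models = {k: fit_bins(x_train, y_train, k) for k in complexities}
train_err = {k: mse(models[k], x_train, y_train) for k in complexities}
test_err = {k: mse(models[k], x_test, y_test) for k in complexities}

best_on_train = min(complexities, key=train_err.get)
best_on_test = min(complexities, key=test_err.get)
print(best_on_train, best_on_test)
```

With 32 bins every training point gets its own bin, so the training error is exactly zero and that model is always chosen on training data. The held-out data disagrees with it, and that disagreement is the signal model selection relies on.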

Change Detection / Model Correction

Is inconsistency caused by noise (or minor factors) or by changes in the underlying model?

–  Applications: medical diagnostics, intrusion detection, network analysis, finance

http://www.satimagingcorp.com/galleryimages/high-resolution-landsat-satellite-imagery-oman.jpg
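One simple way to separate noise from a change in the underlying model is to look at whether the disagreement persists. The sketch below is an assumption-laden illustration, not the approach of the talk: it watches the residuals of the current model and flags a change only when their windowed mean stays large. The function name `detect_change`, the window size, and the threshold are hypothetical.

```python
def detect_change(residuals, window=5, threshold=1.0):
    """Return the first index at which the mean of the last `window` residuals
    exceeds `threshold` in absolute value, or None if that never happens."""
    for end in range(window, len(residuals) + 1):
        recent = residuals[end - window:end]
        if abs(sum(recent) / window) > threshold:
            return end - 1
    return None

# A single large residual surrounded by small ones: treated as noise.
noise_only = [0.1, -0.2, 3.0, 0.0, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1]

# Residuals that shift and stay shifted: treated as a change in the underlying model.
model_changed = [0.1, -0.2, 0.0, 0.1, -0.1, 2.0, 2.1, 1.9, 2.2, 2.0]

print(detect_change(noise_only))
print(detect_change(model_changed))
```

An isolated spike is averaged away by the window, while a sustained shift pushes the windowed mean over the threshold after a few samples. The window and threshold trade detection delay against false alarms.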

Conclusion

•  Inconsistency could be useful for:
–  Hypothesis Learning
–  Model Selection
–  Model Correction

Neil Rubens
Assistant Professor
Active Intelligence Group
Laboratory for Knowledge Computing
University of Electro-Communications
Tokyo, Japan

http://ActiveIntelligence.org
