2. • Semi-supervised Learning?
• Scarcity of Training Data
• What are constraints?
• How/why do they help?
3. Supervised learning
Labelled Data: (X1 → Y1) (X2 → Y2) (X3 → Y3) …… (Xn → Yn).
What if n is small? Obtaining training data is costly, and it can be inefficient.
Example: fraud detection / anomaly detection. Domain expertise helps……
4. Definitions
• X = (X1, X2, X3, X4, …, Xn)
• Y = (Y1, Y2, Y3, Y4, …, Yn)
• H : X → Y is a classifier.
• f : X × Y → R (the set of real numbers)
• The output of the classifier is the y that maximizes the value of the function f
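In symbols, this decision rule is the standard structured argmax (the notation below is mine, not the slide's):

y^* = \arg\max_{y \in Y} f(x, y)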
5. • Classification function..
• It's a linear sum of feature functions
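Written out, a linear sum of feature functions is the familiar linear scoring model (the weights λ_i and feature functions φ_i are my notation; this is the standard form used in, e.g., CRFs and structured perceptrons):

f(x, y) = \sum_i \lambda_i \, \phi_i(x, y)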
7. Can we exploit knowledge of constraints in the Inference Phase?
• Let's assume n items (observations) in a sequence and p labels, i.e., n tokens and p parts of speech, or n tokens and p tags in an NER task.
Brute force: O(p^n). Viterbi: O(n·p^2).
Can we go down further? Can we reduce our search space even more?
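As a concrete reference point, here is a minimal Viterbi decoder; the emission and transition score matrices are assumed inputs, not part of the deck:

import numpy as np

def viterbi(emit, trans):
    # emit: (n, p) per-token label scores; trans: (p, p) transition scores.
    # Finds the highest-scoring tag sequence in O(n * p^2) time,
    # versus O(p^n) for brute-force enumeration of all sequences.
    n, p = emit.shape
    score = emit[0].copy()               # best score ending in each tag so far
    back = np.zeros((n, p), dtype=int)   # backpointers
    for t in range(1, n):
        cand = score[:, None] + trans + emit[t][None, :]   # (p, p) candidates
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]         # recover the best path via backpointers
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]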
8. Introducing constraints into the Model
• Let C1, C2, ……, CK be the constraints
• C : X × Y → {0, 1}
• Constraints are of two types:
• Hard (MUST be satisfied)
• Soft (Can be relaxed)
• 1C(x) is the set of label sequences that DON'T violate the constraints
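A hard constraint is just an indicator function, and 1C(x) is the set it carves out of the search space. A sketch, with a made-up BIO-style constraint ("no I without a preceding B or I"):

from itertools import product

def no_orphan_I(x, y):
    # C(x, y) -> {0, 1}: an 'I' tag may not follow an 'O' or start the sequence.
    prev = "O"
    for tag in y:
        if tag == "I" and prev == "O":
            return 0
        prev = tag
    return 1

def feasible_set(x, labels=("B", "I", "O")):
    # 1C(x): all label sequences over x that do NOT violate the constraint.
    return [y for y in product(labels, repeat=len(x)) if no_orphan_I(x, y)]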
9. Constraints come to the rescue
• Let's say x out of the X possible tag sequences violate the constraints.
• The search space shrinks from X to X - x.
• How do we infer?
• Does Viterbi help us?
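One simple answer (an illustrative choice, not necessarily the deck's): generate candidates in score order, e.g. with k-best Viterbi or beam search, and keep the first one that satisfies every constraint:

def constrained_decode(candidates, constraints):
    # candidates: iterable of (score, y) pairs, best first.
    # constraints: list of indicator functions C(y) -> {0, 1}.
    for score, y in candidates:
        if all(c(y) for c in constraints):
            return y, score
    return None, float("-inf")   # no feasible sequence among the candidates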
10. Example

      A    B    C    D    E    F    G
S1    X1   X1   X1   X1   X1   X1   X1
S2    X10  X10  X10  X10  X10  X10  X10
S3    X11  X11  X11  X11  X11  X11  X11

Motivational Interviewing: At least ONE reflection
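The Motivational Interviewing rule above is a hard constraint of exactly the C(x, y) → {0, 1} form; a one-line sketch (the label name REFLECTION is a placeholder):

def at_least_one_reflection(y):
    # 1 if at least one utterance in the sequence is tagged as a reflection.
    return 1 if any(tag == "REFLECTION" for tag in y) else 0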
11. Soft constraints
How do we calculate distance here?
How do we learn the parameters?
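One standard formulation (from Constrained Conditional Models; the penalty weights ρ_k and the distance d are that framework's notation, not the slide's) subtracts a weighted violation distance from the score:

y^* = \arg\max_y \Big( f(x, y) - \sum_{k=1}^{K} \rho_k \, d\big(y, 1_{C_k}(x)\big) \Big)

A common choice for d is the minimal Hamming distance from y to any sequence in 1_{C_k}(x); the ρ_k can be set by hand or estimated from data.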
12. Lars Ole Andersen. Program Analysis and Specialization for the C Programming Language. PhD Thesis, DIKU, University of Copenhagen, May 1994.

This is the ground truth segmentation into citation fields (author, title, institution, date). But the HMM segments the same string incorrectly, splitting the fields at the wrong boundaries:

Lars Ole Andersen. Program Analysis and Specialization for the C Programming Language. PhD Thesis, DIKU, University of Copenhagen, May 1994.
15. Top-k inference
We choose only the top few possible sequences and add ALL of them to the training data. The author used beam-search decoding, but this can be done with any inference procedure.
From the unlabeled sample, we label the examples and include them in the training data.
Choice: we may include only the high-confidence samples.
Pitfall: then we don't really learn properly and miss out on some characteristics.
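A sketch of this top-k self-training loop (the model.fit / model.kbest interfaces are placeholders, not a specific library):

def self_train(model, labeled, unlabeled, k=5, rounds=3):
    # Repeatedly label the unlabeled pool with the current model, keep the
    # top-k label sequences per example (ALL of them, not just the 1-best),
    # and retrain on the augmented data.
    data = list(labeled)
    for _ in range(rounds):
        model.fit(data)            # train on the current (pseudo-)labeled data
        data = list(labeled)       # restart from the gold labels...
        for x in unlabeled:
            # k-best decoding; the author used beam search, but any
            # inference procedure that yields k candidates works.
            for score, y in model.kbest(x, k):
                data.append((x, y))   # ...plus fresh top-k pseudo-labels
    return model

Keeping all k candidates, rather than only the most confident ones, is exactly the hedge against the pitfall above: the model keeps seeing the characteristics that low-confidence sequences carry.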