2. Overview
• MT decoding: (ê, d̂) = argmax_(e,d) wᵀh(f, e, d)
• Need to find w that assigns higher scores to better translations (e, d)
• Better translations = translations with lower error
f: source sentence, e: target sentence, d: derivation
w: weight vector, h(・): feature function
(a scoring sketch follows below)
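To make the scoring rule concrete, here is a minimal sketch of linear-model decoding over an enumerated candidate list; the candidate set and feature values are made up, and a real decoder searches a vast space rather than a short list:

```python
import numpy as np

def score(w, h):
    """Linear model score wᵀh for one candidate (e, d)."""
    return float(np.dot(w, h))

def decode(w, candidates):
    """Pick the highest-scoring (e, d) from a candidate list.

    candidates: list of (e, d, h) tuples, h being the feature vector
    h(f, e, d).  A stand-in for a real decoder's search.
    """
    return max(candidates, key=lambda cand: score(w, cand[2]))

# Toy example with made-up 2-dim features:
w = np.array([0.5, -1.0])
candidates = [
    ("I saw a black cat", "d1", np.array([2.0, 0.5])),
    ("see black cat",     "d2", np.array([1.0, 1.5])),
]
print(decode(w, candidates)[0])  # "I saw a black cat"
```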
3. Loss Minimization
• Given parallel corpus (F, E), find w that minimizes loss function l(・):
  ŵ = argmin_w l(F, E; w) + λ‖w‖²  (the λ‖w‖² part is the regularization term)
• e.g., l(F, E; w) = 1 − BLEU(E, decode_w(F))
• λ is a regularization constant to avoid overfitting
(a loss sketch follows below)
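A sketch of this objective, assuming hypothetical `decode` and `corpus_bleu` hooks (neither is specified on the slide) and an L2 regularizer:

```python
import numpy as np

def regularized_loss(w, F, E, decode, corpus_bleu, lam=0.01):
    """l(F, E; w) = 1 − BLEU(E, decode_w(F)), plus a regularization term.

    decode and corpus_bleu are assumed hooks (e.g. a 1-best decoder and
    any corpus-level BLEU implementation); lam is the constant λ.
    """
    E_hat = [decode(w, f) for f in F]          # decode_w(F)
    loss = 1.0 - corpus_bleu(E, E_hat)         # error part of l(・)
    return loss + lam * float(np.dot(w, w))    # + λ‖w‖² (assumed L2)
```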
4. Problems to Consider
1. Search space is vast
  • impossible to consider all candidates
  • producing the correct translation is rarely possible
2. Approximation of error function
  • Error metrics (e.g. BLEU) are not differentiable
  • Split corpus-level metrics into sentence level
3. How to calculate argmin wᵀh
5. Batch Learning
• Given parallel corpus (F, E), initialize w and iteratively:
  1. decode whole corpus F with current w, and get k-best lists C
  2. optimize w
  3. loop until convergence
  (a loop sketch follows below)
• vs. online learning
  • optimize w per sentence
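A sketch of the batch loop, assuming hypothetical `decode_kbest` and `optimize` hooks; any of the optimizers on the later slides could fill the latter:

```python
import numpy as np

def batch_tune(F, E, w0, decode_kbest, optimize, max_iters=20):
    """Batch learning: re-decode the whole corpus, refit w, repeat."""
    w = np.asarray(w0, dtype=float)
    C = [[] for _ in F]                      # one k-best list per sentence
    for _ in range(max_iters):
        # 1. decode whole corpus F with current w, get k-best lists C
        for i, f in enumerate(F):
            C[i].extend(decode_kbest(w, f))
        # 2. optimize w on the accumulated k-best lists
        w_new = optimize(w, C, E)
        # 3. loop until convergence
        if np.allclose(w, w_new):
            break
        w = w_new
    return w
```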
6. Minimum Error Rate Training (MERT)
• Given error function error(E, Ê), directly minimize it
  • E: reference translations, Ê: system translations
  • e.g. error(E, Ê) = 1 − BLEU(E, Ê)
• In other words, ŵ = argmin_w error(E, decode_w(F))
• Since error(・) is not differentiable w.r.t. w, gradient-based methods are not applicable
  • Instead, use Powell’s method
    • gradients not required
7. Powell’s Method
• Iteratively, fix a direction, and find the optimal w in that direction
• Applicable when gradients are not available
  (a search sketch follows below)
[Figure: iterates w0 → w1 → w2 → w3, each step moving along one fixed direction at a time in the (x1, x2) plane]
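A simplified sketch of the idea: line-search along one direction at a time, with no gradients needed. Full Powell’s method also updates its direction set after each pass; this sketch keeps fixed coordinate directions, and `minimize_scalar` is just one convenient gradient-free 1-D optimizer (it assumes the objective is reasonably well behaved along each line):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_search(f, w0, n_rounds=10):
    """Powell-style search, simplified to coordinate directions:
    repeatedly fix a direction b_m and find the best step along it."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_rounds):
        for m in range(len(w)):
            b = np.zeros_like(w)
            b[m] = 1.0                                   # direction b_m
            res = minimize_scalar(lambda g: f(w + g * b))
            w = w + res.x * b                            # jump to the 1-D optimum
    return w

# e.g. minimizing a non-differentiable function:
w_opt = coordinate_search(lambda w: abs(w[0] - 1) + abs(w[1] + 2), [0.0, 0.0])
print(np.round(w_opt, 3))  # ≈ [ 1. -2.]
```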
8. Optimization in One Direction
• 1-best translation parameterized by scalar γ:
  ê(γ) = argmax_(e,d) (w + γ·bm)ᵀh = argmax_(e,d) wᵀh + γ·bmᵀh
  • bm: one-hot vector with mth dim = 1
  • each candidate’s score is a line in γ, with intercept wᵀh and slope bmᵀh
[Figure: each candidate c1–c4 is a line in γ; candidates with the highest score are selected, forming the upper envelope, and below it the error is a step function of γ over the envelope’s regions (c1, c3, c4)]
e.g.) f = 黒い 猫 を 見た
  e = I saw a black cat
  c1 = I saw black cat
  c2 = saw a black cat
  …
(a line-sweep sketch follows below)
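A sketch of the sweep for one sentence, assuming each candidate has been reduced to a triple (intercept wᵀh, slope bmᵀh, sentence error) and that at least two slopes differ; the numbers are made up:

```python
def line_sweep(cands):
    """MERT sweep along one direction: each candidate's score is a line
    intercept + γ·slope; probe one γ inside every interval of the upper
    envelope and return (γ*, min_error).

    cands: list of (intercept, slope, error) triples.
    """
    # γ values where some pair of lines crosses
    xs = sorted(
        (b2 - b1) / (a1 - a2)
        for i, (b1, a1, _) in enumerate(cands)
        for (b2, a2, _) in cands[i + 1:]
        if a1 != a2
    )
    # one probe point per interval between crossings, plus both ends
    probes = [xs[0] - 1.0] + [(x + y) / 2 for x, y in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    # error of the envelope's winning candidate at each probe
    best = min((max(cands, key=lambda c: c[0] + g * c[1])[2], g) for g in probes)
    return best[1], best[0]

gamma, err = line_sweep([(0.5, 1.0, 0.2), (1.0, -0.5, 0.4), (0.0, 2.0, 0.1)])
print(gamma, err)  # γ* in the region where the lowest-error line wins
```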
9. Corpus-level Error
• Sentence-level losses are summed to get corpus-level error (a sketch follows after the figure below)
[Figure: the sentence-level envelopes and sentence-level errors of sentence 1 and sentence 2 are added to form the multi-sentence error; γ* marks the point chosen — find the γ that minimizes the overall error!]
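Continuing the sketch above, the corpus-level objective just sums each sentence’s step-function error. Exact MERT merges the crossing points of all sentences; probing a shared grid of γ values, as here, is a simplifying assumption:

```python
def corpus_sweep(per_sentence_cands, gammas):
    """Sum per-sentence errors of the winning candidates at each probed
    γ and return the γ with the lowest total.

    per_sentence_cands: one list of (intercept, slope, error) triples
    per sentence, as in line_sweep above; gammas: probe points.
    """
    def err_at(cands, g):
        # error of the candidate on top of the envelope at γ = g
        return max(cands, key=lambda c: c[0] + g * c[1])[2]

    totals = [(sum(err_at(c, g) for c in per_sentence_cands), g) for g in gammas]
    return min(totals)[1]   # γ* minimizing the multi-sentence error
```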
10. Problems of Powell’s Method
• Sensitive to initialization of w
• Not suitable for high-dimensional feature vectors
11. Softmax Loss
• Translation probability:
  p(e, d | f; w) = exp(wᵀh(f, e, d)) / Σ_(e′,d′) exp(wᵀh(f, e′, d′))
• Loss is the negative log-likelihood of oracle translations:
  l = −Σ_i log p(e*_i, d*_i | f_i; w)
  where oracle translations are the lowest-error candidates:
  (e*, d*) = argmin_(e,d) error(E, e)
• Gradient-based methods (e.g. L-BFGS) are applicable
  (a loss/gradient sketch follows below)
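A sketch for one sentence’s k-best list; the closed-form gradient (expected features minus oracle features) is the standard softmax gradient, stated here as an assumption since the slide only shows the loss:

```python
import numpy as np

def softmax_loss_and_grad(w, kbest_feats, oracle_idx):
    """Negative log-likelihood of the oracle under p(e,d|f;w) ∝ exp(wᵀh).

    kbest_feats: (k, dim) array of feature vectors h
    oracle_idx: index of the oracle (lowest-error) candidate
    """
    scores = kbest_feats @ w
    scores = scores - scores.max()              # for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()   # softmax over the k-best
    loss = -np.log(p[oracle_idx])
    # ∂l/∂w = E_p[h] − h(oracle): expected features minus oracle features
    grad = p @ kbest_feats - kbest_feats[oracle_idx]
    return loss, grad

# ready for a gradient-based optimizer, e.g.
# scipy.optimize.minimize(fun, w0, jac=True, method="L-BFGS-B")
```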
12. Max Margin Loss
• Make sure distances between correct translations and incorrect translations are large
• For example:
  l = Σ max(0, {error(e′) − error(e*)} − {wᵀh(e*) − wᵀh(e′)})
  for all oracle and non-oracle pairs … i.e. penalize when the diff in error is greater than the diff in score
• Optimization methods for SVM are applicable (e.g. SMO)
f: 黒い猫を見た,  e (correct): I saw a black cat

                                 error   score (= wᵀh)
  e* (oracle): I saw black cat    0.1        0.4
  e  (system): see red dog        0.9        0.3

→ diff in error is large but diff in score is small: bad! (a loss sketch follows below)
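A sketch of the pairwise hinge above, reusing the table’s toy numbers; for the single pair shown, the diff in error is 0.8 but the diff in score is only 0.1, so the penalty is 0.7:

```python
import numpy as np

def max_margin_loss(scores, errors, oracle_idx):
    """Hinge loss over all oracle / non-oracle pairs:
    max(0, diff in error − diff in score)."""
    loss = 0.0
    for j in range(len(errors)):
        if j == oracle_idx:
            continue
        d_error = errors[j] - errors[oracle_idx]   # 0.9 − 0.1 = 0.8: large
        d_score = scores[oracle_idx] - scores[j]   # 0.4 − 0.3 = 0.1: small!
        loss += max(0.0, d_error - d_score)        # penalized: bad!
    return loss

print(max_margin_loss(np.array([0.4, 0.3]), np.array([0.1, 0.9]), 0))  # ≈ 0.7
```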
13. Pairwise Ranking Optimization (PRO)
• Parameter estimation as ranking problem
  • Classifier learns w to rank candidates by error
• Generate training examples from pairs of candidates
  • positive example: h(cand1) − h(cand2) = (−4, 6)
  • negative example: h(cand3) − h(cand1) = (3, −7)
• wᵀ{h(cand1) − h(cand2)} > 0 ⇔ wᵀh(cand1) > wᵀh(cand2)
• Off-the-shelf linear binary classifiers can be used
f: 黒い猫を見た,  e (correct): I saw a black cat

                               error      h      score (= wᵀh)
  e (cand1): I see black cat    0.3    (−1, 2)       ???
  e (cand2): see black dog      0.7    (3, −4)       ???
  e (cand3): see red dog        0.9    (2, −5)       ???

(a PRO sketch follows below)
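A sketch built from the table’s candidates; scikit-learn’s LogisticRegression stands in for the “off-the-shelf linear binary classifier”, and full PRO additionally samples pairs and keeps only those with a large error gap, which this sketch skips:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# k-best features and errors from the slide's table (cand1..cand3)
h = np.array([[-1, 2], [3, -4], [2, -5]], dtype=float)
err = np.array([0.3, 0.7, 0.9])

# Pairwise examples: label 1 when the first candidate has lower error,
# so positive examples are h(better) − h(worse), e.g. h(cand1) − h(cand2).
X, y = [], []
for i in range(len(h)):
    for j in range(len(h)):
        if i != j:
            X.append(h[i] - h[j])
            y.append(1 if err[i] < err[j] else 0)

# Without an intercept the decision function is exactly wᵀ(h_i − h_j).
clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
w = clf.coef_[0]
print((h @ w).argsort()[::-1])  # candidates ranked by learned score wᵀh
```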
14. Minimum Bayes Risk
• Minimize expected loss:
  l = Σ_(e,d) p(e, d | f; w) · error(E, e)
  where p(e, d | f; w) ∝ exp(γ·wᵀh(f, e, d))
• γ = 0: all candidates are equally likely
• γ = 1: softmax
• γ → ∞: highest scoring candidate with probability 1 (MERT)
• Differentiable and considers many candidates (e, d)
  (a sketch follows below)
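A sketch of the expected-loss objective over one k-best list, with γ as the temperature described on the slide:

```python
import numpy as np

def expected_loss(w, feats, errors, gamma=1.0):
    """Minimum Bayes risk objective: expected error under
    p(e,d|f;w) ∝ exp(γ·wᵀh).  γ=0 gives a uniform distribution,
    γ=1 the softmax distribution, γ→∞ approaches 1-best (MERT)."""
    s = gamma * (feats @ w)
    s = s - s.max()                   # numerical stability
    p = np.exp(s) / np.exp(s).sum()   # temperature-scaled softmax
    return float(p @ errors)          # Σ p(e,d|f) · error(E, e)
```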
15. Sentence-level BLEU
• Sentence-level error functions are needed for optimization
• BLEU is a corpus-level metric
  • 4-gram precision is often 0 at the sentence level (a sketch follows below)
  • deviates from human judgments
• Sentence-level error
  • Linear BLEU
  • (Expected BLEU)
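To see why plain BLEU breaks at the sentence level, here is a sketch of sentence BLEU; the add-one style smoothing is a common workaround, not something this slide specifies:

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, N=4, smooth=1.0):
    """Sentence-level BLEU.  With smooth=0 this is plain BLEU, where the
    4-gram precision is often 0 for a single sentence and the whole
    score collapses to 0; smoothing keeps it usable per sentence."""
    hyp, ref = hyp.split(), ref.split()
    logp = 0.0
    for n in range(1, N + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())   # clipped matches
        total = max(sum(h.values()), 1)
        if match + smooth == 0:
            return 0.0                                    # p_n = 0 ⇒ BLEU = 0
        logp += math.log((match + smooth) / (total + smooth)) / N
    bp = min(0.0, 1.0 - len(ref) / max(len(hyp), 1))      # brevity penalty
    return math.exp(bp + logp)

print(sentence_bleu("I saw black cat", "I saw a black cat"))
```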
16. Linear BLEU
• Linear approximation of change in BLEU
  • c: sum of sentence lengths, m_n: # matched n-grams
• Add one sentence: (c, m_n) → (c′, m_n′)
  Δ log BLEU ≈ Σ_n (∂ log BLEU / ∂m_n)·Δm_n + (∂ log BLEU / ∂c)·Δc
• Linear BLEU error of candidate e (a sketch follows after the figure below)
[Figure: log BLEU as a function of the corpus statistics; adding one sentence moves (c, m_n) to (c′, m_n′), and the change Δ is determined by the # matched n-grams in e]
e