MT Study Group

Supervised Phrase Table Triangulation
with Neural Word Embeddings for Low-Resource Languages

Tomer Levinboim and David Chiang
Proc. of EMNLP 2015, Lisbon, Portugal

Introduced by Akiva Miura, AHC-Lab
15/10/15
Contents

1. Introduction
2. Preliminaries
3. Supervised Word Translations
4. Experiments
5. Conclusion
6. Impression
1. Introduction

Problem: Scarceness of Bilingual Data

• PBMT systems require considerable amounts of source-target parallel data to produce good-quality translation
  ➢ A triangulated source-target phrase table can be composed from a source-pivot and a pivot-target phrase table, but it is still noisy
• This paper presents a supervised learning technique that improves noisy phrase translation scores by extracting word translation distributions from small amounts of bilingual data
  ➢ The method yields improvements on Malagasy-to-French and Spanish-to-French translation tasks via English
2. Preliminaries

Notation:
• s, p, t: words in the source, pivot, and target languages, respectively
• s, p, t (boldface in the paper): phrases in the source, pivot, and target languages
• T: a phrase table estimated over a parallel corpus
• T̂: a triangulated phrase table
• φ: phrase translation features
• lex: lexical-weighting features
• w: word translation probabilities
Similar (hatted) notation is used for the triangulated table's respective features.
2.1 Triangulation (weak baseline)

• In phrase table triangulation, a source-target phrase table T̂st is constructed by combining a source-pivot and a pivot-target phrase table Tsp, Tpt, each estimated on its respective parallel data
• Combining alignments: for each resulting phrase pair (s, t), an alignment â is computed as the most frequent alignment obtained by combining the source-pivot and pivot-target alignments asp and apt across all pivot phrases p, as follows:

  $\hat{a} = \{(s, t) \mid \exists p : (s, p) \in a_{sp} \wedge (p, t) \in a_{pt}\}$

• The triangulated source-to-target lexical weights, denoted $\widehat{\mathrm{lex}}_{st}$, are approximated in two steps. First, word translation scores ŵst are approximated by marginalizing over the pivot words:

  $\hat{w}_{st}(t \mid s) = \sum_p w_{sp}(p \mid s) \cdot w_{pt}(t \mid p)$  (1)

• Next, given a (triangulated) phrase pair (s, t) with alignment â, let $\hat{a}_{s,:} = \{t \mid (s, t) \in \hat{a}\}$; the lexical-weighting probability is (Koehn et al., 2003):

  $\widehat{\mathrm{lex}}_{st}(\mathbf{t} \mid \mathbf{s}, \hat{a}) = \prod_{s \in \mathbf{s}} \frac{1}{|\hat{a}_{s,:}|} \sum_{t \in \hat{a}_{s,:}} \hat{w}_{st}(t \mid s)$  (2)

• The triangulated phrase translation scores, denoted φ̂st, are computed by analogy with Eq. 1
• These scores are also computed in the reverse direction by swapping the source and target languages (a sketch of Eqs. 1-2 follows below)
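To make Eqs. 1-2 concrete, here is a minimal Python sketch (the data structures and function names are mine, not code from the paper): word translation tables are nested dicts, and phrases are token lists with an index-pair alignment.

```python
from collections import defaultdict

def triangulate_word_scores(w_sp, w_pt):
    """Eq. 1: w_hat(t|s) = sum_p w_sp(p|s) * w_pt(t|p).
    w_sp[s][p] and w_pt[p][t] are word translation probabilities
    estimated from the source-pivot and pivot-target parallel data."""
    w_hat = defaultdict(dict)
    for s, pivots in w_sp.items():
        for p, prob_sp in pivots.items():
            for t, prob_pt in w_pt.get(p, {}).items():
                w_hat[s][t] = w_hat[s].get(t, 0.0) + prob_sp * prob_pt
    return w_hat

def lexical_weight(src, tgt, alignment, w_hat):
    """Eq. 2: for each source word, average w_hat over the target words
    it is aligned to, then take the product over the source words.
    alignment is a set of (i, j) index pairs into the two phrases;
    in practice unaligned words are handled by alignment to NULL
    (Koehn et al., 2003)."""
    weight = 1.0
    for i, s in enumerate(src):
        aligned = [tgt[j] for (si, j) in alignment if si == i]
        if aligned:
            weight *= sum(w_hat[s].get(t, 0.0) for t in aligned) / len(aligned)
    return weight
```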
  
2.2 Interpolation (strong baseline)

• Given access to source-target data, an ordinary source-target phrase table Tst can be estimated directly
• Wu and Wang (2007) suggest interpolating the phrase pair entries that occur in both tables:

  $T_{\mathrm{interp}} = \alpha\, T_{st} + (1 - \alpha)\, \hat{T}_{st}$  (3)

• Phrase pairs appearing in only one phrase table are added as-is; the resulting table is called the interpolated phrase table (a sketch follows below)
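A sketch of Eq. 3 over a simplified table representation (a dict from phrase pairs to a single score; this is my simplification, since real phrase tables carry several feature columns and the interpolation would be applied per feature):

```python
def interpolate_tables(T_st, T_hat_st, alpha):
    """Eq. 3: entries occurring in both tables are mixed with weight
    alpha; entries in only one table are added as-is (Wu and Wang, 2007)."""
    T_interp = {}
    for pair in set(T_st) | set(T_hat_st):
        if pair in T_st and pair in T_hat_st:
            T_interp[pair] = alpha * T_st[pair] + (1 - alpha) * T_hat_st[pair]
        elif pair in T_st:
            T_interp[pair] = T_st[pair]
        else:
            T_interp[pair] = T_hat_st[pair]
    return T_interp
```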
  
3. Supervised Word Translations

• The effect of interpolation (Eq. 3) is limited to phrase pairs appearing in both phrase tables
• The idea of this paper is to regard word translation distributions derived from source-target bilingual data (through word alignments or dictionary entries) as the correct translation distributions, and to use them for discriminative learning:
  • correct target words should become likely translations
  • incorrect ones should be down-weighted
➢ To generalize beyond the vocabulary of the source-target data, the authors appeal to word embeddings
• The formulation is presented in the source-to-target direction; the target-to-source direction is obtained simply by swapping the source and target languages
3.1 Model

Defining:
• c^sup_st: the number of times source word s was aligned to target word t (in word alignment, or in the dictionary)
• w^sup(t | s) = c^sup_st / c^sup_s: the word translation distributions, where c^sup_s = Σ_t c^sup_st
• q(t | s): the word translation probabilities we wish to learn

• We consider maximizing the log-likelihood function:

  $\arg\max_q L(q) = \arg\max_q \sum_{(s,t)} c^{\sup}_{st} \log q(t \mid s)$

• Clearly, the solution q(· | s) := w^sup(· | s) maximizes L
➢ However, we would like a solution that generalizes to source words s beyond those observed in the source-target corpus; in particular, those source words that appear in the triangulated phrase table T̂ but not in T (a sketch of w^sup and L(q) follows below)
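A minimal sketch (helper names are mine) of the supervised distributions and the log-likelihood defined above:

```python
import math
from collections import Counter

def supervised_distribution(aligned_pairs):
    """w_sup(t|s) = c_sup_st / c_sup_s, from a list of aligned (s, t)
    word pairs taken from word alignments or dictionary entries."""
    c_st = Counter(aligned_pairs)
    c_s = Counter(s for s, _ in aligned_pairs)
    return {(s, t): c / c_s[s] for (s, t), c in c_st.items()}

def log_likelihood(c_st, q):
    """L(q) = sum over (s, t) of c_sup_st * log q(t|s)."""
    return sum(c * math.log(q[(s, t)]) for (s, t), c in c_st.items())
```

Setting q := w^sup maximizes this by construction, which is exactly why the parameterized, generalizing form on the next slide is needed.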
  
3.1 Model (cont'd)

• In order to generalize, we abstract from words to vector representations of words
➢ We constrain q to the following parameterization:

  $q(t \mid s) = \frac{1}{Z_s} \exp\left(v_s^{\top} A\, v_t + f_{st}^{\top} h\right)$
  $Z_s = \sum_{t \in T(s)} \exp\left(v_s^{\top} A\, v_t + f_{st}^{\top} h\right)$

• v_s, v_t: vectors of monolingual features (word embeddings)
• f_st: a vector of bilingual features (triangulated scores)
• A, h: parameters to be learned

• In this work, f_st := ŵst, so that the matrix A is a linear transformation between the source and target embedding spaces, and h (now a scalar) quantifies how much the triangulated scores ŵ are to be trusted
• For normalization: in the factor Z_s, t ranges only over the possible translations of s suggested by either w^sup or the triangulated word probabilities, that is:

  $T(s) = \{t \mid w^{\sup}(t \mid s) > 0 \;\vee\; \hat{w}(t \mid s) > 0\}$

  This restriction makes efficient computation possible; otherwise the normalization term would have to be computed over the entire target vocabulary
➢ Under this parameterization, our goal is to solve the following maximization problem (a sketch of q follows below):

  $\max_{A,h} L(A, h) = \max_{A,h} \sum_{s,t} c^{\sup}_{st} \log q(t \mid s)$  (4)
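A numpy sketch of the parameterized q(t | s) (names are mine; V_src and V_tgt map words to embedding vectors, and w_hat holds the triangulated scores, so f_st is a single triangulated score as in the paper):

```python
import numpy as np

def q_dist(s, candidates, V_src, V_tgt, w_hat, A, h):
    """q(t|s) = exp(v_s^T A v_t + f_st * h) / Z_s, with t restricted
    to candidates = T(s), the translations suggested by w_sup or the
    triangulated probabilities."""
    v_s = V_src[s]
    scores = np.array([v_s @ A @ V_tgt[t] + w_hat.get((s, t), 0.0) * h
                       for t in candidates])
    scores -= scores.max()          # stabilize the softmax numerically
    p = np.exp(scores)
    return dict(zip(candidates, p / p.sum()))
```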
3.2 Optimization

• The objective function in Eq. 4 is concave in both A and h: after taking the log, we are left with a weighted sum of linear and concave (negative log-sum-exp) terms in A and h
➢ We can therefore reach the global solution of the problem using gradient descent
• Taking derivatives, the gradient is

  $\frac{\partial L}{\partial A} = \sum_{s,t} m_{st}\, v_s v_t^{\top} \qquad \frac{\partial L}{\partial h} = \sum_{s,t} m_{st}\, f_{st}$

  where the scalar $m_{st} = c^{\sup}_{st} - c^{\sup}_{s}\, q(t \mid s)$ for the current value of q
• For quick results, the authors limited the number of gradient steps to 200 and selected the iteration that minimized the total variation distance to w^sup over a held-out dev set:

  $\sum_s \left\| q(\cdot \mid s) - w^{\sup}(\cdot \mid s) \right\|_1$  (5)

• A better convergence rate was obtained by using a batch version of the effective and easy-to-implement Adagrad technique (Duchi et al., 2011); see Figure 1 (a sketch of the updates follows below)
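A sketch of the batch updates under these formulas (standard Adagrad ascent, not code released with the paper; eta and eps are hypothetical hyperparameters). Note that m_st is nonzero even for unobserved pairs, so the sums run over all candidate pairs (s, t) with t in T(s):

```python
import numpy as np

def gradients(candidate_pairs, c_st, c_s, q, V_src, V_tgt, w_hat):
    """dL/dA = sum m_st v_s v_t^T and dL/dh = sum m_st f_st, with
    m_st = c_sup_st - c_sup_s * q(t|s) and f_st = w_hat(t|s).
    q must cover every candidate pair (see q_dist above)."""
    ds = len(next(iter(V_src.values())))
    dt = len(next(iter(V_tgt.values())))
    grad_A, grad_h = np.zeros((ds, dt)), 0.0
    for (s, t) in candidate_pairs:
        m = c_st.get((s, t), 0.0) - c_s[s] * q[(s, t)]
        grad_A += m * np.outer(V_src[s], V_tgt[t])
        grad_h += m * w_hat.get((s, t), 0.0)
    return grad_A, grad_h

def adagrad_step(A, h, grad_A, grad_h, G_A, G_h, eta=0.1, eps=1e-8):
    """One Adagrad (Duchi et al., 2011) update: per-coordinate learning
    rates from accumulated squared gradients; ascent, since L is maximized."""
    G_A += grad_A ** 2
    G_h += grad_h ** 2
    A += eta * grad_A / (np.sqrt(G_A) + eps)
    h += eta * grad_h / (np.sqrt(G_h) + eps)
    return A, h, G_A, G_h

def total_variation(q_s, w_s):
    """Summand of Eq. 5 for one source word s: ||q(.|s) - w_sup(.|s)||_1."""
    return sum(abs(q_s.get(t, 0.0) - w_s.get(t, 0.0))
               for t in set(q_s) | set(w_s))
```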
3.2 Optimization (cont'd)

[Figure 1: The (target-to-source) objective function per iteration. Applying batch Adagrad (blue) significantly accelerates convergence.]
3.3 Re-estimating lexical weights

• Having learned the model (A and h), we can now use q(t | s) to estimate the lexical weights (Eq. 2) of any aligned phrase pair (s, t, â), assuming it is composed of embeddable words
• However, the authors found the supervised word translation scores q to be too sharp, sometimes assigning all probability mass to a single target word
➢ They therefore interpolated q with the triangulated word translation scores ŵ:

  $q_{\beta} = \beta\, q + (1 - \beta)\, \hat{w}$  (6)

• To integrate the lexical weights induced by q_β (Eq. 2), they simply appended them as new features in the phrase table, in addition to the existing lexical weights; following this, they tuned the β value that maximizes BLEU on a tune set (a sketch follows below)
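Eq. 6 is a convex combination per word pair; a sketch for one source word (β = 0.95 is the value later fixed for Spanish-French). The smoothed q_β then feeds the same lexical-weight computation sketched in §2.1:

```python
def smooth(q_s, w_hat_s, beta=0.95):
    """Eq. 6 for one source word s:
    q_beta(.|s) = beta * q(.|s) + (1 - beta) * w_hat(.|s)."""
    ts = set(q_s) | set(w_hat_s)
    return {t: beta * q_s.get(t, 0.0) + (1 - beta) * w_hat_s.get(t, 0.0)
            for t in ts}
```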
3.4 Summary of method

In summary, to improve upon a triangulated or interpolated phrase table, the authors:

1. Learn word translation distributions q by supervision against distributions w^sup derived from the source-target bilingual data (§3.1)
2. Smooth the learned distributions q by interpolating with the triangulated word translation scores ŵ (§3.3)
3. Compute new lexical weights and append them to the phrase table (§3.3)
4. Experiments

• To test the proposed method, the authors conducted two low-resource translation experiments using the Moses phrase-based MT system (Koehn et al., 2007)

Translation Tasks:
• Fixing the pivot language to English, they applied the method in two data scenarios:
1. Spanish-to-French: two related languages, used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3)
2. Malagasy-to-French: two unrelated languages for which they have a small dictionary, but no parallel corpus (aside from tuning and testing data). The baseline is triangulation alone, as there is no source-target model to interpolate with
4.1 Data

Datasets:
• European-language bitexts were extracted from Europarl (Koehn, 2005)
• For Malagasy-English, the Global Voices parallel data available online was used
• The Malagasy-French dictionary was extracted from online resources, and the small Malagasy-French tune/test sets were extracted from Global Voices

Table 1: Bilingual datasets, in lines of data (sp=Spanish, fr=French, en=English, mg=Malagasy):

  language pair   train   tune   test
  sp-fr           4k      1.5k   1.5k
  mg-fr           1.1k    1.2k   1.2k
  sp-en           50k     –      –
  mg-en           100k    –      –
  en-fr           50k     –      –

Table 2: Size of the monolingual corpus per language, in number of tokens:

  language   tokens
  French     1.5G
  Spanish    1.4G
  Malagasy   58M

• word2vec was used to generate the word embeddings from the monolingual data
4.2 Spanish-French Results

• To produce w^sup, the authors aligned the small Spanish-French parallel corpus in both directions and symmetrized using the intersection heuristic, to obtain high-precision alignments (the often-used grow-diag-final-and heuristic is optimized for phrase extraction, not precision)
• The skip-gram model was used to estimate the Spanish and French word embeddings, with dimension d = 200 and context window w = 5 (the defaults); an embedding-training sketch follows below. Source and target words that either did not appear in the triangulation or did not have an embedding were filtered out
• Words that appeared more than 10 times in the parallel corpus formed the training set (~690 words), and words that appeared 5–9 times formed the held-out dev set (~530 words); this was done in both source-target and target-source directions
• They fixed β := 0.95 to examine the effect of the supervised method

Table 3: Average total variation distance (Eq. 5) to the dev set portion of w^sup, computed only over words whose translations in w^sup appear in the triangulation. Using word embeddings, the method generalizes better on the dev set:

  Method          source→target   target→source
  triangulation   71.6%           72.0%
  our scores      30.2%           33.8%

Table 4: Spanish-French BLEU scores. Appending lexical weights obtained by supervision over a small source-target corpus significantly outperforms phrase table interpolation (Eq. 3) by +0.7 BLEU:

  Method                     α     tune   test
  source-target              –     26.8   25.3
  triangulation              –     29.2   28.4
  interpolation              0.7   30.2   29.2
  interpolation+our scores   0.6   30.8   29.9
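A sketch of the embedding training under the stated settings, assuming gensim's word2vec implementation (the paper only says word2vec with these hyperparameters; the corpus iterator is a placeholder):

```python
from gensim.models import Word2Vec

# sentences: an iterable over tokenized monolingual sentences (placeholder)
model = Word2Vec(sentences,
                 vector_size=200,  # d = 200 (named `size` in gensim < 4.0)
                 window=5,         # context window w = 5
                 sg=1,             # skip-gram rather than CBOW
                 min_count=5)
vec = model.wv["palabra"]          # a 200-dim vector, if the word occurred
```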
4.3 Malagasy-French Results

• The w^sup distributions used for supervision were taken to be uniform distributions over the dictionary translations
  • For each training direction, a 70%/30% split of the dictionary formed the train and dev sets
• Having significantly less Malagasy monolingual data, the skip-gram model was trained with d = 100 dimensional embeddings and a w = 3 context window for both Malagasy and French
• As before, the supervised lexical weights were added as new features in the phrase table; however, instead of fixing β = 0.95 as above, β ∈ {0.9, 0.8, 0.7, 0.6} was searched in Eq. 6 to maximize BLEU on a small tune set
• Using only a dictionary, they improve over triangulation by +0.5 BLEU, a statistically significant difference (p < 0.01)

Table 5: Malagasy-French BLEU. Supervision with a dictionary significantly improves upon simple triangulation by +0.5 BLEU:

  Method                     β     tune   test
  triangulation              –     12.2   11.1
  triangulation+our scores   0.6   12.4   11.6
5. Conclusion

In this paper:
• The authors argued that constructing a triangulated phrase table independently of even very limited source-target data underutilizes that parallel data
➢ They designed a supervised learning algorithm that relies on word translation distributions derived from the parallel data, as well as on a distributed representation of words (embeddings)
➢ The latter enables the algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data
• The model with the new lexical weights demonstrates improvements in MT quality on two tasks, despite the fact that w^sup was estimated automatically, or even naïvely as uniform distributions
6. Impression
End Slide

 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

[Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

  • 1. MT Study Group: Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages. Tomer Levinboim and David Chiang. Proc. of EMNLP 2015, Lisbon, Portugal. Introduced by Akiva Miura, AHC-Lab, IS, NAIST (15/10/15).
  • 2. Contents
    1. Introduction
    2. Preliminaries
    3. Supervised Word Translations
    4. Experiments
    5. Conclusion
    6. Impression
  • 3. 1. Introduction
    Problem: scarceness of bilingual data
    - PBMT systems require considerable amounts of source-target parallel data to produce good-quality translations.
    → A triangulated source-target phrase table can be composed from a source-pivot and a pivot-target phrase table, but it is still noisy.
    - This paper presents a supervised learning technique that improves noisy phrase translation scores by extracting word translation distributions from small amounts of bilingual data.
    → The method yields improvements on Malagasy-to-French and Spanish-to-French translation tasks pivoting through English.
  • 4. 2. Preliminaries
    Notation:
    - $s, p, t$: words in the source, pivot, and target languages, respectively
    - $\mathbf{s}, \mathbf{p}, \mathbf{t}$: phrases in the source, pivot, and target languages, respectively
    - $T$: a phrase table estimated over a parallel corpus
    - $\hat{T}$: a triangulated phrase table
    - $\phi$: phrase translation features
    - $\mathrm{lex}$: lexical-weighting features
    - $w$: word translation probabilities
    (Similar notation is used for the triangulated counterparts $\hat{\phi}$, $\widehat{\mathrm{lex}}$, $\hat{w}$.)
  • 5. 2.1 Triangulation (weak baseline)
    - A source-target phrase table $T_{st}$ is constructed by combining a source-pivot and a pivot-target phrase table $T_{sp}$, $T_{pt}$, each estimated on its respective parallel data.
    - For each resulting phrase pair $(\mathbf{s}, \mathbf{t})$, an alignment $\hat{a}$ is also computed, as the most frequent alignment obtained by combining the source-pivot and pivot-target alignments $a_{sp}$, $a_{pt}$ across all pivot phrases $\mathbf{p}$:
      $\hat{a} = \{(s, t) \mid \exists p : (s, p) \in a_{sp} \wedge (p, t) \in a_{pt}\}$
    - Lexical weighting is approximated in two steps. First, word translation scores $\hat{w}_{st}$ are obtained by marginalizing over the pivot words:
      $\hat{w}_{st}(t \mid s) = \sum_p w_{sp}(p \mid s) \cdot w_{pt}(t \mid p) \quad (1)$
    - Next, given a (triangulated) phrase pair $(\mathbf{s}, \mathbf{t})$ with alignment $\hat{a}$, let $\hat{a}_{s,:} = \{t \mid (s, t) \in \hat{a}\}$; the lexical-weighting probability is (Koehn et al., 2003):
      $\widehat{\mathrm{lex}}_{st}(\mathbf{t} \mid \mathbf{s}, \hat{a}) = \prod_{s \in \mathbf{s}} \frac{1}{|\hat{a}_{s,:}|} \sum_{t \in \hat{a}_{s,:}} \hat{w}_{st}(t \mid s) \quad (2)$
    - The triangulated phrase translation scores $\hat{\phi}_{st}$ are computed by analogy with Eq. 1.
    - The same scores are also computed in the reverse direction by swapping the source and target languages.
    (A minimal code sketch of Eqs. 1 and 2 follows this slide.)
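To make Eqs. 1 and 2 concrete, here is a minimal Python sketch, assuming the word translation tables are nested dicts mapping each word to its translation distribution; all function and variable names are hypothetical, not from the authors' code.

```python
from collections import defaultdict

def triangulate_word_scores(w_sp, w_pt):
    """Eq. 1: w_hat(t | s) = sum_p w_sp(p | s) * w_pt(t | p)."""
    w_hat = defaultdict(dict)
    for s, pivots in w_sp.items():
        for p, prob_sp in pivots.items():
            for t, prob_pt in w_pt.get(p, {}).items():
                w_hat[s][t] = w_hat[s].get(t, 0.0) + prob_sp * prob_pt
    return w_hat

def lex_weight(src_words, tgt_words, alignment, w_hat):
    """Eq. 2: lexical weighting for one aligned phrase pair.
    `alignment` is a set of (i, j) word-index pairs into the two phrases."""
    score = 1.0
    for i, s in enumerate(src_words):
        aligned = [j for (i2, j) in alignment if i2 == i]
        if not aligned:
            continue  # Koehn et al. pair unaligned words with NULL; omitted in this sketch
        score *= sum(w_hat.get(s, {}).get(tgt_words[j], 0.0)
                     for j in aligned) / len(aligned)
    return score
```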
  • 6. 2.2 Interpolation (strong baseline)
    - Given access to source-target data, an ordinary source-target phrase table $T_{st}$ can be estimated directly.
    - Wu and Wang (2007) suggest interpolating the entries of phrase pairs that occur in both tables (sketched below):
      $T_{\mathrm{interp}} = \alpha T_{st} + (1 - \alpha) \hat{T}_{st} \quad (3)$
    - Phrase pairs appearing in only one phrase table are added as-is. The result is referred to as the interpolated phrase table.
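A minimal sketch of the interpolation in Eq. 3, assuming each phrase table is flattened to a dict from (source phrase, target phrase) pairs to a single feature value; names are hypothetical.

```python
def interpolate_tables(T_st, T_hat_st, alpha=0.7):
    """Eq. 3: entry-wise interpolation of two phrase tables."""
    merged = {}
    for pair in set(T_st) | set(T_hat_st):
        if pair in T_st and pair in T_hat_st:
            merged[pair] = alpha * T_st[pair] + (1 - alpha) * T_hat_st[pair]
        else:
            merged[pair] = T_st.get(pair, T_hat_st.get(pair))  # added as-is
    return merged
```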
  • 7. 3. Supervised Word Translation
    - The effect of interpolation (Eq. 3) is limited to phrase pairs appearing in both phrase tables.
    - The idea of this paper is to regard word translation distributions derived from source-target bilingual data (through word alignments or dictionary entries) as the correct translation distributions, and to use them for discriminative learning:
      • correct target words should become likely translations
      • incorrect ones should be down-weighted
    → To generalize beyond the vocabulary of the source-target data, the authors appeal to word embeddings.
  • 8. 3.1 Model
    Definitions:
    - $c^{\mathrm{sup}}_{st}$: the number of times source word $s$ was aligned to target word $t$ (in the word alignment, or in the dictionary)
    - $w^{\mathrm{sup}}(t \mid s) = c^{\mathrm{sup}}_{st} / c^{\mathrm{sup}}_s$, where $c^{\mathrm{sup}}_s = \sum_t c^{\mathrm{sup}}_{st}$: the supervised word translation distributions (see the sketch after this slide)
    - $q(t \mid s)$: the word translation probabilities we wish to learn
    - We consider maximizing the log-likelihood function:
      $\arg\max_q L(q) = \arg\max_q \sum_{(s,t)} c^{\mathrm{sup}}_{st} \log q(t \mid s)$
    - Clearly, the solution $q(\cdot \mid s) := w^{\mathrm{sup}}(\cdot \mid s)$ maximizes $L$.
    → However, we would like a solution that generalizes to source words $s$ beyond those observed in the source-target corpus, in particular those that appear in the triangulated phrase table $\hat{T}$ but not in $T$.
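A short sketch of how $w^{\mathrm{sup}}$ would be derived from alignment (or dictionary) counts, assuming `counts[s][t]` holds $c^{\mathrm{sup}}_{st}$; names are hypothetical.

```python
def supervised_distributions(counts):
    """w_sup(t | s) = c_st / c_s, where counts[s][t] is the number of
    times source word s was aligned to target word t (or a dictionary hit)."""
    w_sup = {}
    for s, row in counts.items():
        total = sum(row.values())
        w_sup[s] = {t: c / total for t, c in row.items()}
    return w_sup
```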
  • 9. 3.1 Model (cont'd)
    - In order to generalize, we abstract from words to vector representations of words.
    → We constrain $q$ to the following parameterization (sketched below):
      $q(t \mid s) = \frac{1}{Z_s} \exp\left(v_s^\top A v_t + f_{st}^\top h\right)$, where
      $Z_s = \sum_{t \in T(s)} \exp\left(v_s^\top A v_t + f_{st}^\top h\right)$
    - $v_s, v_t$: vectors of monolingual features (word embeddings)
    - $f_{st}$: a vector of bilingual features; here it contains only the triangulated score, so $f_{st} := \hat{w}_{st}$
    - $A, h$: parameters to be learned ($A$ is a linear transformation between the source and target embedding spaces, and $h$, now a scalar, quantifies how much the triangulated scores $\hat{w}$ are to be trusted)
    - For normalization, $t$ ranges only over the possible translations of $s$ suggested by either $w^{\mathrm{sup}}$ or the triangulated word probabilities, which makes efficient computation possible:
      $T(s) = \{t \mid w^{\mathrm{sup}}(t \mid s) > 0 \vee \hat{w}(t \mid s) > 0\}$
    → Under this parameterization, the goal is to solve the following maximization problem:
      $\max_{A,h} L(A, h) = \max_{A,h} \sum_{s,t} c^{\mathrm{sup}}_{st} \log q(t \mid s) \quad (4)$
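A sketch of the constrained parameterization, assuming `v` maps words to NumPy embedding vectors and `f_hat` holds the triangulated scores $\hat{w}$; the max-subtraction is a standard softmax stabilization, not something the paper specifies, and all names are hypothetical.

```python
import numpy as np

def q_dist(s, cand_ts, v, A, h, f_hat):
    """q(t | s) over the candidate set T(s): a softmax over
    v_s^T A v_t + h * w_hat(t | s), f_st being the scalar triangulated score."""
    scores = np.array([v[s] @ A @ v[t] + h * f_hat.get(s, {}).get(t, 0.0)
                       for t in cand_ts])
    scores -= scores.max()          # numerical stabilization of the softmax
    expd = np.exp(scores)
    return dict(zip(cand_ts, expd / expd.sum()))
```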
  • 10. 3.2 Optimization
    - The objective function in Eq. 4 is concave in both $A$ and $h$: after taking the log, we are left with a weighted sum of linear and concave (negative log-sum-exp) terms in $A$ and $h$.
    → We can therefore reach the global solution of the problem using gradient descent (a sketch follows this slide).
    - Taking derivatives, the gradient is
      $\frac{\partial L}{\partial A} = \sum_{s,t} m_{st} \, v_s v_t^\top$, $\quad \frac{\partial L}{\partial h} = \sum_{s,t} m_{st} \, f_{st}$,
      where the scalar $m_{st} = c^{\mathrm{sup}}_{st} - c^{\mathrm{sup}}_s \, q(t \mid s)$ for the current value of $q$.
    - For quick results, the authors limited the number of gradient steps to 200 and selected the iteration that minimized the total variation distance to $w^{\mathrm{sup}}$ over a held-out dev set:
      $\sum_s \| q(\cdot \mid s) - w^{\mathrm{sup}}(\cdot \mid s) \|_1 \quad (5)$
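A sketch of one batch ascent step and the Eq. 5 selection criterion, reusing the hypothetical `q_dist` from the previous sketch; the plain learning rate here stands in for the batch Adagrad update the authors actually report using.

```python
import numpy as np

def gradient_step(counts, v, A, h, f_hat, lr=0.1):
    """One batch ascent step on Eq. 4, with m_st = c_sup_st - c_sup_s * q(t | s)."""
    grad_A, grad_h = np.zeros_like(A), 0.0
    for s, row in counts.items():
        c_s = sum(row.values())
        cand = sorted(set(row) | set(f_hat.get(s, {})))   # the candidate set T(s)
        q = q_dist(s, cand, v, A, h, f_hat)
        for t in cand:
            m = row.get(t, 0) - c_s * q[t]
            grad_A += m * np.outer(v[s], v[t])
            grad_h += m * f_hat.get(s, {}).get(t, 0.0)
    return A + lr * grad_A, h + lr * grad_h               # plain step, not Adagrad

def tv_distance(q_all, w_sup):
    """Eq. 5: sum_s || q(. | s) - w_sup(. | s) ||_1 over a held-out dev set."""
    total = 0.0
    for s, ref in w_sup.items():
        support = set(ref) | set(q_all.get(s, {}))
        total += sum(abs(q_all.get(s, {}).get(t, 0.0) - ref.get(t, 0.0))
                     for t in support)
    return total
```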
  • 11. 3.2 Optimization (cont'd)
    [Figure 1: The (target-to-source) objective function per iteration. Applying batch Adagrad (blue) significantly accelerates convergence.]
  • 12. 3.3 Re-estimating lexical weights
    - Having learned the model ($A$ and $h$), we can now use $q(t \mid s)$ to estimate the lexical weights (Eq. 2) of any aligned phrase pair $(\mathbf{s}, \mathbf{t}, \hat{a})$, assuming it is composed of embeddable words.
    - However, the authors found the supervised word translation scores $q$ to be too sharp, sometimes assigning all probability mass to a single target word.
    → They therefore interpolated $q$ with the triangulated word translation scores (sketched below):
      $q_\beta = \beta q + (1 - \beta) \hat{w} \quad (6)$
    - To integrate the lexical weights induced by $q_\beta$ (Eq. 2), they simply appended them as new features in the phrase table, in addition to the existing lexical weights.
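A sketch of the Eq. 6 smoothing for a single source word; `q_s` and `w_hat_s` are hypothetical per-word distributions (target word to probability).

```python
def smooth(q_s, w_hat_s, beta=0.95):
    """Eq. 6: q_beta = beta * q + (1 - beta) * w_hat, for one source word."""
    support = set(q_s) | set(w_hat_s)
    return {t: beta * q_s.get(t, 0.0) + (1 - beta) * w_hat_s.get(t, 0.0)
            for t in support}
```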
  • 13. 3.4 Summary of method
    In summary, to improve upon a triangulated or interpolated phrase table, the authors:
    1. Learn word translation distributions $q$ by supervision against distributions $w^{\mathrm{sup}}$ derived from the source-target bilingual data (§3.1).
    2. Smooth the learned distributions $q$ by interpolating with the triangulated word translation scores $\hat{w}$ (§3.3).
    3. Compute new lexical weights and append them to the phrase table (§3.3).
  • 14. 4. Experiments
    - To test the proposed method, the authors conducted two low-resource translation experiments using the Moses phrase-based MT system.
    Translation tasks: fixing the pivot language to English, they applied the method in two data scenarios:
    1. Spanish-to-French: two related languages, used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3).
    2. Malagasy-to-French: two unrelated languages for which they have a small dictionary but no parallel corpus (aside from tuning and testing data). The baseline is triangulation alone, since there is no source-target model to interpolate with.
  • 15. 4.1 Data
    Datasets:
    - European-language bitexts were extracted from Europarl (Koehn, 2005).
    - For Malagasy-English, the Global Voices parallel data available online was used.
    - The Malagasy-French dictionary was extracted from online resources, and the small Malagasy-French tune/test sets were extracted from Global Voices.

    Table 1: Bilingual datasets (lines of data). Legend: sp=Spanish, fr=French, en=English, mg=Malagasy.

    language pair | train | tune | test
    sp-fr         | 4k    | 1.5k | 1.5k
    mg-fr         | 1.1k  | 1.2k | 1.2k
    sp-en         | 50k   | –    | –
    mg-en         | 100k  | –    | –
    en-fr         | 50k   | –    | –

    Table 2: Size of the monolingual corpus per language, in number of tokens: French 1.5G, Spanish 1.4G, Malagasy 58M. word2vec was used to generate the word embeddings.
  • 16. 4.2 Spanish-French Results
    - To produce $w^{\mathrm{sup}}$, the authors aligned the small Spanish-French parallel corpus in both directions and symmetrized using the intersection heuristic, to obtain high-precision alignments (the often-used grow-diag-final-and heuristic is optimized for phrase extraction, not precision).
    - The skip-gram model was used to estimate the Spanish and French word embeddings, with dimension d = 200 and context window w = 5 (default); see the sketch after this slide.
    - They took words that appeared more than 10 times in the parallel corpus for the training set (~690 words), and between 5 and 9 times for the held-out dev set (~530 words), in both source-target and target-source directions.
    - They fixed β := 0.95 to examine the effect of the supervised method.

    Table 3: Average total variation distance (Eq. 5) to the dev-set portion of $w^{\mathrm{sup}}$ (computed only over words whose translations in $w^{\mathrm{sup}}$ appear in the triangulation). Using word embeddings, the method generalizes better on the dev set.

    Method        | source→target | target→source
    triangulation | 71.6%         | 72.0%
    our scores    | 30.2%         | 33.8%

    Table 4: Spanish-French BLEU scores. Appending lexical weights obtained by supervision over a small source-target corpus significantly outperforms phrase table interpolation (Eq. 3) by +0.7 BLEU.

    Method                   | α   | tune | test
    source-target            | –   | 26.8 | 25.3
    triangulation            | –   | 29.2 | 28.4
    interpolation            | 0.7 | 30.2 | 29.2
    interpolation+our scores | 0.6 | 30.8 | 29.9
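The authors ran the word2vec tool itself; as an assumption, an equivalent skip-gram setup in gensim (not what they actually ran) might look like this, with the corpus path and min_count being placeholders.

```python
from gensim.models import Word2Vec

# Hypothetical corpus path; `sentences` must be an iterable of token lists.
sentences = [line.split() for line in open("mono.fr")]

# sg=1 selects skip-gram; vector_size/window follow the slide (d = 200, w = 5).
# min_count=5 is a placeholder; parameter names are those of gensim >= 4.
model = Word2Vec(sentences, sg=1, vector_size=200, window=5, min_count=5)
embeddings = {w: model.wv[w] for w in model.wv.index_to_key}  # word -> vector
```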
  • 17. 4.3 Malagasy-French Results
    - The $w^{\mathrm{sup}}$ distributions used for supervision were taken to be uniform distributions over the dictionary translations.
      • For each training direction, a 70%/30% split of the dictionary formed the train and dev sets.
    - Having significantly less Malagasy monolingual data, they used d = 100 dimensional embeddings and a w = 3 context window to estimate both the Malagasy and French word embeddings.
    - As before, the supervised lexical weights were added as new features in the phrase table; instead of fixing β = 0.95 as above, they searched over β ∈ {0.9, 0.8, 0.7, 0.6} in Eq. 6 to maximize BLEU on a small tune set.
    - Using only a dictionary, they improve over triangulation by +0.5 BLEU, a statistically significant difference (p < 0.01).

    Table 5: Malagasy-French BLEU. Supervision with a dictionary significantly improves upon simple triangulation by +0.5 BLEU.

    Method                   | β   | tune | test
    triangulation            | –   | 12.2 | 11.1
    triangulation+our scores | 0.6 | 12.4 | 11.6
  • 18. 5. Conclusion
    In this paper:
    - The authors argue that constructing a triangulated phrase table independently of even very limited source-target data underutilizes that parallel data.
    → They design a supervised learning algorithm that relies on word translation distributions derived from the parallel data, as well as a distributed representation of words (embeddings).
    → The latter enables the algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data.
    - The model with the newly generated lexical weights demonstrates improvements in MT quality on two tasks, despite the fact that the $w^{\mathrm{sup}}$ distributions were estimated automatically, or even naively as uniform distributions.
  • 19. 6. Impression
  • 20. End Slide