MT Study Group

Supervised Phrase Table Triangulation
with Neural Word Embeddings for Low-Resource Languages

Tomer Levinboim and David Chiang
Proc. of EMNLP 2015, Lisbon, Portugal

Introduced by Akiva Miura, AHC-Lab
15/10/15
Contents

1. Introduction
2. Preliminaries
3. Supervised Word Translations
4. Experiments
5. Conclusion
6. Impression
1. Introduction

Problem: Scarceness of Bilingual Data

• PBMT systems require considerable amounts of source-target parallel data to produce good-quality translation
  ➢ A triangulated source-target phrase table can be composed from a source-pivot and a pivot-target phrase table, but it is still noisy
• This paper presents a supervised learning technique that improves noisy phrase translation scores by extracting word translation distributions from small amounts of bilingual data
  ➢ The method yields improvements on Malagasy-to-French and Spanish-to-French translation tasks via English
2. Preliminaries

Notation:
• s, p, t: words in the source, pivot, and target languages, respectively
• s, p, t (boldface in the paper): phrases in the source, pivot, and target languages
• T: a phrase table estimated over a parallel corpus
• T̂: a triangulated phrase table
• φ: phrase translation features
• lex: lexical-weighting features
• w: word translation probabilities
Similar (hatted) notation is used for the triangulated table's respective features.
2.1 Triangulation (weak baseline)

• In phrase table triangulation, a source-target phrase table T̂st is constructed by combining a source-pivot and a pivot-target phrase table Tsp, Tpt, each estimated on its respective parallel data
• Combining alignments: for each resulting phrase pair (s, t), an alignment â is computed as the most frequent alignment obtained by combining the source-pivot and pivot-target alignments asp and apt across all pivot phrases p, as follows:

  $\hat{a} = \{(s, t) \mid \exists p : (s, p) \in a_{sp} \wedge (p, t) \in a_{pt}\}$

• The triangulated source-to-target lexical weights, denoted $\widehat{\mathrm{lex}}_{st}$, are approximated in two steps. First, word translation scores ŵst are approximated by marginalizing over the pivot words:

  $\hat{w}_{st}(t \mid s) = \sum_p w_{sp}(p \mid s) \cdot w_{pt}(t \mid p)$  (1)

• Next, given a (triangulated) phrase pair (s, t) with alignment â, let $\hat{a}_{s,:} = \{t \mid (s, t) \in \hat{a}\}$; the lexical-weighting probability is (Koehn et al., 2003):

  $\widehat{\mathrm{lex}}_{st}(\mathbf{t} \mid \mathbf{s}, \hat{a}) = \prod_{s \in \mathbf{s}} \frac{1}{|\hat{a}_{s,:}|} \sum_{t \in \hat{a}_{s,:}} \hat{w}_{st}(t \mid s)$  (2)

• The triangulated phrase translation scores, denoted φ̂st, are computed by analogy with Eq. 1
• These scores are also computed in the reverse direction by swapping the source and target languages (a sketch of Eqs. 1-2 follows below)
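To make Eqs. 1-2 concrete, here is a minimal Python sketch (the data structures and function names are mine, not code from the paper): word translation tables are nested dicts, and phrases are token lists with an index-pair alignment.

```python
from collections import defaultdict

def triangulate_word_scores(w_sp, w_pt):
    """Eq. 1: w_hat(t|s) = sum_p w_sp(p|s) * w_pt(t|p).
    w_sp[s][p] and w_pt[p][t] are word translation probabilities
    estimated from the source-pivot and pivot-target parallel data."""
    w_hat = defaultdict(dict)
    for s, pivots in w_sp.items():
        for p, prob_sp in pivots.items():
            for t, prob_pt in w_pt.get(p, {}).items():
                w_hat[s][t] = w_hat[s].get(t, 0.0) + prob_sp * prob_pt
    return w_hat

def lexical_weight(src, tgt, alignment, w_hat):
    """Eq. 2: for each source word, average w_hat over the target words
    it is aligned to, then take the product over the source words.
    alignment is a set of (i, j) index pairs into the two phrases;
    in practice unaligned words are handled by alignment to NULL
    (Koehn et al., 2003)."""
    weight = 1.0
    for i, s in enumerate(src):
        aligned = [tgt[j] for (si, j) in alignment if si == i]
        if aligned:
            weight *= sum(w_hat[s].get(t, 0.0) for t in aligned) / len(aligned)
    return weight
```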
  
2.2 Interpolation (strong baseline)

• Given access to source-target data, an ordinary source-target phrase table Tst can be estimated directly
• Wu and Wang (2007) suggest interpolating the phrase pair entries that occur in both tables:

  $T_{\mathrm{interp}} = \alpha\, T_{st} + (1 - \alpha)\, \hat{T}_{st}$  (3)

• Phrase pairs appearing in only one phrase table are added as-is; the resulting table is called the interpolated phrase table (a sketch follows below)
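A sketch of Eq. 3 over a simplified table representation (a dict from phrase pairs to a single score; this is my simplification, since real phrase tables carry several feature columns and the interpolation would be applied per feature):

```python
def interpolate_tables(T_st, T_hat_st, alpha):
    """Eq. 3: entries occurring in both tables are mixed with weight
    alpha; entries in only one table are added as-is (Wu and Wang, 2007)."""
    T_interp = {}
    for pair in set(T_st) | set(T_hat_st):
        if pair in T_st and pair in T_hat_st:
            T_interp[pair] = alpha * T_st[pair] + (1 - alpha) * T_hat_st[pair]
        elif pair in T_st:
            T_interp[pair] = T_st[pair]
        else:
            T_interp[pair] = T_hat_st[pair]
    return T_interp
```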
  
3. Supervised Word Translations

• The effect of interpolation (Eq. 3) is limited to phrase pairs appearing in both phrase tables
• The idea of this paper is to regard word translation distributions derived from source-target bilingual data (through word alignments or dictionary entries) as the correct translation distributions, and to use them for discriminative learning:
  • correct target words should become likely translations
  • incorrect ones should be down-weighted
➢ To generalize beyond the vocabulary of the source-target data, the authors appeal to word embeddings
• The formulation is presented in the source-to-target direction; the target-to-source direction is obtained simply by swapping the source and target languages
3.1 Model

Defining:
• c^sup_st: the number of times source word s was aligned to target word t (in word alignment, or in the dictionary)
• w^sup(t | s) = c^sup_st / c^sup_s: the word translation distributions, where c^sup_s = Σ_t c^sup_st
• q(t | s): the word translation probabilities we wish to learn

• We consider maximizing the log-likelihood function:

  $\arg\max_q L(q) = \arg\max_q \sum_{(s,t)} c^{\sup}_{st} \log q(t \mid s)$

• Clearly, the solution q(· | s) := w^sup(· | s) maximizes L
➢ However, we would like a solution that generalizes to source words s beyond those observed in the source-target corpus; in particular, those source words that appear in the triangulated phrase table T̂ but not in T (a sketch of w^sup and L(q) follows below)
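A minimal sketch (helper names are mine) of the supervised distributions and the log-likelihood defined above:

```python
import math
from collections import Counter

def supervised_distribution(aligned_pairs):
    """w_sup(t|s) = c_sup_st / c_sup_s, from a list of aligned (s, t)
    word pairs taken from word alignments or dictionary entries."""
    c_st = Counter(aligned_pairs)
    c_s = Counter(s for s, _ in aligned_pairs)
    return {(s, t): c / c_s[s] for (s, t), c in c_st.items()}

def log_likelihood(c_st, q):
    """L(q) = sum over (s, t) of c_sup_st * log q(t|s)."""
    return sum(c * math.log(q[(s, t)]) for (s, t), c in c_st.items())
```

Setting q := w^sup maximizes this by construction, which is exactly why the parameterized, generalizing form on the next slide is needed.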
  
3.1 Model (cont'd)

• In order to generalize, we abstract from words to vector representations of words
➢ We constrain q to the following parameterization:

  $q(t \mid s) = \frac{1}{Z_s} \exp\left(v_s^{\top} A\, v_t + f_{st}^{\top} h\right)$
  $Z_s = \sum_{t \in T(s)} \exp\left(v_s^{\top} A\, v_t + f_{st}^{\top} h\right)$

• v_s, v_t: vectors of monolingual features (word embeddings)
• f_st: a vector of bilingual features (triangulated scores)
• A, h: parameters to be learned

• In this work, f_st := ŵst, so that the matrix A is a linear transformation between the source and target embedding spaces, and h (now a scalar) quantifies how much the triangulated scores ŵ are to be trusted
• For normalization: in the factor Z_s, t ranges only over the possible translations of s suggested by either w^sup or the triangulated word probabilities, that is:

  $T(s) = \{t \mid w^{\sup}(t \mid s) > 0 \;\vee\; \hat{w}(t \mid s) > 0\}$

  This restriction makes efficient computation possible; otherwise the normalization term would have to be computed over the entire target vocabulary
➢ Under this parameterization, our goal is to solve the following maximization problem (a sketch of q follows below):

  $\max_{A,h} L(A, h) = \max_{A,h} \sum_{s,t} c^{\sup}_{st} \log q(t \mid s)$  (4)
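A numpy sketch of the parameterized q(t | s) (names are mine; V_src and V_tgt map words to embedding vectors, and w_hat holds the triangulated scores, so f_st is a single triangulated score as in the paper):

```python
import numpy as np

def q_dist(s, candidates, V_src, V_tgt, w_hat, A, h):
    """q(t|s) = exp(v_s^T A v_t + f_st * h) / Z_s, with t restricted
    to candidates = T(s), the translations suggested by w_sup or the
    triangulated probabilities."""
    v_s = V_src[s]
    scores = np.array([v_s @ A @ V_tgt[t] + w_hat.get((s, t), 0.0) * h
                       for t in candidates])
    scores -= scores.max()          # stabilize the softmax numerically
    p = np.exp(scores)
    return dict(zip(candidates, p / p.sum()))
```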
3.2 Optimization

• The objective function in Eq. 4 is concave in both A and h: after taking the log, we are left with a weighted sum of linear and concave (negative log-sum-exp) terms in A and h
➢ We can therefore reach the global solution of the problem using gradient descent
• Taking derivatives, the gradient is

  $\frac{\partial L}{\partial A} = \sum_{s,t} m_{st}\, v_s v_t^{\top} \qquad \frac{\partial L}{\partial h} = \sum_{s,t} m_{st}\, f_{st}$

  where the scalar $m_{st} = c^{\sup}_{st} - c^{\sup}_{s}\, q(t \mid s)$ for the current value of q
• For quick results, the authors limited the number of gradient steps to 200 and selected the iteration that minimized the total variation distance to w^sup over a held-out dev set:

  $\sum_s \left\| q(\cdot \mid s) - w^{\sup}(\cdot \mid s) \right\|_1$  (5)

• A better convergence rate was obtained by using a batch version of the effective and easy-to-implement Adagrad technique (Duchi et al., 2011); see Figure 1 (a sketch of the updates follows below)
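A sketch of the batch updates under these formulas (standard Adagrad ascent, not code released with the paper; eta and eps are hypothetical hyperparameters). Note that m_st is nonzero even for unobserved pairs, so the sums run over all candidate pairs (s, t) with t in T(s):

```python
import numpy as np

def gradients(candidate_pairs, c_st, c_s, q, V_src, V_tgt, w_hat):
    """dL/dA = sum m_st v_s v_t^T and dL/dh = sum m_st f_st, with
    m_st = c_sup_st - c_sup_s * q(t|s) and f_st = w_hat(t|s).
    q must cover every candidate pair (see q_dist above)."""
    ds = len(next(iter(V_src.values())))
    dt = len(next(iter(V_tgt.values())))
    grad_A, grad_h = np.zeros((ds, dt)), 0.0
    for (s, t) in candidate_pairs:
        m = c_st.get((s, t), 0.0) - c_s[s] * q[(s, t)]
        grad_A += m * np.outer(V_src[s], V_tgt[t])
        grad_h += m * w_hat.get((s, t), 0.0)
    return grad_A, grad_h

def adagrad_step(A, h, grad_A, grad_h, G_A, G_h, eta=0.1, eps=1e-8):
    """One Adagrad (Duchi et al., 2011) update: per-coordinate learning
    rates from accumulated squared gradients; ascent, since L is maximized."""
    G_A += grad_A ** 2
    G_h += grad_h ** 2
    A += eta * grad_A / (np.sqrt(G_A) + eps)
    h += eta * grad_h / (np.sqrt(G_h) + eps)
    return A, h, G_A, G_h

def total_variation(q_s, w_s):
    """Summand of Eq. 5 for one source word s: ||q(.|s) - w_sup(.|s)||_1."""
    return sum(abs(q_s.get(t, 0.0) - w_s.get(t, 0.0))
               for t in set(q_s) | set(w_s))
```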
3.2 Optimization (cont'd)

[Figure 1: The (target-to-source) objective function per iteration. Applying batch Adagrad (blue) significantly accelerates convergence.]
3.3 Re-estimating lexical weights

• Having learned the model (A and h), we can now use q(t | s) to estimate the lexical weights (Eq. 2) of any aligned phrase pair (s, t, â), assuming it is composed of embeddable words
• However, the authors found the supervised word translation scores q to be too sharp, sometimes assigning all probability mass to a single target word
➢ They therefore interpolated q with the triangulated word translation scores ŵ:

  $q_{\beta} = \beta\, q + (1 - \beta)\, \hat{w}$  (6)

• To integrate the lexical weights induced by q_β (Eq. 2), they simply appended them as new features in the phrase table, in addition to the existing lexical weights; following this, they tuned the β value that maximizes BLEU on a tune set (a sketch follows below)
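Eq. 6 is a convex combination per word pair; a sketch for one source word (β = 0.95 is the value later fixed for Spanish-French). The smoothed q_β then feeds the same lexical-weight computation sketched in §2.1:

```python
def smooth(q_s, w_hat_s, beta=0.95):
    """Eq. 6 for one source word s:
    q_beta(.|s) = beta * q(.|s) + (1 - beta) * w_hat(.|s)."""
    ts = set(q_s) | set(w_hat_s)
    return {t: beta * q_s.get(t, 0.0) + (1 - beta) * w_hat_s.get(t, 0.0)
            for t in ts}
```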
3.4 Summary of method

In summary, to improve upon a triangulated or interpolated phrase table, the authors:

1. Learn word translation distributions q by supervision against distributions w^sup derived from the source-target bilingual data (§3.1)
2. Smooth the learned distributions q by interpolating with the triangulated word translation scores ŵ (§3.3)
3. Compute new lexical weights and append them to the phrase table (§3.3)
4. Experiments

• To test the proposed method, the authors conducted two low-resource translation experiments using the Moses phrase-based MT system (Koehn et al., 2007)

Translation Tasks:
• Fixing the pivot language to English, they applied the method in two data scenarios:
1. Spanish-to-French: two related languages, used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3)
2. Malagasy-to-French: two unrelated languages for which they have a small dictionary, but no parallel corpus (aside from tuning and testing data). The baseline is triangulation alone, as there is no source-target model to interpolate with
4.1 Data

Datasets:
• European-language bitexts were extracted from Europarl (Koehn, 2005)
• For Malagasy-English, the Global Voices parallel data available online was used
• The Malagasy-French dictionary was extracted from online resources, and the small Malagasy-French tune/test sets were extracted from Global Voices

Table 1: Bilingual datasets, in lines of data (sp=Spanish, fr=French, en=English, mg=Malagasy):

  language pair   train   tune   test
  sp-fr           4k      1.5k   1.5k
  mg-fr           1.1k    1.2k   1.2k
  sp-en           50k     –      –
  mg-en           100k    –      –
  en-fr           50k     –      –

Table 2: Size of the monolingual corpus per language, in number of tokens:

  language   tokens
  French     1.5G
  Spanish    1.4G
  Malagasy   58M

• word2vec was used to generate the word embeddings from the monolingual data
4.2 Spanish-French Results

• To produce w^sup, the authors aligned the small Spanish-French parallel corpus in both directions and symmetrized using the intersection heuristic, to obtain high-precision alignments (the often-used grow-diag-final-and heuristic is optimized for phrase extraction, not precision)
• The skip-gram model was used to estimate the Spanish and French word embeddings, with dimension d = 200 and context window w = 5 (the defaults); an embedding-training sketch follows below. Source and target words that either did not appear in the triangulation or did not have an embedding were filtered out
• Words that appeared more than 10 times in the parallel corpus formed the training set (~690 words), and words that appeared 5–9 times formed the held-out dev set (~530 words); this was done in both source-target and target-source directions
• They fixed β := 0.95 to examine the effect of the supervised method

Table 3: Average total variation distance (Eq. 5) to the dev set portion of w^sup, computed only over words whose translations in w^sup appear in the triangulation. Using word embeddings, the method generalizes better on the dev set:

  Method          source→target   target→source
  triangulation   71.6%           72.0%
  our scores      30.2%           33.8%

Table 4: Spanish-French BLEU scores. Appending lexical weights obtained by supervision over a small source-target corpus significantly outperforms phrase table interpolation (Eq. 3) by +0.7 BLEU:

  Method                     α     tune   test
  source-target              –     26.8   25.3
  triangulation              –     29.2   28.4
  interpolation              0.7   30.2   29.2
  interpolation+our scores   0.6   30.8   29.9
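A sketch of the embedding training under the stated settings, assuming gensim's word2vec implementation (the paper only says word2vec with these hyperparameters; the corpus iterator is a placeholder):

```python
from gensim.models import Word2Vec

# sentences: an iterable over tokenized monolingual sentences (placeholder)
model = Word2Vec(sentences,
                 vector_size=200,  # d = 200 (named `size` in gensim < 4.0)
                 window=5,         # context window w = 5
                 sg=1,             # skip-gram rather than CBOW
                 min_count=5)
vec = model.wv["palabra"]          # a 200-dim vector, if the word occurred
```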
4.3 Malagasy-French Results

• The w^sup distributions used for supervision were taken to be uniform distributions over the dictionary translations
  • For each training direction, a 70%/30% split of the dictionary formed the train and dev sets
• Having significantly less Malagasy monolingual data, the skip-gram model was trained with d = 100 dimensional embeddings and a w = 3 context window for both Malagasy and French
• As before, the supervised lexical weights were added as new features in the phrase table; however, instead of fixing β = 0.95 as above, β ∈ {0.9, 0.8, 0.7, 0.6} was searched in Eq. 6 to maximize BLEU on a small tune set
• Using only a dictionary, they improve over triangulation by +0.5 BLEU, a statistically significant difference (p < 0.01)

Table 5: Malagasy-French BLEU. Supervision with a dictionary significantly improves upon simple triangulation by +0.5 BLEU:

  Method                     β     tune   test
  triangulation              –     12.2   11.1
  triangulation+our scores   0.6   12.4   11.6
5. Conclusion

In this paper:
• The authors argued that constructing a triangulated phrase table independently of even very limited source-target data underutilizes that parallel data
➢ They designed a supervised learning algorithm that relies on word translation distributions derived from the parallel data, as well as on a distributed representation of words (embeddings)
➢ The latter enables the algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data
• The model with the new lexical weights demonstrates improvements in MT quality on two tasks, despite the fact that w^sup was estimated automatically, or even naïvely as uniform distributions
6. Impression
End Slide

 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

[Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

  • 1. MT Study Group: Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages. Tomer Levinboim and David Chiang. Proc. of EMNLP 2015, Lisbon, Portugal. Introduced by Akiva Miura, AHC-Lab, IS, NAIST (15/10/15).
  • 2. Contents
    1. Introduction
    2. Preliminaries
    3. Supervised Word Translations
    4. Experiments
    5. Conclusion
    6. Impression
  • 3. 1. Introduction
    Problem: scarceness of bilingual data
    - PBMT systems require considerable amounts of source-target parallel data to produce good-quality translations.
    → A triangulated source-target phrase table can be composed from a source-pivot and a pivot-target phrase table, but it is still noisy.
    - This paper presents a supervised learning technique that improves noisy phrase translation scores by extracting word translation distributions from small amounts of bilingual data.
    → The method yields improvements on Malagasy-to-French and Spanish-to-French translation tasks pivoting through English.
  • 4. 2. Preliminaries
    Notation:
    - $s, p, t$: words in the source, pivot, and target languages, respectively
    - $\mathbf{s}, \mathbf{p}, \mathbf{t}$: phrases in the source, pivot, and target languages, respectively
    - $T$: a phrase table estimated over a parallel corpus
    - $\hat{T}$: a triangulated phrase table
    - $\phi$: phrase translation features
    - $\mathrm{lex}$: lexical-weighting features
    - $w$: word translation probabilities
    (Similar notation is used for the triangulated counterparts $\hat{\phi}$, $\widehat{\mathrm{lex}}$, $\hat{w}$.)
  • 5. 2.1 Triangulation (weak baseline)
    - A source-target phrase table $T_{st}$ is constructed by combining a source-pivot and a pivot-target phrase table $T_{sp}$, $T_{pt}$, each estimated on its respective parallel data.
    - For each resulting phrase pair $(\mathbf{s}, \mathbf{t})$, an alignment $\hat{a}$ is also computed, as the most frequent alignment obtained by combining the source-pivot and pivot-target alignments $a_{sp}$, $a_{pt}$ across all pivot phrases $\mathbf{p}$:
      $\hat{a} = \{(s, t) \mid \exists p : (s, p) \in a_{sp} \wedge (p, t) \in a_{pt}\}$
    - Lexical weighting is approximated in two steps. First, word translation scores $\hat{w}_{st}$ are obtained by marginalizing over the pivot words:
      $\hat{w}_{st}(t \mid s) = \sum_p w_{sp}(p \mid s) \cdot w_{pt}(t \mid p) \quad (1)$
    - Next, given a (triangulated) phrase pair $(\mathbf{s}, \mathbf{t})$ with alignment $\hat{a}$, let $\hat{a}_{s,:} = \{t \mid (s, t) \in \hat{a}\}$; the lexical-weighting probability is (Koehn et al., 2003):
      $\widehat{\mathrm{lex}}_{st}(\mathbf{t} \mid \mathbf{s}, \hat{a}) = \prod_{s \in \mathbf{s}} \frac{1}{|\hat{a}_{s,:}|} \sum_{t \in \hat{a}_{s,:}} \hat{w}_{st}(t \mid s) \quad (2)$
    - The triangulated phrase translation scores $\hat{\phi}_{st}$ are computed by analogy with Eq. 1.
    - The same scores are also computed in the reverse direction by swapping the source and target languages.
    (A minimal code sketch of Eqs. 1 and 2 follows this slide.)
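To make Eqs. 1 and 2 concrete, here is a minimal Python sketch, assuming the word translation tables are nested dicts mapping each word to its translation distribution; all function and variable names are hypothetical, not from the authors' code.

```python
from collections import defaultdict

def triangulate_word_scores(w_sp, w_pt):
    """Eq. 1: w_hat(t | s) = sum_p w_sp(p | s) * w_pt(t | p)."""
    w_hat = defaultdict(dict)
    for s, pivots in w_sp.items():
        for p, prob_sp in pivots.items():
            for t, prob_pt in w_pt.get(p, {}).items():
                w_hat[s][t] = w_hat[s].get(t, 0.0) + prob_sp * prob_pt
    return w_hat

def lex_weight(src_words, tgt_words, alignment, w_hat):
    """Eq. 2: lexical weighting for one aligned phrase pair.
    `alignment` is a set of (i, j) word-index pairs into the two phrases."""
    score = 1.0
    for i, s in enumerate(src_words):
        aligned = [j for (i2, j) in alignment if i2 == i]
        if not aligned:
            continue  # Koehn et al. pair unaligned words with NULL; omitted in this sketch
        score *= sum(w_hat.get(s, {}).get(tgt_words[j], 0.0)
                     for j in aligned) / len(aligned)
    return score
```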
  • 6. 2.2 Interpolation (strong baseline)
    - Given access to source-target data, an ordinary source-target phrase table $T_{st}$ can be estimated directly.
    - Wu and Wang (2007) suggest interpolating the entries of phrase pairs that occur in both tables (sketched below):
      $T_{\mathrm{interp}} = \alpha T_{st} + (1 - \alpha) \hat{T}_{st} \quad (3)$
    - Phrase pairs appearing in only one phrase table are added as-is. The result is referred to as the interpolated phrase table.
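A minimal sketch of the interpolation in Eq. 3, assuming each phrase table is flattened to a dict from (source phrase, target phrase) pairs to a single feature value; names are hypothetical.

```python
def interpolate_tables(T_st, T_hat_st, alpha=0.7):
    """Eq. 3: entry-wise interpolation of two phrase tables."""
    merged = {}
    for pair in set(T_st) | set(T_hat_st):
        if pair in T_st and pair in T_hat_st:
            merged[pair] = alpha * T_st[pair] + (1 - alpha) * T_hat_st[pair]
        else:
            merged[pair] = T_st.get(pair, T_hat_st.get(pair))  # added as-is
    return merged
```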
  • 7. 3. Supervised Word Translation
    - The effect of interpolation (Eq. 3) is limited to phrase pairs appearing in both phrase tables.
    - The idea of this paper is to regard word translation distributions derived from source-target bilingual data (through word alignments or dictionary entries) as the correct translation distributions, and to use them for discriminative learning:
      • correct target words should become likely translations
      • incorrect ones should be down-weighted
    → To generalize beyond the vocabulary of the source-target data, the authors appeal to word embeddings.
  • 8. 3.1 Model
    Definitions:
    - $c^{\mathrm{sup}}_{st}$: the number of times source word $s$ was aligned to target word $t$ (in the word alignment, or in the dictionary)
    - $w^{\mathrm{sup}}(t \mid s) = c^{\mathrm{sup}}_{st} / c^{\mathrm{sup}}_s$, where $c^{\mathrm{sup}}_s = \sum_t c^{\mathrm{sup}}_{st}$: the supervised word translation distributions (see the sketch after this slide)
    - $q(t \mid s)$: the word translation probabilities we wish to learn
    - We consider maximizing the log-likelihood function:
      $\arg\max_q L(q) = \arg\max_q \sum_{(s,t)} c^{\mathrm{sup}}_{st} \log q(t \mid s)$
    - Clearly, the solution $q(\cdot \mid s) := w^{\mathrm{sup}}(\cdot \mid s)$ maximizes $L$.
    → However, we would like a solution that generalizes to source words $s$ beyond those observed in the source-target corpus, in particular those that appear in the triangulated phrase table $\hat{T}$ but not in $T$.
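A short sketch of how $w^{\mathrm{sup}}$ would be derived from alignment (or dictionary) counts, assuming `counts[s][t]` holds $c^{\mathrm{sup}}_{st}$; names are hypothetical.

```python
def supervised_distributions(counts):
    """w_sup(t | s) = c_st / c_s, where counts[s][t] is the number of
    times source word s was aligned to target word t (or a dictionary hit)."""
    w_sup = {}
    for s, row in counts.items():
        total = sum(row.values())
        w_sup[s] = {t: c / total for t, c in row.items()}
    return w_sup
```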
  • 9. 3.1 Model (cont'd)
    - In order to generalize, we abstract from words to vector representations of words.
    → We constrain $q$ to the following parameterization (sketched below):
      $q(t \mid s) = \frac{1}{Z_s} \exp\left(v_s^\top A v_t + f_{st}^\top h\right)$, where
      $Z_s = \sum_{t \in T(s)} \exp\left(v_s^\top A v_t + f_{st}^\top h\right)$
    - $v_s, v_t$: vectors of monolingual features (word embeddings)
    - $f_{st}$: a vector of bilingual features; here it contains only the triangulated score, so $f_{st} := \hat{w}_{st}$
    - $A, h$: parameters to be learned ($A$ is a linear transformation between the source and target embedding spaces, and $h$, now a scalar, quantifies how much the triangulated scores $\hat{w}$ are to be trusted)
    - For normalization, $t$ ranges only over the possible translations of $s$ suggested by either $w^{\mathrm{sup}}$ or the triangulated word probabilities, which makes efficient computation possible:
      $T(s) = \{t \mid w^{\mathrm{sup}}(t \mid s) > 0 \vee \hat{w}(t \mid s) > 0\}$
    → Under this parameterization, the goal is to solve the following maximization problem:
      $\max_{A,h} L(A, h) = \max_{A,h} \sum_{s,t} c^{\mathrm{sup}}_{st} \log q(t \mid s) \quad (4)$
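A sketch of the constrained parameterization, assuming `v` maps words to NumPy embedding vectors and `f_hat` holds the triangulated scores $\hat{w}$; the max-subtraction is a standard softmax stabilization, not something the paper specifies, and all names are hypothetical.

```python
import numpy as np

def q_dist(s, cand_ts, v, A, h, f_hat):
    """q(t | s) over the candidate set T(s): a softmax over
    v_s^T A v_t + h * w_hat(t | s), f_st being the scalar triangulated score."""
    scores = np.array([v[s] @ A @ v[t] + h * f_hat.get(s, {}).get(t, 0.0)
                       for t in cand_ts])
    scores -= scores.max()          # numerical stabilization of the softmax
    expd = np.exp(scores)
    return dict(zip(cand_ts, expd / expd.sum()))
```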
  • 10. 3.2 Optimization
    - The objective function in Eq. 4 is concave in both $A$ and $h$: after taking the log, we are left with a weighted sum of linear and concave (negative log-sum-exp) terms in $A$ and $h$.
    → We can therefore reach the global solution of the problem using gradient descent (a sketch follows this slide).
    - Taking derivatives, the gradient is
      $\frac{\partial L}{\partial A} = \sum_{s,t} m_{st} \, v_s v_t^\top$, $\quad \frac{\partial L}{\partial h} = \sum_{s,t} m_{st} \, f_{st}$,
      where the scalar $m_{st} = c^{\mathrm{sup}}_{st} - c^{\mathrm{sup}}_s \, q(t \mid s)$ for the current value of $q$.
    - For quick results, the authors limited the number of gradient steps to 200 and selected the iteration that minimized the total variation distance to $w^{\mathrm{sup}}$ over a held-out dev set:
      $\sum_s \| q(\cdot \mid s) - w^{\mathrm{sup}}(\cdot \mid s) \|_1 \quad (5)$
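A sketch of one batch ascent step and the Eq. 5 selection criterion, reusing the hypothetical `q_dist` from the previous sketch; the plain learning rate here stands in for the batch Adagrad update the authors actually report using.

```python
import numpy as np

def gradient_step(counts, v, A, h, f_hat, lr=0.1):
    """One batch ascent step on Eq. 4, with m_st = c_sup_st - c_sup_s * q(t | s)."""
    grad_A, grad_h = np.zeros_like(A), 0.0
    for s, row in counts.items():
        c_s = sum(row.values())
        cand = sorted(set(row) | set(f_hat.get(s, {})))   # the candidate set T(s)
        q = q_dist(s, cand, v, A, h, f_hat)
        for t in cand:
            m = row.get(t, 0) - c_s * q[t]
            grad_A += m * np.outer(v[s], v[t])
            grad_h += m * f_hat.get(s, {}).get(t, 0.0)
    return A + lr * grad_A, h + lr * grad_h               # plain step, not Adagrad

def tv_distance(q_all, w_sup):
    """Eq. 5: sum_s || q(. | s) - w_sup(. | s) ||_1 over a held-out dev set."""
    total = 0.0
    for s, ref in w_sup.items():
        support = set(ref) | set(q_all.get(s, {}))
        total += sum(abs(q_all.get(s, {}).get(t, 0.0) - ref.get(t, 0.0))
                     for t in support)
    return total
```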
  • 11. 3.2 Optimization (cont'd)
    [Figure 1: The (target-to-source) objective function per iteration. Applying batch Adagrad (blue) significantly accelerates convergence.]
  • 12. 3.3 Re-estimating lexical weights
    - Having learned the model ($A$ and $h$), we can now use $q(t \mid s)$ to estimate the lexical weights (Eq. 2) of any aligned phrase pair $(\mathbf{s}, \mathbf{t}, \hat{a})$, assuming it is composed of embeddable words.
    - However, the authors found the supervised word translation scores $q$ to be too sharp, sometimes assigning all probability mass to a single target word.
    → They therefore interpolated $q$ with the triangulated word translation scores (sketched below):
      $q_\beta = \beta q + (1 - \beta) \hat{w} \quad (6)$
    - To integrate the lexical weights induced by $q_\beta$ (Eq. 2), they simply appended them as new features in the phrase table, in addition to the existing lexical weights.
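A sketch of the Eq. 6 smoothing for a single source word; `q_s` and `w_hat_s` are hypothetical per-word distributions (target word to probability).

```python
def smooth(q_s, w_hat_s, beta=0.95):
    """Eq. 6: q_beta = beta * q + (1 - beta) * w_hat, for one source word."""
    support = set(q_s) | set(w_hat_s)
    return {t: beta * q_s.get(t, 0.0) + (1 - beta) * w_hat_s.get(t, 0.0)
            for t in support}
```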
  • 13. 3.4 Summary of method
    In summary, to improve upon a triangulated or interpolated phrase table, the authors:
    1. Learn word translation distributions $q$ by supervision against distributions $w^{\mathrm{sup}}$ derived from the source-target bilingual data (§3.1).
    2. Smooth the learned distributions $q$ by interpolating with the triangulated word translation scores $\hat{w}$ (§3.3).
    3. Compute new lexical weights and append them to the phrase table (§3.3).
  • 14. 4. Experiments
    - To test the proposed method, the authors conducted two low-resource translation experiments using the Moses phrase-based MT system.
    Translation tasks: fixing the pivot language to English, they applied the method in two data scenarios:
    1. Spanish-to-French: two related languages, used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3).
    2. Malagasy-to-French: two unrelated languages for which they have a small dictionary but no parallel corpus (aside from tuning and testing data). The baseline is triangulation alone, since there is no source-target model to interpolate with.
  • 15. 4.1 Data
    Datasets:
    - European-language bitexts were extracted from Europarl (Koehn, 2005).
    - For Malagasy-English, the Global Voices parallel data available online was used.
    - The Malagasy-French dictionary was extracted from online resources, and the small Malagasy-French tune/test sets were extracted from Global Voices.

    Table 1: Bilingual datasets (lines of data). Legend: sp=Spanish, fr=French, en=English, mg=Malagasy.

    language pair | train | tune | test
    sp-fr         | 4k    | 1.5k | 1.5k
    mg-fr         | 1.1k  | 1.2k | 1.2k
    sp-en         | 50k   | –    | –
    mg-en         | 100k  | –    | –
    en-fr         | 50k   | –    | –

    Table 2: Size of the monolingual corpus per language, in number of tokens: French 1.5G, Spanish 1.4G, Malagasy 58M. word2vec was used to generate the word embeddings.
  • 16. 4.2 Spanish-French Results
    - To produce $w^{\mathrm{sup}}$, the authors aligned the small Spanish-French parallel corpus in both directions and symmetrized using the intersection heuristic, to obtain high-precision alignments (the often-used grow-diag-final-and heuristic is optimized for phrase extraction, not precision).
    - The skip-gram model was used to estimate the Spanish and French word embeddings, with dimension d = 200 and context window w = 5 (default); see the sketch after this slide.
    - They took words that appeared more than 10 times in the parallel corpus for the training set (~690 words), and between 5 and 9 times for the held-out dev set (~530 words), in both source-target and target-source directions.
    - They fixed β := 0.95 to examine the effect of the supervised method.

    Table 3: Average total variation distance (Eq. 5) to the dev-set portion of $w^{\mathrm{sup}}$ (computed only over words whose translations in $w^{\mathrm{sup}}$ appear in the triangulation). Using word embeddings, the method generalizes better on the dev set.

    Method        | source→target | target→source
    triangulation | 71.6%         | 72.0%
    our scores    | 30.2%         | 33.8%

    Table 4: Spanish-French BLEU scores. Appending lexical weights obtained by supervision over a small source-target corpus significantly outperforms phrase table interpolation (Eq. 3) by +0.7 BLEU.

    Method                   | α   | tune | test
    source-target            | –   | 26.8 | 25.3
    triangulation            | –   | 29.2 | 28.4
    interpolation            | 0.7 | 30.2 | 29.2
    interpolation+our scores | 0.6 | 30.8 | 29.9
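The authors ran the word2vec tool itself; as an assumption, an equivalent skip-gram setup in gensim (not what they actually ran) might look like this, with the corpus path and min_count being placeholders.

```python
from gensim.models import Word2Vec

# Hypothetical corpus path; `sentences` must be an iterable of token lists.
sentences = [line.split() for line in open("mono.fr")]

# sg=1 selects skip-gram; vector_size/window follow the slide (d = 200, w = 5).
# min_count=5 is a placeholder; parameter names are those of gensim >= 4.
model = Word2Vec(sentences, sg=1, vector_size=200, window=5, min_count=5)
embeddings = {w: model.wv[w] for w in model.wv.index_to_key}  # word -> vector
```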
  • 17. 4.3 Malagasy-French Results
    - The $w^{\mathrm{sup}}$ distributions used for supervision were taken to be uniform distributions over the dictionary translations.
      • For each training direction, a 70%/30% split of the dictionary formed the train and dev sets.
    - Having significantly less Malagasy monolingual data, they used d = 100 dimensional embeddings and a w = 3 context window to estimate both the Malagasy and French word embeddings.
    - As before, the supervised lexical weights were added as new features in the phrase table; instead of fixing β = 0.95 as above, they searched over β ∈ {0.9, 0.8, 0.7, 0.6} in Eq. 6 to maximize BLEU on a small tune set.
    - Using only a dictionary, they improve over triangulation by +0.5 BLEU, a statistically significant difference (p < 0.01).

    Table 5: Malagasy-French BLEU. Supervision with a dictionary significantly improves upon simple triangulation by +0.5 BLEU.

    Method                   | β   | tune | test
    triangulation            | –   | 12.2 | 11.1
    triangulation+our scores | 0.6 | 12.4 | 11.6
  • 18. 5. Conclusion
    In this paper:
    - The authors argue that constructing a triangulated phrase table independently of even very limited source-target data underutilizes that parallel data.
    → They design a supervised learning algorithm that relies on word translation distributions derived from the parallel data, as well as a distributed representation of words (embeddings).
    → The latter enables the algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data.
    - The model with the newly generated lexical weights demonstrates improvements in MT quality on two tasks, despite the fact that the $w^{\mathrm{sup}}$ distributions were estimated automatically, or even naively as uniform distributions.
  • 19. 6. Impression
  • 20. End Slide