Insertion Position Selection Model for Flexible Non-Terminals
in Dependency Tree-to-Tree Machine Translation

Toshiaki Nakazawa (Japan Science and Technology Agency, JST)
John Richardson, Sadao Kurohashi (Kyoto University)

4/11/2016 @ EMNLP 2016
Where to insert?

[Figure: the sentence "I found Pikachu by chance" with the floating word
"yesterday" and its candidate insertion positions; illustrative probabilities
over the positions: 0.7, 0.25, 0.02, 0.01, 0.01, 0.01.]
Where to insert?

[Figure: the sentence "I found Pikachu by chance yesterday" with the floating
phrase "in the park" and its candidate insertion positions; illustrative
probabilities: 0.2, 0.1, 0.6, 0.01, 0.01, 0.01, 0.1. Photo: @Texas State
Capitol.]
Dependency Tree-to-Tree Translation

[Figure: the Japanese input dependency tree 私 は 昨日 公園 で 偶然 ピカチュウ を
見つけた ("Yesterday I found Pikachu by chance in the park") is matched against
translation rules such as ピカチュウ → Pikachu, 偶然 → chance, 公園 → the park,
and 昨日 → yesterday; substituting the non-terminals ([X7]) builds the output
dependency tree rooted at "found".]
Dependency Tree-to-Tree Translation

Flexible Non-terminals [Richardson+, 2016]

[Figure: the same translation with flexible non-terminals [X]: floating
subtrees such as "yesterday" and "in the park" can attach at many insertion
positions in the output tree, producing several candidate outputs, e.g.
"I found Pikachu by chance yesterday in the park".]
Translation Quality and Decoding Speed w/ and w/o Flexible Non-terminals

• Using ASPEC (Asian Scientific Paper Excerpt Corpus) JE and JC
• Time is relative decoding time

              Ja->En        En->Ja        Ja->Zh        Zh->Ja
              BLEU   Time   BLEU   Time   BLEU   Time   BLEU   Time
  w/o Flex    20.28  1.00   28.77  1.00   24.85  1.00   30.51  1.00
  w/ Flex     21.61  6.28   30.57  3.30   28.79  5.16   34.32  5.28
Appropriate Insertion Position Selection

• Roughly half of all translation rules were augmented with flexible
  non-terminals [Richardson+, 2016]
• Flexible non-terminals make the search space much bigger -> slower decoding
  and increased search error
• Our idea: reduce the number of possible insertion positions in translation
  rules with a neural network model (a pruning sketch follows)
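A minimal sketch of this pruning step, assuming a generic score_fn (e.g. the
neural model described later in the deck); the function name and list-based
interface are illustrative, not from the paper:

    def prune_insertion_positions(rule_positions, score_fn, top_k=1):
        """Keep only the top-k scoring insertion positions of a rule.

        rule_positions: candidate positions allowed by the translation rule.
        score_fn: maps the candidate list to one score per candidate
        (hypothetical interface for illustration)."""
        scores = score_fn(rule_positions)
        ranked = sorted(range(len(rule_positions)),
                        key=lambda i: scores[i], reverse=True)
        return [rule_positions[i] for i in ranked[:top_k]]

Keeping only the top-scoring position(s) shrinks the decoder's search space,
which is where the speedup reported later comes from.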
INSERTION POSITION SELECTION MODEL
Insertion Position Selection Model

• For each candidate insertion position, predict a score, given:
  – input side: the floating word (I) and its parent word (Ps), with their
    distance (Ds)
  – target side: the previous (Sp) and next (Sn) sibling words of the
    insertion position and its parent (Pt), with the distance (Dt)

A sketch of these features as a record type follows.
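A minimal sketch of the seven features above as a Python record; plain strings
for words and integers for tree distances are an assumption for illustration:

    from dataclasses import dataclass

    @dataclass
    class InsertionFeatures:
        I: str    # floating word to be inserted
        Ps: str   # parent of I on the input (source) side
        Ds: int   # distance between I and Ps
        Sp: str   # previous sibling of the candidate position (target side)
        Sn: str   # next sibling of the candidate position
        Pt: str   # parent of the candidate position
        Dt: int   # distance between the candidate position and Pt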
Information for Selection Model

[Figure: in the running example the floating word is I = 昨日 [yesterday],
whose input-side parent is Ps = 見つけた [found] at distance Ds = 4; for one
candidate position [X] on the target side, Sp and Sn are the previous and next
sibling words and Pt = "found" is the parent, at distance Dt = -2.
Non-terminals are reverted to the original word in the parallel corpus.]
Information for Selection Model

[Figure: a second candidate position in the same example, with Dt = -3; a
missing sibling at this position is filled with the special [POST-BOTTOM]
token.]
Neural Network Model

[Figure: a fully-connected feed-forward network. Word embeddings (220-dim) for
I, Ps, Sp, Sn, and Pt and distance embeddings (100-dim) for Ds and Dt are
concatenated and fed through the network to produce one score per insertion
position (1 .. N). A softmax over the N scores (e.g. 0.1, 0.6, ..., 0.1) is
compared against the gold one-hot labels (e.g. 0, 1, ..., 0); the loss is the
softmax cross-entropy.]
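A minimal sketch of such a scorer in PyTorch (an assumption; the slides do not
name a toolkit). The 220/100 embedding dimensions follow the figure; the
hidden size, depth, and distance bucketing are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class InsertionPositionScorer(nn.Module):
        """Scores the candidate insertion positions of one floating word."""
        def __init__(self, vocab_size, num_dist_buckets,
                     word_dim=220, dist_dim=100, hidden_dim=256):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)   # I, Ps, Sp, Sn, Pt
            # Distances can be negative (e.g. Dt = -2), so they are assumed
            # to be mapped to non-negative bucket indices beforehand.
            self.dist_emb = nn.Embedding(num_dist_buckets, dist_dim)
            self.ff = nn.Sequential(
                nn.Linear(5 * word_dim + 2 * dist_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),  # one scalar score per candidate
            )

        def forward(self, I, Ps, Ds, Sp, Sn, Pt, Dt):
            # Each argument: LongTensor of shape (n_candidates,); I, Ps, Ds
            # are simply repeated across the candidates of one rule.
            x = torch.cat([self.word_emb(I), self.word_emb(Ps),
                           self.dist_emb(Ds), self.word_emb(Sp),
                           self.word_emb(Sn), self.word_emb(Pt),
                           self.dist_emb(Dt)], dim=-1)
            return self.ff(x).squeeze(-1)  # (n_candidates,) raw scores

    # Training, per the figure: softmax cross-entropy against the gold index.
    # scores = model(I, Ps, Ds, Sp, Sn, Pt, Dt)
    # loss = F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_index]))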
Training Data Creation

• Training data for the NN model can be created automatically from a
  word-aligned parallel corpus
  – treat each aligned target word as the floating word, remove it from the
    target tree, and label its original position as the gold insertion
    position
[Figure: for the target tree of "I found Pikachu by chance", one word (e.g.
"chance", aligned to 偶然) is removed; every possible re-insertion position
[X] is enumerated, and the position the word was removed from is labeled 1,
all others 0.]
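A minimal sketch of this extraction, assuming a hypothetical tree API
(position_of, remove_node, and insertion_positions are illustrative names,
not from the paper):

    def make_examples(tgt_tree, alignments):
        """One training example per aligned target word: remove the word,
        enumerate candidate re-insertion positions, mark the original
        position as gold (label 1) and all others as 0."""
        examples = []
        for src_word, tgt_word in alignments:
            gold = tgt_tree.position_of(tgt_word)        # hypothetical API
            pruned = tgt_tree.remove_node(tgt_word)      # hypothetical API
            candidates = pruned.insertion_positions()    # hypothetical API
            labels = [1 if pos == gold else 0 for pos in candidates]
            # Per-candidate features (I, Ps, Ds, Sp, Sn, Pt, Dt) would be
            # extracted from src/tgt trees here, as on the earlier slides.
            examples.append((src_word, tgt_word, candidates, labels))
        return examples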
EXPERIMENTS
Insertion Position Selection Experiment

• Parallel corpus: ASPEC-JE/JC (2M/680K sentences)
• Data size (the JE data is shared by Ja->En and En->Ja, the JC data by
  Ja->Zh and Zh->Ja):

                 Ja->En  En->Ja  Ja->Zh  Zh->Ja
    Training         15.7M           5.7M
    Development       160K            58K
    Test              160K            58K
    Ave. # IP     3.39    3.15    3.72    3.41

• Comparison: L2-regularized logistic regression (using Multi-core LIBLINEAR)
Experimental Results

                          Ja->En  En->Ja  Ja->Zh  Zh->Ja
    Training                  15.7M           5.7M
    Development                160K            58K
    Test                       160K            58K
    Ave. # IP              3.39    3.15    3.72    3.41
    Mean loss              0.089   0.058   0.105   0.056
    Top 1 Accuracy (%)     97.08   97.72   96.51   97.99
    Top 2 Accuracy (%)     98.94   99.52   98.97   99.56
    Logit Accuracy (%)     55.00   89.03   68.04   83.16

(Logit = the L2-regularized logistic regression baseline.)
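A minimal sketch of how the Top-1/Top-2 numbers above can be computed,
assuming all_scores holds the per-candidate model scores for each test
example and all_gold the index of each correct insertion position:

    def top_k_accuracy(all_scores, all_gold, k):
        """Percentage of examples whose gold position is among the k
        highest-scoring candidates."""
        hits = 0
        for scores, gold in zip(all_scores, all_gold):
            topk = sorted(range(len(scores)), key=lambda i: scores[i],
                          reverse=True)[:k]
            hits += gold in topk
        return 100.0 * hits / len(all_gold)

    # e.g. top_k_accuracy(test_scores, test_gold, k=1) -> "Top 1 Accuracy (%)"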
Translation Experiment
• Parallel corpus: ASPEC-JE/JC (2M/680K
sentences)
• Decoder: KyotoEBMT [Richardson+, 2014]
• 5 Settings
– Phrase-based and hierarchical phrase-based SMTs
– w/o Flex: not using flexible non-terminals
– w/ Flex: baseline with flexible non-terminals
– Prop: using insertion position selection (only top 1)
• BLEU and relative decoding time
Translation Experimental Results

              Ja->En        En->Ja        Ja->Zh        Zh->Ja
              BLEU   Time   BLEU   Time   BLEU   Time   BLEU   Time
    PBSMT     18.45  -      27.48  -      27.96  -      34.65  -
    HPBSMT    18.72  -      30.19  -      27.71  -      35.43  -
    w/o Flex  20.28  1.00   28.77  1.00   24.85  1.00   30.51  1.00
    w/ Flex   21.61  6.28   30.57  3.30   28.79  5.16   34.32  5.28
    Prop      22.07  2.25   30.50  1.27   29.83  2.21   34.71  1.89
Conclusion

• Proposed an insertion position selection model to reduce the number of
  insertion positions for flexible non-terminals in translation rules
• Both automatic evaluation scores and decoding speed are improved
Future Work

• Use grandchildren's information
  – Recursive NN [Liu et al., 2015] or Convolutional NN [Mou et al., 2015]
• Shift to NMT!!
  – Actually, we have already shifted and participated in the WAT2016 shared
    tasks
  – However, NMT is still far from perfect
J->E Adequacy in WAT2016

[Figure: stacked bar chart of human adequacy ratings (1-5) per system.
Average adequacy scores of 3.83, 3.76, and 3.71 were reported; the pairing of
adequacy scores to systems follows the original chart.]

    Team name                 BLEU
    Kyoto-U (NMT)             26.22
    NAIST/CMU (NMT)           26.39
    NAIST (2015 best, F2T)    25.41
Thank You!

Advertisement: I'm co-organizing the 3rd Workshop on Asian Translation
(WAT2016), in conjunction with COLING 2016.
Invited talk by Google about GNMT!
Please come to the workshop!
http://lotus.kuee.kyoto-u.ac.jp/WAT/
