2. Abstract
• Beam Search & Greedy Strategy
• Improvement vs. Cost
• Actor → decoder.hidden_state[t]
• Train: take K beam-search outputs and pick the argmax under a target quality
metric such as BLEU (a pseudo-parallel corpus built by the base model);
• No reinforcement learning; universal (architecture-agnostic);
• Experiments: 3 corpora & 3 architectures [Q↑ & S↑]
3. Intro
• Seq2seq: conditioned left-to-right generation
• Search space: infinite in principle, exponential in seq_len
• Greedy & Beam: beam ≈ +2 BLEU | +3 ROUGE over greedy
• Related: termination criterion & search function
• Train: ordinary backpropagation on a model-specific corpus
• Corpus: generated by running the un-augmented model on the training set with
large-beam search, and selecting outputs from the resulting k-best list that
score highly on the target metric.
• Evaluation:
• RNN-based (Luong et al., 2015), ConvS2S (Gehring et al., 2017) and Transformer (Vaswani et
al., 2017)
• IWSLT16 De-En, WMT15 Fi-En and WMT14 De-En
4. Background
• 2.1 NMT
• 2.2 Decoding
• Greedy (1); Beam (K)
• Noisy parallel approximate decoding (NPAD; Cho, 2016)
Noise → decoder.hidden_state[t]
[idea] Stay active, even at random! (like studying better at a home café)
• Trainable Greedy Decoding (Gu et al., 2017)
FFNN RL actor → decoder.hidden_state[t] (after all, at the lab!)
approximates the maximum-a-posteriori output → BLEU [unstable, unfortunately]
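The greedy (K = 1) vs. beam (K) decoding contrast in 2.2 can be sketched in Python; `toy_step` is a hypothetical per-step distribution standing in for a real NMT decoder, not anything from the paper:

```python
import math

def greedy_decode(step_probs, bos, eos, max_len=10):
    """Greedy decoding (K = 1): keep only the single best token per step."""
    seq = [bos]
    for _ in range(max_len):
        probs = step_probs(seq)                 # dict: token -> probability
        tok = max(probs, key=probs.get)
        seq.append(tok)
        if tok == eos:
            break
    return seq

def beam_decode(step_probs, bos, eos, k=3, max_len=10):
    """Beam search: keep the K partial hypotheses with highest log-probability."""
    beams = [([bos], 0.0)]                      # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = [(seq + [tok], lp + math.log(p))
                      for seq, lp in beams
                      for tok, p in step_probs(seq).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, lp in candidates:
            (finished if seq[-1] == eos else beams).append((seq, lp))
            if len(beams) == k:
                break
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]

def toy_step(seq):
    """Hypothetical per-step distribution (assumption, not the paper's model)."""
    if len(seq) >= 3:
        return {"</s>": 0.9, "a": 0.05, "b": 0.05}
    return {"a": 0.5, "b": 0.3, "</s>": 0.2}
```

With beam size K the decoder keeps K hypotheses per step instead of 1, which is where the +BLEU/+ROUGE gains over greedy come from.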
5. Method
• I/O: a_t = actor(h_t, e_t, s_{t-1}) ← decoder hidden / attention context / previous actor state
• Form of actor: ff (Eq. 5), ff2 (Eq. 6), GRU (Eq. 7), gated ff (Eq. 8) …
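The gated feedforward variant can be sketched with NumPy; the dimensions, weight names, and initialization below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                          # toy hidden size (assumption)

# Illustrative parameter shapes for a gated feedforward actor:
# the input is [h_t; e_t; s_prev], so both projections map 3d -> d.
W_a = rng.normal(scale=0.1, size=(d, 3 * d))   # action projection
W_g = rng.normal(scale=0.1, size=(d, 3 * d))   # gate projection

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def actor(h_t, e_t, s_prev):
    """Compute a bounded additive adjustment a_t to the decoder state h_t."""
    x = np.concatenate([h_t, e_t, s_prev])
    gate = sigmoid(W_g @ x)                    # per-dimension: how much to adjust
    a_t = gate * np.tanh(W_a @ x)              # |a_t[i]| < 1 by construction
    return h_t + a_t, a_t                      # adjusted state, and the action

h_t, e_t = rng.normal(size=d), rng.normal(size=d)
new_h, a_t = actor(h_t, e_t, np.zeros(d))
```

Because the action is gated and tanh-bounded, it nudges h_t rather than overwriting it, which is consistent with the small-L2-norm observation in slide 13.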
7. Training
• Pseudo-parallel corpus generated by a base model:
• High model likelihood (not necessarily the highest)
• High-quality translations (not necessarily the highest)
• Gen: K beam candidates (high internal likelihood) → argmax of the external score (target metric)
• Train the actor with pseudo-D and fixed base model
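The pseudo-parallel corpus construction above can be sketched as follows; `kbest_fn` stands in for large-beam search with the base model, and the unigram F1 here is a hypothetical stand-in for a sentence-level metric such as BLEU:

```python
def unigram_f1(candidate, reference):
    """Hypothetical stand-in for a sentence-level target metric like BLEU."""
    c, r = set(candidate), set(reference)
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def build_pseudo_corpus(sources, references, kbest_fn, metric=unigram_f1):
    """Silver-standard pairs: for each source, keep the k-best candidate
    (high model likelihood) that scores highest on the target metric."""
    corpus = []
    for src, ref in zip(sources, references):
        candidates = kbest_fn(src)             # k-best list, large-beam search
        best = max(candidates, key=lambda cand: metric(cand, ref))
        corpus.append((src, best))
    return corpus

# Toy usage (hypothetical k-best list; a real one comes from the base model):
kbest = lambda src: [["i", "is"], ["i", "am"], ["you", "are"]]
pseudo = build_pseudo_corpus([["ich", "bin"]], [["i", "am"]], kbest)
```

The actor is then trained on these (source, silver target) pairs by ordinary backpropagation while the base model stays frozen.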
8. Experiments
• 4.1 Settings
• IWSLT16, tst2013 (validation) and tst2014 (test)
• WMT15, newstest2013(validation) and newstest2015(test)
• WMT14, newstest2013(validation) and newstest2014(test)
• + BPE
• Evaluation:
• tokenized and cased BLEU (primary).
• METEOR and TER, multeval with tokenized and case-insensitive scoring.
• Base models are trained from scratch, except for ConvS2S WMT14 En-De
translation (trained model as well as training data) provided by Gehring et al. (2017).
• RNN: OpenNMT's default, Luong; (rnn, emb) sizes = (500, 500) and (600, 300)
• ConvS2S: IWSLT16 and WMT settings
• Transformer: Gu et al. (2018)
• (Pseudo-D beam k = 35)
12. Two Questions
• Two factors, actor & pseudo-D: which one matters?
• Is the silver standard a better choice than the gold one?
(The pseudo-D seems much kinder to the little driver/actor)
13. Impact
[Figure] Likelihood (conditional LM) and magnitude (L2 norm) of the action
vector over the training course, on the IWSLT16 De-En validation set with
Transformer.
This suggests that the action adjusts the decoder's hidden state slightly,
rather than overwriting it, enabling the model to find a sequence that is not
the most highly scored by the model but corresponds to a high value of the
target metric. (more confident)
14. Actor & Data
Yes, silver data is the best.
Even bronze is better than gold!
16. Impressions
• Modification of network:
• Elements & Structure (organs of a body)
• A little guy on the shoulder of a blind giant (Transformer).
• Contextual Parameter Generator: language embedding
• The power of data seems far greater than elaborate network engineering.
A little actor can make the giant more flexible.