Efficient Attention using a Fixed-Size
Memory Representation
Denny Britz, Melody Y. Guan and Minh-Thang
Luong
11/22 @EMNLP2017 Reading
Reader: M1 Hayahide Yamagishi
Introduction
● Most encoder-decoder architectures are equipped with an attention mechanism.
● However, human translators do not reread previously translated source words (as shown by eye-tracking studies).
● They believe “it may be unnecessary to look back at the
entire original source sequence at each step.”
● They proposed an alternative attention mechanism
○ Smaller computational time
Attention [Bahdanau+ ICLR2015, Luong+ EMNLP2015]
● The attention mechanism produces a context vector c at each decoding step.
○ s: encoder state, h: decoder state
○ c_i = Σ_j α_ij s_j, where α_ij = softmax_j(score(h_i, s_j))
● Computational time: O(D²|S||T|)
○ D: state size of the encoder and decoder
○ |S| and |T| represent the lengths of the source and target, respectively.
○ With Luong's dot attention, the computational time is O(D|S||T|)
○ Luong's dot attention score: h_i^T s_j (a minimal sketch follows)
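As a reference point, here is a minimal NumPy sketch (not the authors' code) of one decoding step of Luong-style dot attention; the score computation touches every encoder state at every target step, which is where the O(D|S||T|) term comes from.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_attention_step(h_t, S):
    """One decoding step of Luong-style dot attention.

    h_t: decoder state, shape (D,)
    S:   encoder states, shape (|S|, D)
    Returns the context vector c_t, shape (D,).
    """
    scores = S @ h_t         # |S| dot products of size D -> O(D|S|) per target step
    alpha = softmax(scores)  # attention weights over all source positions
    return alpha @ S         # weighted sum of encoder states

# Toy usage with D=4, |S|=6; repeated |T| times during decoding -> O(D|S||T|) overall.
rng = np.random.default_rng(0)
S = rng.standard_normal((6, 4))
h_t = rng.standard_normal(4)
c_t = dot_attention_step(h_t, S)
```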
Memory-Based Attention Model (Proposed method)
● During encoding, they computed an attention matrix C.
○ size of C and W: K×D
○ K: the number of attention vectors
○ computational time: O(KD|S|)
● C is regarded as a compact, fixed-size memory (sketched below).
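A minimal sketch of the encoding step, assuming each of the K rows of W scores the encoder states by a dot product (the slides give the shapes K×D for C and W but not the exact score function); the softmax over source positions is used here purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def build_memory(S, W):
    """Compress the encoder states into a K x D memory matrix C.

    S: encoder states, shape (|S|, D)
    W: per-slot scoring parameters, shape (K, D)  (dot-product scoring is an assumption)
    """
    scores = W @ S.T                 # (K, |S|): K*|S| dot products of size D -> O(KD|S|)
    alpha = softmax(scores, axis=1)  # each memory slot attends over the source positions
    return alpha @ S                 # (K, D): each row is a weighted sum of encoder states

rng = np.random.default_rng(0)
S = rng.standard_normal((50, 8))  # |S| = 50, D = 8
W = rng.standard_normal((4, 8))   # K = 4 memory slots
C = build_memory(S, W)            # fixed-size memory: its shape does not depend on |S|
```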
Memory-Based Attention Model (Proposed method)
● During decoding, they computed the context vector c.
○ They used C for computing the attention instead of the encoder states.
● Total computational time: O(KD(|S| + |T|))
○ They expected their model to be faster than O(D²|S||T|)
○ For long sequences (|S| is large), this model will be faster than dot attention
● They used a sigmoid function instead of a softmax for calculating the attention scores (see the sketch below).
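A minimal sketch of the decoding step under the same assumptions: the decoder state is scored against the K memory rows with a sigmoid (unnormalized, as the slide notes) instead of a softmax, so each step costs O(KD). The comment includes a back-of-envelope comparison with illustrative numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_attention_step(h_t, C):
    """Context vector computed from the fixed-size memory C instead of the encoder states.

    h_t: decoder state, shape (D,)
    C:   memory matrix, shape (K, D)
    """
    scores = sigmoid(C @ h_t)  # K dot products of size D -> O(KD) per step; sigmoid, not softmax
    return scores @ C          # weighted sum of the K memory rows

# Back-of-envelope with D=512, K=32, |S|=|T|=35 (illustrative numbers, not from the paper):
#   general attention ~ D^2*|S||T|    = 512^2 * 35 * 35 ~ 3.2e8
#   memory attention  ~ K*D*(|S|+|T|) = 32*512 * 70     ~ 1.1e6
#   dot attention     ~ D*|S||T|      = 512 * 35 * 35   ~ 6.3e5  (memory wins only for longer |S|)
rng = np.random.default_rng(0)
C = rng.standard_normal((32, 512))
h_t = rng.standard_normal(512)
c_t = memory_attention_step(h_t, C)
```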
Position Encoding
● Calculating C does not otherwise depend on k, so nothing forces the K memory vectors to differ.
○ “we would hope for the model to learn to generate distinct attention contexts”
● They add position encodings (one possible form is sketched below)
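The slides do not show the exact form of the position encoding, so the sketch below only illustrates the idea with an assumed scheme: a fixed positional bias that pushes memory slot k toward source positions near k/(K-1), added to the scores before normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def positional_bias(K, src_len, sharpness=10.0):
    """Fixed bias peaking at a different relative source position for each memory slot.

    This is an illustrative assumption; the paper's actual position encoding may differ.
    """
    centers = np.linspace(0.0, 1.0, K)[:, None]          # preferred relative position per slot
    positions = np.linspace(0.0, 1.0, src_len)[None, :]  # relative position of each source token
    return -sharpness * (positions - centers) ** 2       # (K, src_len), largest near the center

def build_memory_with_pe(S, W, sharpness=10.0):
    scores = W @ S.T + positional_bias(W.shape[0], S.shape[0], sharpness)
    return softmax(scores, axis=1) @ S  # distinct slots now tend to cover distinct source regions

rng = np.random.default_rng(0)
S = rng.standard_normal((50, 8))
W = rng.standard_normal((4, 8))
C = build_memory_with_pe(S, W)
```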
Experiment 1: Toy copying (like NTM[Graves+ 2014])
● Copying a random sequence (a data-generation sketch follows this slide).
○ Length: from 0 to {10, 50, 100, 200}
● Vocabulary: 20
● 2-layer, bi-directional LSTM (256 units)
● Dropout: 0.2
● Train : test = 100,000 : 1,000
○ batch size: 128
○ They trained for 200,000 steps.
● GPU: NVIDIA Tesla K40m × 1
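A small sketch of how the toy copying data could be generated; the uniform sampling of lengths and symbols is an assumption, since the slides only state the vocabulary size (20) and the maximum lengths.

```python
import numpy as np

def make_copy_example(rng, vocab_size=20, max_len=100):
    """One toy copying example: the target is an exact copy of the source."""
    length = rng.integers(0, max_len + 1)                       # length sampled from 0..max_len
    source = rng.integers(0, vocab_size, size=length).tolist()  # random symbols from the vocabulary
    return source, list(source)                                 # (source, target)

rng = np.random.default_rng(0)
train = [make_copy_example(rng) for _ in range(100_000)]  # 100,000 training examples
test = [make_copy_example(rng) for _ in range(1_000)]     # 1,000 test examples
```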
Result
● The vanilla encoder-decoder performs poorly.
● The required K depends on the sequence length.
● The decoding speedup grows as the sequences get longer.
● “Traditional attention may be representing the source with
redundancy and wasting computational resources.”
Experiment 2: Neural Machine Translation
● WMT’17
○ English-Czech (52M sentences)
○ English-German (5.9M sentences)
○ English-Finnish (2.6M sentences)
○ English-Turkish (207K sentences)
○ Dev: newstest2015, Test: newstest2016 (except for en-tr)
○ Average length for the test data is 35.
● Hyperparameters
○ Vocabulary: 16,000 subwords (BPE)
○ hidden states: 512
○ Other hyperparameters are the same as in the copy experiment.
Result
Learning curves
Discussion
● “Our memory attention model performs on-par with, or
slightly better, than the baseline model”
● Position encoding improves model performance.
● For tasks where K << |T| (e.g., summarization), this model should perform well.
● Decoding time decreased.
Discussion
● The softmax/softmax combination (softmax scores at both encoding and decoding time) performs poorly.
Visualizing: each memory (no position encoding)
Visualizing: when K is small
Visualizing: while the model translates
Conclusion
● They proposed a memory-based attention mechanism.
● Their technique leads to a speedup.
● It can fit complex tasks such as NMT.
Impressions (translated from Japanese)
● Getting a speedup with little change in performance is a good thing.
● I wish they had compared against the local attention of [Luong+ EMNLP’15].
○ Isn’t this essentially precomputing something like local attention in advance?
○ Especially since Luong is a co-author.
● It is puzzling that the attention spreads out even without position encoding.
● I think this approach is better suited to summarization.
