[EMNLP 2017 Reading Group] Efficient Attention using a Fixed-Size Memory Representation
1. Efficient Attention using a Fixed-Size Memory Representation
Denny Britz, Melody Y. Guan, and Minh-Thang Luong
11/22 @EMNLP2017 Reading
Reader: M1 Hayahide Yamagishi
2. Introduction
● Most encoder-decoder architectures are equipped with an attention mechanism.
● However, human translators do not reread previously translated source words (as shown by eye-tracking studies).
● They believe “it may be unnecessary to look back at the
entire original source sequence at each step.”
● They proposed an alternative attention mechanism
○ Lower computational cost
4. Attention [Bahdanau+ ICLR2015, Luong+ EMNLP2015]
● The attention mechanism builds the context vectors c.
○ s: encoder states, h: decoder states
○ c_i = Σ_j α_ij s_j, where α_ij = softmax_j(score(h_i, s_j))
● Computational time: O(D^2 |S||T|)
○ D: state size of the encoder and decoder
○ |S| and |T| denote the lengths of the source and target, respectively.
○ With Luong's dot attention, the computational time is O(D|S||T|) (see the sketch below)
○ Luong's dot attention score: h_i^T s_j
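To make the cost concrete, here is a minimal NumPy sketch (not the authors' code) of Luong-style dot attention for a single decoder step: scoring every encoder state costs O(D|S|) per target step, hence O(D|S||T|) for the whole sequence.

import numpy as np

def dot_attention_step(h_i, S):
    """h_i: decoder state (D,); S: encoder states (|S|, D)."""
    scores = S @ h_i                       # h_i^T s_j for every source position j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # softmax over source positions
    return alpha @ S                       # context vector c_i, shape (D,)

h_i = np.random.randn(256)                 # D = 256
S = np.random.randn(40, 256)               # |S| = 40
c_i = dot_attention_step(h_i, S)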
5. Memory-Based Attention Model (Proposed method)
● During encoding, they computed an attention matrix C.
○ sizes of C and W: K×D
○ K: the number of attention vectors
○ computational time: O(KD|S|)
● C is regarded as a compact, fixed-size memory (sketch below).
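A hedged NumPy sketch of forming the K×D memory C from the encoder states, following this slide's description (W is the learned K×D scoring matrix); the paper's exact parameterization may differ.

import numpy as np

def build_memory(S, W):
    """S: encoder states (|S|, D); W: scoring matrix (K, D). Returns C (K, D)."""
    scores = W @ S.T                            # (K, |S|): one score row per memory slot
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # normalize over source positions
    return alpha @ S                            # each slot is a weighted sum of encoder states

S = np.random.randn(40, 256)    # |S| = 40, D = 256
W = np.random.randn(32, 256)    # K = 32 attention vectors
C = build_memory(S, W)          # O(KD|S|) work, independent of |T|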
6. Memory-Based Attention Model (Proposed method)
● During decoding, they computed the context vector c.
○ They used C, instead of the encoder states, for computing the attention.
● Total computational time: O(KD(|S| + |T|))
○ They expected their model to be faster than the O(D^2 |S||T|) baseline.
○ For long sequences (large |S|), this model will also be faster than dot attention.
● They used a sigmoid function instead of a softmax for calculating the attention scores (sketch below).
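A minimal sketch of one decoder step attending over the K×D memory C instead of the |S| encoder states; the dot-product score against the memory slots is an assumption, since the slides do not spell out the score function. Each step costs O(KD), so decoding costs O(KD|T|) and the total is O(KD(|S| + |T|)); for example, with D = 512, K = 32 and |S| = |T| = 35, that is roughly 1.1M operations versus about 320M for O(D^2 |S||T|).

import numpy as np

def memory_attention_step(h_i, C):
    """h_i: decoder state (D,); C: memory (K, D)."""
    scores = C @ h_i                         # one score per memory slot, O(KD)
    gates = 1.0 / (1.0 + np.exp(-scores))    # sigmoid instead of softmax, as on this slide
    return gates @ C                         # context vector, shape (D,)

C = np.random.randn(32, 256)    # K = 32, D = 256
h_i = np.random.randn(256)
c_i = memory_attention_step(h_i, C)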
7. Position Encoding
● The calculation of C is the same for every k, so nothing forces the K vectors to differ.
○ “we would hope for the model to learn to generate distinct attention contexts”
● They therefore add position encodings (sketch below).
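A hedged sketch of how position information can break the symmetry across the K memory slots: a slot- and position-dependent bias is added to the scores before normalization in the build_memory sketch above. The concrete encoding used here (a random bias matrix) is only illustrative, not the paper's exact formulation.

import numpy as np

def build_memory_with_positions(S, W, pos_bias):
    """S: (|S|, D); W: (K, D); pos_bias: (K, |S|) position encodings."""
    scores = W @ S.T + pos_bias              # distinct bias for each slot k and position s
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ S

S = np.random.randn(40, 256)
W = np.random.randn(32, 256)
pos_bias = np.random.randn(32, 40)           # illustrative only; the paper defines its own encoding
C = build_memory_with_positions(S, W, pos_bias)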
8. Experiment 1: Toy copying (like NTM [Graves+ 2014])
● The task is to copy random sequences (sketch below).
○ Lengths: 0 to {10, 50, 100, 200}
● Vocabulary size: 20
● 2-layer, bi-directional LSTM (256 units)
● Dropout: 0.2
● Train : test = 100,000 : 1,000
○ batch size: 128
○ They trained for 200,000 steps.
● 1 × NVIDIA K40m GPU
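A small sketch of the copying data described on this slide: random sequences over a 20-symbol vocabulary whose target is identical to the source, with 100,000 training and 1,000 test examples.

import random

def make_copy_example(max_len=50, vocab_size=20):
    length = random.randint(0, max_len)
    src = [random.randrange(vocab_size) for _ in range(length)]
    return src, list(src)           # the target is an exact copy of the source

train = [make_copy_example(max_len=50) for _ in range(100_000)]
test = [make_copy_example(max_len=50) for _ in range(1_000)]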
11. Result
● The vanilla encoder-decoder (without attention) performs poorly.
● The best choice of K depends on the sequence length.
● The decoding speedup grows as the sequence length increases.
● “Traditional attention may be representing the source with
redundancy and wasting computational resources.”
12. Experiment 2: Neural Machine Translation
● WMT’17
○ English-Czech (52M sentences)
○ English-German (5.9M sentences)
○ English-Finnish (2.6M sentences)
○ English-Turkish (207K sentences)
○ Dev: newstest2015, Test: newstest2016 (except for en-tr)
○ The average sentence length in the test data is 35.
● Hyperparameters
○ Vocabulary: 16,000 subwords (BPE)
○ Hidden state size: 512
○ Other parameters are the same as in the copying experiment (config sketch below).
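For reference, the hyperparameters on this slide (plus the shared settings from the copying experiment) restated as a plain Python dict; the key names are illustrative, not taken from the authors' code.

nmt_config = {
    "subword_vocab_size": 16_000,      # BPE subwords
    "hidden_size": 512,                # encoder/decoder state size
    "num_layers": 2,                   # carried over from the copying experiment
    "rnn_cell": "bidirectional LSTM",
    "dropout": 0.2,
    "batch_size": 128,
    "train_steps": 200_000,
    "dev_set": "newstest2015",
    "test_set": "newstest2016",
}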
15. Discussion
● “Our memory attention model performs on-par with, or slightly better than, the baseline model”
● Position encoding improves model performance.
● For tasks where K << |T| (e.g., summarization), this model is expected to perform well.
● Decoding time decreased.