Attention mechanism
in Neural Machine Translation
PHAM QUANG KHANG
Machine Translation task in NLP
1. Definition: translating text from one language into another
2. Evaluation datasets: public datasets consisting of sentence pairs in two languages (source
language and target language)
a. Most common in research: Eng-Fra, Eng-Ger
b. Vietnamese: Eng-Vi (133k sentence pairs)
[Figure: French-to-English translations from newstest2014 (Artetxe et al., 2018)]
Example from the Eng-to-Vi dataset in IWSLT 15 (133K sentence pairs):
Source: And of course, we all share the same adaptive imperatives.
Reference: Và tất nhiên, tất cả chúng ta đều trải qua quá trình phát triển và thích nghi như nhau.
BLEU score: standard evaluation for MT
1. Def: the BLEU score compares n-grams of the candidate with n-grams of the reference
translation and counts the number of matches (Papineni et al., 2002)
2. Calculation: modified (clipped) n-gram precision
p_n = Σ Count_clip(n-gram) / Σ Count(n-gram)
Example:
Candidate: the the the the the the the
Reference: The cat is on the mat
Count_clip(the) = 2, Count(the) = 7
precision = 2/7
Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation
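The clipped-precision calculation above can be sketched in a few lines of Python (the function name is ours, for illustration only; both sides are lowercased so that "The" and "the" match, as in Papineni et al.):

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision from BLEU (Papineni et al., 2002):
    each candidate n-gram count is clipped to its count in the reference."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())

cand = "the the the the the the the".split()
ref = "the cat is on the mat".split()
print(clipped_precision(cand, ref))  # 2/7 ≈ 0.2857
```

Full BLEU additionally combines precisions for n = 1..4 and applies a brevity penalty; this sketch covers only the unigram precision used in the example.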
NMT as a hot topic for research
 The number of published papers on NMT spiked last year
 Key players:
 Facebook: tackling low-resource languages (Turkish, Vietnamese …)
 Amazon: improving efficiency
 Google: improving NMT output quality
 Business need is higher than ever, since automatic translation can save global firms massive costs
Number of NMT papers in the last few years
Source: https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/
Papers counted from arXiv
Main approaches to NMT recently
1. Recurrent Neural Networks (RNNs):
 Using LSTMs, GRUs, and bidirectional RNNs for the encoder and decoder
2. Attention with RNNs: use attention between encoder and decoder
 Using the attention mechanism while decoding to improve the ability to capture long-term dependencies
3. Attention only: Transformer
 Uses only attention, both for long-term dependencies and to reduce calculation cost
RNNs: processing sequential data
 Recurrent Neural Network (RNN): a neural network with a hidden state h; at each time
step t, h_t is computed from the input at t and the previous hidden state h_{t-1}
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Luong et al. 2015
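A minimal NumPy sketch of this recurrence, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), with toy dimensions of our choosing:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy sizes for illustration: input dimension 4, hidden dimension 3
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(3, 3))
b_h = np.zeros(3)

h = np.zeros(3)                      # initial hidden state h_0
for x_t in rng.normal(size=(5, 4)):  # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (3,)
```

LSTM and GRU cells replace the single tanh update with gated updates, but the sequential pattern of reusing the previous hidden state is the same.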
RNNs: processing sequential data
RNNs have shown promising results: using RNNs achieves performance close to the state of the art
of conventional phrase-based machine translation on the English-to-French task.
Luong et al. 2015
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
BLEU scores of several architectures on the Eng-Ger dataset (Luong et al., 2015)
Attention: aligning while translating
 Intuition: each time the proposed model generates a word in a translation, it searches for a
set of positions in the source sentence where the most relevant information is concentrated
(Bahdanau et al., 2015)
 Advantages over pure RNNs:
1. Does not encode the whole input into a single vector => no loss of information
2. Allows adaptive selection of which source positions the model should attend to
Bahdanau et al. 2015
https://github.com/tensorflow/nmt
Attention mechanism
 When decoding to predict an output word, compute a score between the hidden vector of the
current decoder state and all hidden vectors of the input sentence
https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb#scrollTo=TNfHIF71ulLu
Luong et al. 2015
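As a sketch, the simplest of the score functions in Luong et al. 2015 (the "dot" variant; function names here are ours) scores every encoder hidden state against the current decoder state, softmaxes the scores into alignment weights, and returns the weighted context vector:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_dot_attention(h_dec, enc_states):
    """Dot-score attention: score = h_enc . h_dec for every source position,
    softmax into alignment weights, then take the weighted sum of encoder states."""
    scores = enc_states @ h_dec   # (src_len,) one score per source position
    weights = softmax(scores)     # alignment weights, sum to 1
    context = weights @ enc_states  # (hidden,) context vector for this step
    return context, weights

rng = np.random.default_rng(1)
enc = rng.normal(size=(6, 8))   # 6 source positions, hidden size 8 (toy sizes)
h_t = rng.normal(size=8)        # current decoder hidden state
context, weights = luong_dot_attention(h_t, enc)
print(weights.sum())  # ≈ 1.0
```

The context vector is then combined with the decoder state to predict the next target word; Luong et al. also propose "general" and "concat" score functions with learned parameters.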
Attention for long sentence translation
Bahdanau et al.: compared to a pure RNN decoder on the same dataset, tested on the combined
WMT '12 and '13 test sets
Luong et al.: tested on the WMT '14 Eng-Ger test set
Attention proved to be better than a plain RNN decoder for long sentence translation
Google NMT system
Attention between encoder and decoder
Wu et al., Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation
Self-attention for representation learning
 The essence of the encoder is to create a representation of the input
 Self-attention allows connections between words within one sentence while shortening the
path between words, which really differentiates it from RNNs and CNNs
 Input and output: the query, keys and values all come from the same source
http://deeplearning.hatenablog.com/entry/transformer
Lin et al. A Structured Self-Attentive Sentence Embedding
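The "same source" point can be made concrete with a NumPy sketch of scaled dot-product attention, where Q, K and V are all the same sentence matrix (toy dimensions of our choosing):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (len_q, len_k) similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Self-attention: queries, keys and values all come from the same sentence matrix
X = np.random.default_rng(0).normal(size=(5, 16))  # 5 tokens, model dimension 16
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (5, 16)
```

Every token attends to every other token in one step, so the path length between any two words is 1, versus O(n) for an RNN.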
Transformer: Attention is all you need
1. Uses self-attention instead of RNNs or CNNs
a. Multi-head attention for self-attention and source-target
attention
b. Position-wise feed-forward layers after attention
c. Masked multi-head attention to prevent target words from
attending to “future” words
d. Word embedding + positional encoding
2. Reduces total computational complexity per layer
3. Increases the amount of computation that can be parallelized
4. Enhances the ability to learn long-range dependencies
An architecture based solely on attention mechanisms
Attention Is All You Need
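Since attention itself is order-agnostic, the positional encoding in point 1d injects word order. A sketch of the sinusoidal variant from the paper (dimensions here are arbitrary toy choices):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2) even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions get cosine
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
```

These encodings are simply added to the word embeddings, so the same token gets a different input vector at different positions.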
Multi-head attention
 Attention is just a scaled dot product between the query and the keys, whose softmax weights the values
 (V, K, Q) are each projected into multiple sets of (v, k, q) so that each head can learn something different (multi-head)
 The head outputs are concatenated and then linearly transformed
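The three bullets above map directly onto code. A NumPy sketch of one multi-head layer (weight matrices and sizes are our toy choices, not the paper's trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
d_k = d_model // n_heads

# Hypothetical projection weights for one multi-head attention layer
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

def attention(Q, K, V):
    """Scaled dot-product attention, batched over heads."""
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, len, len)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                      # softmax per query
    return w @ V

def multi_head(X):
    """Project X into n_heads sets of (q, k, v), attend per head,
    then concatenate the head outputs and apply a final linear map."""
    L = X.shape[0]
    def split(W):  # (L, d_model) -> (heads, L, d_k)
        return (X @ W).reshape(L, n_heads, d_k).transpose(1, 0, 2)
    heads = attention(split(W_q), split(W_k), split(W_v))  # (heads, L, d_k)
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)  # concatenate heads
    return concat @ W_o                                    # final linear transform

out = multi_head(rng.normal(size=(5, d_model)))
print(out.shape)  # (5, 16)
```

Splitting d_model across heads keeps the total computation roughly equal to single-head attention while letting each head learn a different alignment pattern.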
State-of-the-art in MT
 Outperforms the best reported models (at the time of the paper) by more than 2.0 BLEU, at a
training cost far lower than those models
Attention Is All You Need
Visualization of self-attention in the Transformer
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
The model recognizes what “it” refers to (left: animal, right: street)
Potential of MT and the Transformer
1. Customer-oriented applications:
a. Text-to-text translation: Google Translate, Facebook translation …
=> Need better engines for rare and difficult languages such as Japanese, Vietnamese, Arabic …
b. Speech-to-text, speech-to-speech …
2. B2B applications:
a. Full-text translation (company documentation …)
b. Domain-specific real-time translation: for meetings, for workflow automation …
3. Scientific:
a. Attention and self-attention as an approach to other tasks such as language modeling and question
answering (BERT from Google: AI outperforms humans in question answering)
References
1. Vaswani et al. Attention Is All You Need, NIPS 2017
2. Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate
3. Lin et al. A Structured Self-Attentive Sentence Embedding
4. Goodfellow et al. Deep Learning, 2016
5. IWSLT 2015 dataset for English-Vietnamese: https://wit3.fbk.eu/mt.php?release=2015-01
6. Wu et al. Google’s Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation
7. https://machinelearningmastery.com/introduction-neural-machine-translation/
8. http://deeplearning.hatenablog.com/entry/transformer
9. https://medium.com/the-new-nlp/ai-outperforms-humans-in-question-answering-70554f51136b
