Attention mechanism
in Neural Machine Translation
PHAM QUANG KHANG
Machine Translation task in NLP
1. Definition: translating text from one language to another
2. Evaluation datasets: public datasets consisting of sentence pairs in two languages (source
language and target language)
a. Most common in research: Eng-Fra, Eng-Ger
b. Vietnamese: Eng-Vi, 133k sentence pairs
French-to-English translations from newstest2014 (Artetxe et al., 2018)

Eng-to-Vi example from the IWSLT'15 dataset (133K sentence pairs):
Source: And of course, we all share the same adaptive imperatives.
Reference: Và tất nhiên, tất cả chúng ta đều trải qua quá trình phát triển và thích nghi như nhau.
BLEU score: the standard evaluation metric for MT
1. Definition: BLEU compares n-grams of the candidate translation with n-grams of the reference
translation and counts the number of matches (Papineni et al., 2002)
2. Calculation: modified (clipped) n-gram precision, combined with a brevity penalty
Example (Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation):
Candidate: the the the the the the the
Reference: The cat is on the mat
Count_clip(the) = 2, Count(the) = 7
Modified unigram precision = 2/7
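To make the clipping concrete, here is a minimal Python sketch of the modified unigram precision used in the example above; the function name and the plain word-splitting tokenization are my own illustrative choices, not code from the paper.

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Clipped unigram precision (Papineni et al., 2002)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clip each candidate count by the count observed in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Count_clip(the) = 2, Count(the) = 7  =>  2/7
print(modified_unigram_precision("the the the the the the the",
                                 "The cat is on the mat"))  # ~0.286
```

The full BLEU score combines clipped precisions for n-grams up to length 4 (as a geometric mean) with a brevity penalty for short candidates.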
NMT as a hot topic for research
• The number of published papers on NMT spiked last year
• Key players:
• Facebook: tackling low-resource languages (Turkish, Vietnamese …)
• Amazon: improving efficiency
• Google: improving the output quality of NMT
• Business need is higher than ever, since automatic translation can save massive costs for global firms
Number of NMT papers in the last few years (papers counted from arXiv)
Source: https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/
Main approaches for NMT recently
1. Recurrent Neural Networks (RNNs):
• Using LSTM, GRU, or bidirectional RNNs for the encoder and decoder
2. Attention with RNNs: attention between the encoder and decoder
• Using an attention mechanism while decoding to improve the ability to capture long-term dependencies
3. Attention only: the Transformer
• Using only attention, both to capture long-term dependencies and to reduce computation cost
RNNs: processing sequential data
• Recurrent Neural Network (RNN): a neural network with a hidden state h; at each time step t,
h_t is computed from the input x_t and the previous hidden state h_{t-1} (sketched below)
Sources: http://colah.github.io/posts/2015-08-Understanding-LSTMs/; Luong et al., 2015
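To make the recurrence concrete, a minimal NumPy sketch of a vanilla RNN step is shown below; the weight names W, U, b and the tanh nonlinearity follow the textbook formulation, not code from the cited sources.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One time step: h_t = tanh(W @ h_{t-1} + U @ x_t + b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

hidden, embed, length = 4, 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden, hidden))   # recurrent weights
U = rng.normal(size=(hidden, embed))    # input weights
b = np.zeros(hidden)

h = np.zeros(hidden)                           # initial hidden state
for x_t in rng.normal(size=(length, embed)):   # toy input sequence, one vector per word
    h = rnn_step(x_t, h, W, U, b)              # h_t depends on x_t and h_{t-1}
```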
RNNs: processing sequential data
RNNs have shown promising results: RNN-based systems achieve performance close to the state of the art
of conventional phrase-based machine translation on the English-to-French task.
BLEU scores of several architectures on the Eng-Ger dataset (Luong et al., 2015)
Attention: aligning while translating
• Intuition: each time the model generates a word in the translation, it searches for a set of
positions in the source sentence where the most relevant information is concentrated
(Bahdanau et al., 2015)
• Advantages over pure RNNs:
1. Does not encode the whole input into a single vector => less information is lost
2. Allows the model to adaptively select which parts of the source to attend to
Sources: Bahdanau et al., 2015; https://github.com/tensorflow/nmt
Attention mechanism
• When decoding to predict an output word, compute a score between the hidden vector of the current
decoder state and all hidden vectors of the input sentence (sketched below)
Sources: Luong et al., 2015; https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb#scrollTo=TNfHIF71ulLu
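A minimal sketch of this scoring step in the spirit of Luong et al.'s global dot-product attention; the shapes and variable names are illustrative assumptions, not the implementation from the linked notebook.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_attention(dec_state, enc_states):
    """dec_state: (d,); enc_states: (T, d), one hidden vector per source word."""
    scores = enc_states @ dec_state   # score(h_t, h_s) = h_t . h_s for every source position
    weights = softmax(scores)         # alignment weights over the source sentence
    context = weights @ enc_states    # weighted sum of encoder hidden states
    return context, weights

rng = np.random.default_rng(1)
context, weights = dot_attention(rng.normal(size=8), rng.normal(size=(6, 8)))
# context is then combined with the decoder state to predict the next target word
```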
Attention for long sentence translation
Bahdanau et al.: compared against a pure RNN decoder on the same dataset, tested on the combined
WMT'12 and WMT'13 test sets.
Luong et al.: tested on the WMT'14 Eng-Ger test set.
Attention proved better than a plain RNN decoder for long-sentence translation.
Google NMT system
Attention between the encoder and decoder
Wu et al., Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation
Self-attention for representation learning
• The essence of the encoder is to create a representation of the input
• Self-attention allows direct connections between words within one sentence while shortening the
path between them, which clearly differentiates it from RNNs and CNNs
• Input and output: the query, keys, and values all come from the same source
Sources: http://deeplearning.hatenablog.com/entry/transformer; Lin et al., A Structured Self-Attentive Sentence Embedding
Transformer: Attention is all you need
1. Uses self-attention instead of RNNs or CNNs
a. Multi-head attention for self-attention and source-target attention
b. Position-wise feed-forward layers after attention
c. Masked multi-head attention to prevent target words from attending to "future" words
d. Word embedding + positional encoding (see the sketch below)
2. Reduces total computational complexity per layer
3. More of the computation can be parallelized
4. Enhances the ability to learn long-range dependencies
An architecture based solely on attention mechanisms (Attention Is All You Need)
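Because self-attention alone is order-agnostic, the Transformer injects word order through the positional encoding mentioned in item 1d above. Below is a short NumPy sketch of the sinusoidal encoding from the paper; the array shapes and variable names are my own choices.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

# Added to the word embeddings before the first encoder/decoder layer
pe = positional_encoding(max_len=50, d_model=512)
```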
Multi-head attention
• The attention operation itself is only a scaled dot product between the query and the keys, whose softmax-normalized scores weight the values
• Each (V, K, Q) is projected into multiple sets of (v, k, q) so that different heads can learn different relations (multi-head)
• The head outputs are concatenated and then linearly transformed (see the sketch below)
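A compact NumPy sketch of scaled dot-product attention and the multi-head wrapper described above (the sketch referenced in the last bullet); the random projection matrices and the optional mask argument are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:                         # e.g. to hide "future" target words in the decoder
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, n_heads, rng):
    d_model = Q.shape[-1]
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):                     # each head has its own projections
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    Wo = rng.normal(size=(d_model, d_model))     # final linear transform after concatenation
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 64))                     # 5 tokens, d_model = 64
out = multi_head_attention(x, x, x, n_heads=8, rng=rng)   # self-attention: Q = K = V
```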
State-of-the-art in MT
• Outperforms the best reported models (at the time of the paper) by more than 2.0 BLEU, at a far
lower training cost than those models
Attention Is All You Need
Visualization of self-attention in the Transformer
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
The model recognizes what "it" refers to (left: animal, right: street)
Potential of MT and Transformer
1. Customer-oriented applications:
a. Text-to-text translation: Google Translate, Facebook translation…
=> Need better engines for rare and difficult languages such as Japanese, Vietnamese, Arabic…
b. Speech-to-text, speech-to-speech …
2. B2B applications:
a. Full-text translation (company documentation …)
b. Domain-specific real-time translation: for meetings, for workflow automation…
3. Scientific:
a. Attention and self-attention as an approach for other tasks such as language modeling and question
answering (BERT from Google: AI outperforms humans in question answering)
References
1. Vaswani et al., Attention Is All You Need, NIPS 2017
2. Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate
3. Lin et al., A Structured Self-Attentive Sentence Embedding
4. Goodfellow et al., Deep Learning, 2016
5. IWSLT 2015 dataset for English-Vietnamese: https://wit3.fbk.eu/mt.php?release=2015-01
6. Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
7. https://machinelearningmastery.com/introduction-neural-machine-translation/
8. http://deeplearning.hatenablog.com/entry/transformer
9. https://medium.com/the-new-nlp/ai-outperforms-humans-in-question-answering-70554f51136b