Attention Is All You Need
Reading Seminar
Kyoto University, Kashima lab
Daiki Tanaka
Today’s Paper : “Attention Is All You Need”
• Conference : NIPS 2017
• Cited 966 times.
• Authors :
• Ashish Vaswani (Google Brain)
• Noam Shazeer (Google Brain)
• Niki Parmar (Google Research)
• Jakob Uszkoreit (Google Research)
• Llion Jones (Google Research)
• Aidan N. Gomez (University of Toronto)
• Łukasz Kaiser (Google Brain)
• Illia Polosukhin
Background : Natural language processing is an important
application of ML
• There are many tasks that are solved with NLP techniques.
• Sentence classification
• Sentence to sentence
• Sentence to tag …
Background : former techniques are not good at parallelization
• RNNs, LSTMs, and GRUs have been SOTA in sequence
modeling problems.
• An RNN generates a sequence of hidden states, each a function
of the previous hidden state and the current input.
• This sequential nature prevents parallelization within
training samples, especially when sequences are long.
• Attention mechanisms have become an integral part of
recent models, but they are typically used together with
a recurrent network.
Background – challenge of computational cost
• Sequential computation can be reduced by using CNNs,
which compute hidden states in parallel.
• But in these methods, the number of operations required to relate
two arbitrary input or output positions grows with the distance
between those positions. → It is difficult to learn dependencies
between distant positions.
Summary
• Using attention mechanisms allows us to draw global
dependencies between input and output with a constant
number of operations.
• In this work, they propose the Transformer, which uses
neither recurrence nor convolution, and reaches
state-of-the-art translation quality.
Problem Setting : Sequence to sequence
• Input :
• Sequence of symbol representations $(x_1, x_2, \ldots, x_n)$
• “This is a pen.”
• Output :
• Sequence of symbol representations $(y_1, y_2, \ldots, y_m)$
• 「これ は ペン です。」 (Japanese for “This is a pen.”)
Entire Model Architecture
• Left side : Encoder
• Right side : Decoder
• Constituent layers:
• Multi-Head Attention layer
• Position-wise Feed-Forward layer
• Positional Encoding
• (Residual connections and layer normalization)
[Architecture diagram: self-attention in the encoder, self-attention in the decoder, and encoder-decoder attention]
1. Attentions
What is Attention?
• Attention : (vector, matrix, matrix) → vector
$\mathrm{Attention}(\mathrm{query}, \mathrm{Key}, \mathrm{Value}) = \mathrm{Softmax}(\mathrm{query} \cdot \mathrm{Key}^{T}) \cdot \mathrm{Value}$
What is Attention?
• $\mathrm{Attention}(\mathrm{query}, \mathrm{Key}, \mathrm{Value}) = \mathrm{Softmax}(\mathrm{query} \cdot \mathrm{Key}^{T}) \cdot \mathrm{Value}$
• The softmax term is the attention weight: it searches for keys that are
similar to the query and returns a weighted combination of the
corresponding values.
• An attention function can therefore be described as a (soft) dictionary lookup.
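As a toy illustration of this dictionary view (not from the paper; the vectors and numbers below are made up), a query is softly matched against stored keys and the corresponding values are mixed with the softmax weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

# Three stored key vectors and their associated value vectors (toy numbers).
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
values = np.array([[10.0],
                   [20.0],
                   [30.0]])

query = np.array([1.0, 0.1])        # similar to the first and third keys

weights = softmax(query @ keys.T)   # similarity of the query to every key
output = weights @ values           # softmax-weighted average of the values
print(weights)                      # the first and third keys get the most weight
print(output)
```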
What is Attention?
• Single-query attention : (vector, matrix, matrix) → vector
$\mathrm{Attention}(\mathrm{query}, \mathrm{Key}, \mathrm{Value}) = \mathrm{Softmax}(\mathrm{query} \cdot \mathrm{Key}^{T}) \cdot \mathrm{Value}$
• We can also input multiple queries at one time :
(matrix, matrix, matrix) → matrix
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}(Q K^{T}) \, V$
Scaled Dot-Product Attention
• The two most commonly used attention functions are
additive attention and dot-product attention.
• Additive attention : $\mathrm{Softmax}(\sigma(W [Q; K] + b))$
• Dot-product attention : $\mathrm{Softmax}(Q K^{T})$
• Dot-product attention is much faster and more
space-efficient.
• They compute the attention as:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$
where $\sqrt{d_k}$ is a scaling factor that prevents the
softmax function from being pushed into regions where it
has extremely small gradients.
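A minimal NumPy sketch of this formula, assuming queries, keys, and values are stored as rows (an illustrative re-implementation, not the authors' code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k) scaled similarities
    scores = scores - scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted sum of the values

# Toy usage: 2 queries, 3 key/value pairs, d_k = 4, d_v = 5.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
print(scaled_dot_product_attention(Q, K, V).shape)   # (2, 5)
```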
Multi-Head Attention
• Instead of calculating a single dot-product attention,
they calculate h attention “heads” in parallel (for example, h = 8).
• They linearly project Q, K, and V h times with
different learned projections to $d_k$, $d_k$, and $d_v$ dimensions.
(They use $d_k = d_v = d_{model} / h$.)
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) \, W^{O}$
$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$
• $W_i^{Q}, W_i^{K}, W_i^{V}$, and $W^{O}$ are parameter weight matrices.
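A sketch of this computation under the base-model sizes ($d_{model} = 512$, $h = 8$); the random matrices below merely stand in for the learned projections $W_i^{Q}, W_i^{K}, W_i^{V}, W^{O}$:

```python
import numpy as np

def sdpa(Q, K, V):
    """Scaled dot-product attention (same as the previous sketch)."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv: lists of per-head projection matrices; Wo: (h*d_v, d_model)."""
    heads = [sdpa(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo   # concat heads, project back to d_model

d_model, h = 512, 8
d_k = d_v = d_model // h                         # 64, as in the base model
rng = np.random.default_rng(0)
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wo = rng.normal(size=(h * d_v, d_model))
X = rng.normal(size=(10, d_model))               # 10 token embeddings; self-attention: Q = K = V = X
print(multi_head_attention(X, X, X, Wq, Wk, Wv, Wo).shape)   # (10, 512)
```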
Applications of Multi-Head Attention in model
• Transformer uses multi-head attention in three ways.
1. Encoder-decoder attention : The queries (Q) come from
the previous decoder layer, and the keys (K) and values (V)
come from the output of the encoder (as in traditional
attention mechanisms).
2. Self-attention layers in the encoder : All of the keys(K),
values(V) and queries(Q) come from the same place, in
this case, the output of the previous layer in the
encoder.
3. Self-attention layers in the decoder : K,V,Q come from the
output of the previous layer in the decoder. We need to
prevent leftward information flow in the decoder to
preserve the auto-regressive property.
Data Flow in Attention (Multi-Head)
[Figure: data flow from the input embeddings through the multi-head attention block to its output.]
Cited from : https://jalammar.github.io/illustrated-transformer/
Self-Attention in Decoder
• In the decoder, the self-attention layer is only allowed to
attend to earlier positions in the output sequence. This is
done by masking future positions (setting them to -inf)
before the softmax step in the self-attention calculation.
• For example, if the sentence “I play soccer in the morning.” is
given to the decoder and we apply self-attention for
“soccer” (let “soccer” be the query), it can attend to “I”,
“play”, and “soccer” itself, but not to “in”, “the”, or “morning”.
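A minimal sketch of this masking step, assuming unprojected self-attention over a toy embedding matrix X (the function name and sizes are made up for illustration):

```python
import numpy as np

def masked_self_attention(X):
    """Decoder-style self-attention: position t may only attend to positions <= t."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    scores[mask] = -np.inf                             # block attention to future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                           # masked entries become exp(-inf) = 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

# 6 toy "token" embeddings, e.g. "I play soccer in the morning".
X = np.random.default_rng(0).normal(size=(6, 8))
out = masked_self_attention(X)
print(out.shape)   # (6, 8); output row t mixes only tokens 0..t
```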
Why Self-Attention?
1. Total computational complexity per layer
2. The amount of computation that can be parallelized
3. The path length between long-range dependencies
4. (As a side benefit, self-attention could yield more interpretable
models.)
2. Feed Forward Layer
Position-wise Feed-Forward Networks
• Feed-forward networks are applied to each position
separately.
$\mathrm{FFN}(x) = \mathrm{ReLU}(x W_1 + b_1) \, W_2 + b_2$
[Figure: the same FFN, $\mathrm{ReLU}(x_i W_1 + b_1) W_2 + b_2$, is applied independently to each word position $x_1, \ldots, x_7$; the input and output embedding dimensions are both 512.]
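A sketch of the position-wise FFN under these assumptions (random weights; the inner dimension $d_{ff} = 2048$ is the paper's base setting):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer network to every position (row of X) independently."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU(x W1 + b1)
    return hidden @ W2 + b2                 # second linear layer

# Paper's base sizes: d_model = 512, inner dimension d_ff = 2048; 7 word positions.
d_model, d_ff, n = 512, 2048, 7
rng = np.random.default_rng(0)
W1, b1 = 0.02 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
X = rng.normal(size=(n, d_model))           # embeddings x1 ... x7
print(position_wise_ffn(X, W1, b1, W2, b2).shape)   # (7, 512): same weights at every position
```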
3. Positional Encoding
Positional Encoding
• Since their model contains no recurrence or convolution, some
information about the position of the tokens in the sequence
must be injected.
• So they add a “positional encoding” to the input embeddings.
• Each element of PE is as follows:
$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
• $pos$ is the position of the word in the sequence, and $i$ is the index of the dimension in the word
embedding.
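A small sketch that builds this encoding as a matrix (the function name is my own; in the model the encoding is added to the token embeddings, not concatenated):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape, float(pe.min()), float(pe.max()))  # (50, 512); values lie in [-1, 1]
```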
Positional Encoding
[Figure: heat map of the positional encoding matrix; rows are the word index in the sequence, columns are the dimensions of the embedding vector, values range from -1.0 to +1.0.]
Cited from :
https://github.com/soskek/attention_is_all_you_need/blob/master/Observe_Position_Encoding.ipynb
Positional Encoding
Cited from : https://jalammar.github.io/illustrated-transformer/
4. FC and Softmax layer
Final FC and softmax layer
Cited from : https://jalammar.github.io/illustrated-transformer/
Using Beam-search in selecting model prediction
• When selecting the model output, we can take the word with the
highest probability and discard the remaining word
candidates : greedy decoding.
• Another way to select the model output is beam search.
• Example next-word distribution : I 0.1, you 0.2, he 0.1, she 0.6
→ greedy decoding picks “she”.
Beam-search
• beam-search
• Instead of only keeping the token with the best score,
we keep track of k hypotheses (for example k = 5; we refer
to k as the beam size).
• At each new time step, each of the k hypotheses can be extended by V
possible tokens (V = vocabulary size), giving a total of kV new
hypotheses. We then keep only the top k hypotheses, and so on.
• The maximum length of the hypotheses to keep is also a parameter.
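A simplified beam-search sketch over a toy next-token model (the function names, toy probabilities, and stopping rule are made up for illustration; real decoders typically also apply a length penalty):

```python
import math

def beam_search(next_log_probs, beam_size, max_len, bos=0, eos=1):
    """Keep the top-`beam_size` hypotheses at every step; return the best one.
    `next_log_probs(prefix)` -> {token: log-probability of that token coming next}."""
    beams = [([bos], 0.0)]                         # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                     # finished hypotheses are carried over
                candidates.append((seq, score))
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # keep only the k best hypotheses by cumulative log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

# Toy next-token model: after the start token prefer token 2, after token 2 prefer EOS (= 1).
def toy_model(prefix):
    if prefix[-1] == 2:
        return {1: math.log(0.9), 2: math.log(0.1)}
    return {2: math.log(0.6), 3: math.log(0.3), 1: math.log(0.1)}

print(beam_search(toy_model, beam_size=2, max_len=4))   # ([0, 2, 1], log(0.6 * 0.9))
```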
Experiment – sequence-to-sequence task
• Data
• WMT2014 English-German : 4.5 million sentence pairs
• WMT2014 English-French : 36 million sentences
• Hardware and Schedule
• 8 NVIDIA P100 GPUs
• Base model : 100,000 steps or 12 hours
• Big model : 300,000 steps (3.5 days)
• Optimizer : Adam
• Warm up, and then decrease learning rate.
Experiment – evaluation metrics
• BLEU (Bilingual Evaluation Understudy)
• An evaluation metric measuring how similar the machine
translation (MT) is to the ground-truth (GT) translation.
$BLEU = BP_{BLEU} \cdot \exp\!\left(\sum_{n=1}^{N} \frac{1}{N} \log p_n\right)$, usually N = 4.
• $BP_{BLEU}$ : a brevity penalty, multiplied in when len(MT) < len(GT)
• $p_n = \frac{\sum_i (\text{number of } n\text{-grams shared between } MT_i \text{ and } GT_i)}{\sum_i (\text{number of } n\text{-grams in } MT_i)}$
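A simplified, sentence-level sketch of this metric (the official BLEU is corpus-level; this toy version uses clipped n-gram precision and the brevity penalty above, and its smoothing of zero counts is my own choice):

```python
import math
from collections import Counter

def simple_bleu(mt_tokens, gt_tokens, N=4):
    """Toy sentence-level BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    log_p_sum = 0.0
    for n in range(1, N + 1):
        mt_ngrams = Counter(tuple(mt_tokens[i:i + n]) for i in range(len(mt_tokens) - n + 1))
        gt_ngrams = Counter(tuple(gt_tokens[i:i + n]) for i in range(len(gt_tokens) - n + 1))
        overlap = sum(min(count, gt_ngrams[g]) for g, count in mt_ngrams.items())
        total = max(sum(mt_ngrams.values()), 1)
        p_n = max(overlap, 1e-9) / total            # avoid log(0) in this toy version
        log_p_sum += math.log(p_n) / N
    # brevity penalty: only applied when the MT output is shorter than the reference
    bp = 1.0 if len(mt_tokens) >= len(gt_tokens) else math.exp(1.0 - len(gt_tokens) / len(mt_tokens))
    return bp * math.exp(log_p_sum)

mt = "the cat is on the mat".split()
gt = "the cat sat on the mat".split()
print(round(simple_bleu(mt, gt), 3))
```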
Experiment - Result
Experiment – Changing the parameters in Transformer
[Table: results when varying the number of heads and other hyper-parameters, for the base model and the big model.]
Experiment – Self-attention visualization
Completing the phrase “making … more difficult”
Experiment – Self-attention visualization
Two different heads have
different attention weights.
Conclusion
• They presented the Transformer, the first sequence
transduction model based entirely on attention, replacing the
recurrent layers most commonly used in encoder-decoder
architectures with multi-headed self-attention.
• The Transformer can be trained significantly faster than
architectures based on recurrent or convolutional layers.
See details at:
• https://jalammar.github.io/illustrated-transformer/
• http://deeplearning.hatenablog.com/entry/transformer
Editor's Notes
  1. Separating the memory into Key and Value allows a non-trivial transformation between keys and values, which yields higher expressive power.
  2. Self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d.
  3. In the self-attention visualizations, different colors show different heads.