Paper Review
Attention Is All You Need
(Vaswani et al., 2017) [arXiv pre-print link]
Strong reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Santiago Pascual de la Puente
June 07, 2018
TALP UPC, Barcelona
Table of contents
1. Introduction
2. The Transformer
A Myriad of Attentions
Point-Wise Feed Forward Networks
The Transformer Block
3. Interfacing Token Sequences
Embeddings
Positional Encoding
4. Results
5. Conclusions
1/37
Introduction
Introduction
Recurrent neural networks (RNNs) and their cell variants are firmly
established as state of the art in sequence modeling and transduction
(e.g. machine translation).
In transduction we map a sequence X = {x1, · · · , xT } to another one
Y = {y1, · · · , yM } where T and M can be different, x_t ∈ R^{d_e} and y_m ∈ R^{d_d}.
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
2/37
Introduction
1. The encoder RNN will encode source symbols X = {x1, · · · , xT }
into useful abstractions to mix up contextual contents →
H = {h1, · · · , hT }, where ht = tanh(Wxt + Uht−1 + b).
2. The last encoder state h_T is typically taken as the summary of the
input, and it is injected into the decoder initial state h^d_0 = h_T.
3. The decoder RNN generates the target sequence one-by-one
(autoregressively) by feeding back its previous prediction y_{m−1} as
input, also conditioned on the encoder summary h^d_0 (sketched below).
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
3/37
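A minimal PyTorch sketch of this encoder-decoder loop (this is not the code from the linked notebook; the dimensions, the output projection and the greedy feedback are illustrative assumptions):

```python
import torch
import torch.nn as nn

d_e, d_d, hidden = 32, 32, 64            # assumed embedding / hidden sizes
enc = nn.RNN(d_e, hidden, nonlinearity='tanh', batch_first=True)
dec = nn.RNN(d_d, hidden, nonlinearity='tanh', batch_first=True)
proj = nn.Linear(hidden, d_d)            # maps decoder state back to an "embedding"

x = torch.randn(1, 10, d_e)              # source sequence X = {x_1, ..., x_T}, T = 10
H, h_T = enc(x)                          # H = {h_1, ..., h_T}; h_T summarizes the input

h_dec = h_T                              # decoder initial state h^d_0 = h_T
y_prev = torch.zeros(1, 1, d_d)          # e.g. a <sos> embedding
for m in range(5):                       # generate M = 5 target steps autoregressively
    o, h_dec = dec(y_prev, h_dec)        # conditioned on the encoder summary via h_dec
    y_prev = proj(o)                     # feed back the previous prediction
```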
Introduction
Encoding a whole sentence into one single vector would be great, but it is
unfeasible in practice. In the real world we need a mechanism that gives the decoder
hints on where to look in the encoder outputs, weighting the source vectors instead of
just taking the last one → ATTENTION MECHANISM.
• c_m = Σ_{t=0}^{T−1} α_t^m · h_t
• Each c_m corresponds to one row of the attention matrix (and is an
additional input to the decoder), and each α_t^m is one orange square
in the notebook's figure.
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
4/37
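A small sketch of that weighted sum for a single decoder step (the shapes and the simple dot-product scoring are assumptions for illustration, not the notebook's exact code):

```python
import torch
import torch.nn.functional as F

T, hidden = 10, 64
H = torch.randn(T, hidden)             # encoder states h_1, ..., h_T
s_m = torch.randn(hidden)              # current decoder state, acting as the query

scores = H @ s_m                       # one compatibility score per source position
alpha = F.softmax(scores, dim=0)       # alpha_t^m, the "orange squares" of row m
c_m = (alpha.unsqueeze(1) * H).sum(0)  # c_m = sum_t alpha_t^m * h_t
```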
Introduction
• RNNs factor computation along symbol time positions, generating
ht out of ht−1 → cannot parallelize in training:
ht = tanh(Wxt + Uht−1 + b)
• Attention is used with SOTA transduction RNNs → model
dependencies without regard to their distance in the input or output
sequences.
5/37
Introduction
• Let’s get rid of recurrence and rely entirely on attentions to draw
global dependencies b/w input and output.
• The Transformer is born, significantly boosting parallelization and
reaching new SOTA in translation.
6/37
The Transformer
The Transformer
We will have a new encoder-decoder structure, without any recurrence:
only fully connected layers (independent at every time-step) and
self-attention to merge global info in the sequences.
• Encoder will map X = {x1, · · · , xT } to a sequence of continuous
representations Z = {z1, · · · , zT }.
• Given Z the decoder will generate Y = {y1, · · · , yN }
• Still auto-regressive! But no recurrent connections at all.
7/37
The Transformer
8/37
Attention Generic Formulation
• Attention function maps a query and a set of key-value pairs to an
output: query, keys, values, and output are all vectors:
o = f (q, k, v)
• Output is computed as a weighted sum of the values.
• Weight assigned to each value is computed by a compatibility
function of the query with the corresponding key.
o_i = Σ_{t=0}^{T−1} g(q_i, k_t) · v_t
9/37
Scaled Dot-Product Attention
• Input: queries and keys of dimension dk and values of dimension dv .
• Compute the dot products of the query with all keys, divide each by
√d_k and apply a Softmax → obtain the weights on the values.
• FAST TRICK: compute the attention on a set of queries simultaneously,
packing the matrices Q, K, V.
Attention(Q, K, V) = Softmax(QK^T / √d_k) V
11/37
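A direct PyTorch sketch of this formula (the batching convention and the optional mask argument are my own additions, reused by later sketches):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # (..., T_q, T_k)
    if mask is not None:                                  # mask: True = allowed
        scores = scores.masked_fill(~mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)                   # weights on the values
    return weights @ V, weights

Q, K, V = torch.randn(2, 5, 64), torch.randn(2, 7, 64), torch.randn(2, 7, 64)
out, attn = scaled_dot_product_attention(Q, K, V)         # out: (2, 5, 64)
```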
The Fault In Our Scale
Wait... why do we divide the output of the matching function between
query and key by √d_k?
12/37
The Fault In Our Scale
Two most commonly used attention methods (to merge k and q):
• Additive: MLP with one hidden layer where vectors are
concatenated at input of MLP.
• Multiplicative: dot-product seen here → MUCH faster and more
space-efficient.
For small values of dk both behave similarly, but additive outperforms
dot-product for larger dk .
Suspicion: for large values of d_k, the dot-products grow large in magnitude,
pushing the Softmax into regions with extremely small gradients.
Assume the components of q and k are independent random variables with
µ = 0 and σ = 1 ⇒ then q · k = Σ_{i=1}^{d_k} q_i · k_i has µ = 0 and σ = √d_k.
We counteract this effect by scaling by 1/√d_k.
13/37
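A quick numerical check of this argument (a toy experiment of mine, not from the paper): with unit-variance components, the empirical std of q · k grows like √d_k.

```python
import torch

torch.manual_seed(0)
for d_k in (4, 64, 512):
    q = torch.randn(100_000, d_k)
    k = torch.randn(100_000, d_k)
    dots = (q * k).sum(dim=-1)
    # empirical std of the dot product vs. the predicted sqrt(d_k)
    print(d_k, dots.std().item(), d_k ** 0.5)
```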
Multi-Head Attention
Multi-head attention allows the model to jointly attend to information
from different representation subspaces at different positions. With a
single attention head, averaging inhibits this.
14/37
Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, · · · , head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}, W^O ∈ R^{h·d_v×d_model}
In this work h = 8 and d_k = d_v = d_model/h = 64.
15/37
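A sketch of this module with the paper's sizes (d_model = 512, h = 8, d_k = d_v = 64), reusing scaled_dot_product_attention from the earlier sketch; the class and variable names are my own:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)   # packs all h heads' W_i^Q
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # W^O, back to d_model

    def forward(self, Q, K, V, mask=None):
        B = Q.size(0)
        def split(x):                            # (B, T, d_model) -> (B, h, T, d_k)
            return x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.W_q(Q)), split(self.W_k(K)), split(self.W_v(V))
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.W_o(out)                     # Concat(head_1..head_h) W^O
```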
Multi-Head Attention
Transformer uses multi-head attention in three different ways:
1. Encoder-decoder attention layers: queries come from previous
decoder layer, and keys and values come from output of the encoder.
Every position in the decoder attends over all positions in the input
sequence. (Same type of attention as classical seq2seq).
2. Encoder contains self-attention layers: all keys, values and queries
come from same place, the previous encoder layer output. Thus
each position in the encoder can attend to all positions in the
encoder’s previous layer.
3. The decoder has the same self-attention mechanism, BUT... we must
prevent leftward information flow (it must remain autoregressive).
18/37
Decoder Attention Mask
Prevent leftward information flow inside the scaled dot-product attention
by masking out (setting to −∞) all values in the input of the Softmax
which correspond to "illegal" connections.
19/37
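A minimal sketch of that mask (the helper name is mine): a lower-triangular boolean matrix so that position i only attends to positions ≤ i, with the disallowed scores set to −∞ before the Softmax.

```python
import torch

def causal_mask(T):
    # True = allowed connection; the upper triangle (future positions) is False.
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

mask = causal_mask(5)
scores = torch.randn(5, 5)                         # raw q·k scores
scores = scores.masked_fill(~mask, float('-inf'))  # mask the "illegal" connections
weights = torch.softmax(scores, dim=-1)            # future positions get weight 0
```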
Point-Wise Feed Forward Networks
Simply an MLP applied to each time position with the same parameters:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
These can be seen as two 1D convolutions with kernel width 1. The
dimensionality of input and output is d_model = 512 and the inner layer has
dimensionality d_ff = 2048.
20/37
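The same two-layer MLP, applied independently at every position, in a few lines of PyTorch (the class name is mine; the sizes follow the slide):

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):                         # x: (B, T, d_model)
        return self.w2(torch.relu(self.w1(x)))    # max(0, xW_1 + b_1)W_2 + b_2
```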
Point-Wise Feed Forward Networks
21/37
The Transformer Block
If we mix a spoonful of Multi-Head Attention, another of Point-Wise FFN,
a pinch of residual connections and a spoonful of Add & LayerNorm ops, we obtain
the Transformer block:
22/37
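Putting those ingredients together, a sketch of one encoder block (reusing MultiHeadAttention and PositionwiseFFN from the sketches above; the post-norm ordering follows the original paper, and dropout is omitted for brevity):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.ffn = PositionwiseFFN(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.self_attn(x, x, x, mask))   # Add & Norm around MHA
        return self.norm2(x + self.ffn(x))                  # Add & Norm around FFN
```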
The Transformer Block
We can see how N stacks of these blocks form the whole Transformer
END-TO-END network. Note the extra enc-dec-attention in the
decoder blocks.
23/37
Interfacing Token Sequences
Embeddings
As in seq2seq models, we use learned embeddings to convert the input tokens
and output tokens to dense vectors of dimension d_model. There is also (of
course) an output linear transformation to go from d_model to the number of
classes, followed by a Softmax.
In the Transformer, all these 3 matrices are tied (the same parameters apply),
and in the embedding layers the weights are multiplied by √d_model.
24/37
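A sketch of that weight tying (the vocabulary size is a hypothetical placeholder):

```python
import math
import torch
import torch.nn as nn

d_model, vocab = 512, 32000                    # vocab size is hypothetical
embed = nn.Embedding(vocab, d_model)
out_proj = nn.Linear(d_model, vocab, bias=False)
out_proj.weight = embed.weight                 # embeddings and pre-softmax projection share parameters

tokens = torch.tensor([[5, 42, 7]])
x = embed(tokens) * math.sqrt(d_model)         # embeddings scaled by sqrt(d_model)
logits = out_proj(torch.randn(1, 3, d_model))  # same weights before the Softmax
```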
Positional Encoding
• Are we processing sequences? YES.
• Are we taking care of this fact? NO.
So let’s work it out.
28/37
Positional Encoding
• In order for the model to make use of the order of the sequence, we
must inject some information about the relative or absolute position
of the tokens in the sequence.
• Add positional encodings to the embeddings, summing them up so that the
positional info is merged into the input.
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
Where i is the dimension and pos the position (time-step). Each
dimension corresponds to a sinusoid, with wavelengths forming a
geometric progression. The frequency and offset of the wave is different
for each dimension.
29/37
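A sketch of these encodings as a lookup table to be summed with the token embeddings (the function name is mine):

```python
import torch

def positional_encoding(max_len, d_model=512):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dims 2i
    angles = pos / (10000 ** (i / d_model))                         # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                                 # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angles)                                 # PE(pos, 2i+1)
    return pe

pe = positional_encoding(100)    # (100, 512), added to the token embeddings
```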
Positional Encoding
At every time-step we will have a combination of sinusoids telling us where
we are relative to the beginning (through their combination of phases).
Advantage of these codes: generalization to any sequence length at test time
(the cyclic nature of the sinusoids, rather than values growing indefinitely).
30/37
Results
Results
31/37
Results
• On the WMT 2014 English-to-German translation task, the big
transformer model (Transformer (big)) outperforms the best
previously reported models (including ensembles) by more than 2.0
BLEU! (new SOTA of 28.4).
• Training took 3.5 days on 8 P100 GPUs. Even their base model
surpasses all previously published models and ensembles, at a
fraction of the training cost of any of the competitive models.
• On the WMT 2014 English-to-French translation task, the big
model achieves a BLEU score of 41.0, outperforming all of the
previously published single models, at less than 1/4 the training cost.
32/37
Results
Enc Layer2
33/37
Results
Enc Layer6
34/37
Results
Dec Layer2
35/37
Results
Dec-SRC Layer2
36/37
Conclusions
Conclusions
• The Transformer is the first sequence transduction model based
entirely on attention (replacing the recurrent layers most commonly
used in encoder-decoder architectures with multi-headed
self-attention).
• For translation tasks, the Transformer can be trained significantly
faster than architectures based on recurrent or convolutional layers.
• New SOTA on the WMT 2014 English-to-German and WMT 2014
English-to-French translation tasks.
• The code used to train and evaluate the original models is available at
https://github.com/tensorflow/tensor2tensor.
37/37
Thanks!
@santty128
37/37