240318_JW_labseminar[Attention Is All You Need].pptx
Jin-Woo Jeong
Network Science Lab
Dept. of Mathematics
The Catholic University of Korea
E-mail: zeus0208b@gmail.com
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
• Introduction
• Model Architecture
  • Encoder and Decoder Stacks
  • Attention
    • Scaled Dot-Product Attention
    • Multi-Head Attention
    • Applications of Attention in our Model
  • Position-wise Feed-Forward Networks
  • Embeddings and Softmax
  • Positional Encoding
• Training
  • Training Data
  • Optimizer
  • Regularization
• Results
• Conclusion
• Q/A
Introduction
• Until this paper, RNN-based encoder-decoder architectures had established themselves as the state of the art in sequence modeling and transduction tasks such as language modeling and machine translation. However, RNN-based models are inherently sequential: computation cannot be parallelized within a training example, and memory constraints limit batching across examples, which becomes increasingly critical as sequence lengths grow.
• The attention mechanism has become an essential component of powerful sequence modeling and transduction models across many tasks, since it allows dependencies to be modeled regardless of their distance in the input or output sequences. Until this point, however, attention had in most cases been used in conjunction with RNNs.
• This paper presents the Transformer, a model that dispenses with recurrence entirely and relies solely on the attention mechanism to capture dependencies between input and output. The Transformer is explicitly designed for greater parallelization and can reach a new state of the art in translation quality.
Model Architecture
• Like other competitive neural sequence transduction models, the Transformer employs an encoder-decoder architecture. The encoder maps an input sequence of symbol representations x = (x₁, …, xₙ) to a sequence of continuous representations z = (z₁, …, zₙ). Given z, the decoder generates an output sequence of symbols y = (y₁, …, yₘ), one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when producing the next.
Encoder and Decoder Stacks
• Encoder:
• The encoder is a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization.
• Decoder:
• The decoder is also a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection is applied around each sub-layer, followed by layer normalization. The self-attention sub-layers in the decoder are masked so that a position cannot attend to subsequent positions; this prevents information about future tokens from leaking into the prediction during training.
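The residual-connection-plus-layer-normalization pattern wrapped around every sub-layer can be sketched as follows (a minimal NumPy sketch, not the authors' code; the learnable gain and bias of layer normalization are omitted for brevity, and the lambda stands in for an attention or feed-forward sub-layer):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x)): residual connection around a sub-layer,
    followed by layer normalization."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(3, 8))  # 3 positions, toy d_model = 8
y = residual_sublayer(x, lambda v: 0.5 * v)       # stand-in for attention/FFN
```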
Attention
• An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values.
Scaled Dot-Product Attention
• The queries and keys have dimension d_k, and the values have dimension d_v. The attention function proceeds as follows: first, take the dot products of the query with all keys and scale them by 1/√d_k; next, apply a softmax to compute weights from the similarity between the query and each key; finally, take the weighted sum of the values to obtain the output. The matrices Q, K, and V hold the query, key, and value vectors for each word, one per row, so scaled dot-product attention computes all attention scores with matrix operations. Scaling by 1/√d_k prevents large dot products from pushing the softmax into regions with extremely small gradients, which helps mitigate the vanishing gradient problem.
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
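The steps above can be sketched in NumPy (an illustrative sketch, not the reference implementation; the toy shapes and the max-subtraction trick for numerical stability are my additions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -- one row per token."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the keys
    return weights @ V                             # weighted sum of values

# Toy check: with identical keys the weights are uniform,
# so the output is the mean of the value rows.
Q = np.ones((1, 4))
K = np.ones((3, 4))
V = np.arange(6.0).reshape(3, 2)
out = scaled_dot_product_attention(Q, K, V)        # mean of V's rows: [[2., 3.]]
```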
Multi-Head Attention
• Instead of a single attention head, the Transformer uses multiple heads (here, h = 8), each with its own learned linear projections of the queries, keys, and values; the heads are computed in parallel, their outputs concatenated and projected once more. This has been found to improve performance over a single attention head.
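A minimal sketch of multi-head attention under these assumptions (random untrained projection matrices, d_model = 64 split over h = 8 heads; real implementations fuse the per-head projections into single matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 64, 8
d_k = d_model // h  # each head works in a d_k = 8 subspace

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Project X into h heads, attend per head, concatenate, project back.
    X: (n, d_model); W_q/W_k/W_v: h matrices of shape (d_model, d_k);
    W_o: (h*d_k, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))  # per-head attention weights
        heads.append(A @ V)                  # (n, d_k) per head
    return np.concatenate(heads, axis=-1) @ W_o  # (n, d_model)

X = rng.normal(size=(5, d_model))  # 5 tokens
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
```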
Applications of Attention in our Model
• In this paper, multi-head attention is employed in three different ways:
1. In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, while the keys and values come from the output of the encoder stack.
2. The encoder contains self-attention layers, in which the queries, keys, and values are all generated from the same vectors.
3. The decoder also contains self-attention layers, but with masking applied. During training, teacher forcing is used, whereas at test time the output of the previous step is fed back as input. If the ground-truth target sequence were used directly in multi-head attention, predictions could reference future words, which is contradictory. Hence, the attention scores for subsequent positions are set to a very large negative value just before the softmax, so their weights become effectively zero.
[Figures: 1. encoder-decoder attention, 2. self-attention, 3. masking]
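The masking described in point 3 can be sketched as follows (an illustrative sketch; the choice of −1e9 as the "very large negative value" is a common convention, not taken from the paper):

```python
import numpy as np

def causal_mask(n):
    """Additive mask: position i may not attend to positions j > i.
    Disallowed entries get a very large negative value so the
    softmax drives their weights to ~0."""
    return np.triu(np.ones((n, n)), k=1) * -1e9  # -1e9 above the diagonal

# Apply the mask to (here, all-zero) attention scores and softmax row-wise.
scores = np.zeros((4, 4)) + causal_mask(4)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Row 0 can only attend to itself; row 3 attends uniformly to all 4 positions.
```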
Position-wise Feed-Forward Networks
• Every layer of the encoder and decoder contains a feed-forward network of the same size, applied to each position identically, though each layer has its own parameters. It consists of two linear transformations with a ReLU activation in between.
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Positional Encoding
• The attention mechanism alone cannot capture the positional information of each token, because dot products do not inherently reflect token order. Therefore, this paper adds positional encodings to provide that information, as follows.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
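These two formulas can be computed for a whole sequence at once (a minimal sketch; the toy sizes 50 × 16 are illustrative):

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(n_pos)[:, None]                # (n_pos, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (n_pos, d_model/2)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims get cosine
    return pe

pe = positional_encoding(50, 16)  # added to the token embeddings
```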
Training
• Dataset: WMT 2014 English-German dataset / WMT 2014 English-French dataset
• Optimizer: Adam with β₁ = 0.9, β₂ = 0.98, and ε = 10⁻⁹. Instead of a fixed learning rate, they introduced the "Noam" schedule:

lrate = d_model^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5))

• In this paper they used warmup_steps = 4000: the learning rate increases linearly over the warmup steps, then decays proportionally to the inverse square root of the step number.
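The schedule above can be written directly (a sketch; d_model = 512 matches the base model, and the warmup and decay branches meet exactly at step_num = warmup_steps):

```python
def noam_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step_num^-0.5,
                                  step_num * warmup_steps^-1.5).
    Linear warmup for the first warmup_steps, then 1/sqrt(step) decay."""
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

peak = noam_lrate(4000)  # the schedule peaks at the end of warmup
```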
Results
• The results table demonstrates that the Transformer achieves better performance with fewer parameters than competing models.
Conclusion
• This study introduces the Transformer, the first sequence-transduction model that replaces the commonly used recurrent layers with multi-head attention in an encoder-decoder architecture.
• On translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers, and it achieved state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation tasks.
• The authors are optimistic about the future of attention-based models and plan to apply them to other tasks. They aim to extend the Transformer to inputs and outputs beyond text, such as images, audio, and video, efficiently processing large inputs and outputs with local, restricted attention mechanisms. Making the generation process less sequential is another of their research goals.