Jin-Woo Jeong
Network Science Lab
Dept. of Mathematics
The Catholic University of Korea
E-mail: zeus0208b@gmail.com
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
1
 Introduction
 Model Architecture
• Encoder and Decoder Stacks
• Attention
• Scaled Dot-Product Attention
• Multi-Head Attention
• Applications of Attention in our Model
• Position-wise Feed-Forward Networks
• Embeddings and Softmax
• Positional Encoding
Training
• Training Data
• Optimizer
• Regularization
Results
Conclusion
Q/A
2
Introduction
Introduction
 Until this paper, RNN-based encoder-decoder architectures were the state-of-the-art approach to sequence modeling and transduction tasks such as language modeling and machine translation. However, RNN-based models are inherently sequential: they cannot be parallelized within a training example, and memory constraints severely limit batching across examples, especially as sequence lengths grow.
 The attention mechanism has become an essential component of strong sequence modeling and transduction models across many tasks, since it allows dependencies to be modeled regardless of their distance in the input or output sequences. In almost all prior work, however, attention has been used in conjunction with an RNN.
 This paper presents a model, the Transformer, that avoids recurrence altogether and instead relies entirely on attention to capture dependencies between input and output. The Transformer is explicitly designed to allow much more parallelization and has the potential to set a new state of the art in translation quality.
3
Model Architecture
Model Architecture
 Like other competitive neural sequence transduction models, the Transformer uses an encoder-decoder architecture. The encoder maps an input sequence of symbol representations 𝑥 = (𝑥1, … , 𝑥𝑛) to a sequence of continuous representations 𝑧 = (𝑧1, … , 𝑧𝑛). Given 𝑧, the decoder then generates an output sequence of symbols 𝑦 = (𝑦1, … , 𝑦𝑚) one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
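 To make the auto-regressive loop concrete, here is a minimal greedy-decoding sketch. The decoder itself is replaced by a hypothetical stand-in (dummy_decoder), and the vocabulary and token ids are invented for illustration; only the loop structure reflects the paper.
```python
import numpy as np

VOCAB_SIZE = 8      # hypothetical toy vocabulary
BOS, EOS = 0, 1     # hypothetical start / end token ids

def dummy_decoder(z, prefix):
    """Stand-in for the Transformer decoder: returns next-token scores.
    A real decoder would attend over z (the encoder output) and the prefix."""
    rng = np.random.default_rng(len(prefix))
    return rng.normal(size=VOCAB_SIZE)

def greedy_decode(z, max_len=10):
    y = [BOS]
    for _ in range(max_len):
        scores = dummy_decoder(z, y)   # condition on everything generated so far
        next_token = int(np.argmax(scores))
        y.append(next_token)           # the new symbol becomes input for the next step
        if next_token == EOS:
            break
    return y

print(greedy_decode(z=np.zeros((5, 512))))  # z: encoder output for a 5-token source
```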
4
Model Architecture
Encoder and Decoder Stacks
 Encoder :
 The encoder is a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization.
 Decoder :
 The decoder is also a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, each decoder layer has a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection is applied around each sub-layer, followed by layer normalization (see the sketch below). In the decoder's self-attention, masking prevents each position from attending to subsequent positions when predicting the next word, so that no information leaks from future tokens during training.
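 As a rough illustration of the sub-layer wiring described above (not the authors' code), the sketch below applies the LayerNorm(x + Sublayer(x)) pattern around an arbitrary sub-layer; this simplified layer_norm omits the learnable gain and bias.
```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (no learnable scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # The pattern applied around every sub-layer: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)                  # (sequence length, d_model)
out = residual_block(x, lambda h: 0.1 * h)    # toy sub-layer standing in for attention/FFN
print(out.shape)                              # (10, 512)
```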
5
Attention
Attention
 An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values.
Scaled Dot-Product Attention
 The queries and keys have dimension 𝑑𝑘 and the values have dimension 𝑑𝑣. The attention function works as follows: first, compute the dot products of the queries with all keys and divide each by √𝑑𝑘; next, apply a softmax to obtain weights that reflect the similarity between the query and each key; finally, use these weights to take a weighted sum of the values, which is the output. In practice the queries, keys, and values of all words are packed into matrices 𝑄, 𝐾, 𝑉 (one word per row), so the whole computation, known as scaled dot-product attention, is performed with matrix operations (a short code sketch follows the formula below). Dividing by √𝑑𝑘 prevents large dot products from pushing the softmax into regions where its gradients are extremely small, which helps mitigate the vanishing gradient problem.
Attention(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾ᵀ / √𝑑𝑘) 𝑉
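 A minimal NumPy sketch of this formula (the shapes and the optional mask argument are my own conventions, not code from the paper):
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query with every key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~ -inf before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the values

Q, K, V = np.random.randn(4, 64), np.random.randn(6, 64), np.random.randn(6, 64)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64)
```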
6
Attention
Multi-Head Attention
 Using multiple heads (h = 8 in the paper), where the queries, keys, and values are projected into separate learned subspaces, attended over in parallel, and the resulting heads concatenated and projected once more, was found to improve performance compared to a single attention head.
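 One common way to realize this head-splitting is with simple reshapes, as in the sketch below; the random projection matrices stand in for learned parameters and are only for illustration.
```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """X: (n, d_model); each W_*: (d_model, d_model); returns (n, d_model)."""
    n, d_model = X.shape
    d_k = d_model // h

    def project_and_split(W):
        # Project, then split the feature dimension into h heads of size d_k.
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)   # (h, n, d_k)

    Q, K, V = project_and_split(W_q), project_and_split(W_k), project_and_split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)           # (h, n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                  # softmax per head
    heads = weights @ V                                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)      # concatenate the heads
    return concat @ W_o                                        # final output projection

d_model = 512
X = np.random.randn(10, d_model)
W = [np.random.randn(d_model, d_model) * 0.01 for _ in range(4)]
print(multi_head_attention(X, *W).shape)   # (10, 512)
```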
7
Attention
Applications of Attention in our Model
 In this paper, multi-head attention is used in three different ways:
1. In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, while the keys and values come from the output of the encoder stack.
2. The encoder contains self-attention layers, in which the queries, keys, and values are all derived from the same input, the output of the previous encoder layer.
3. The decoder also contains self-attention layers, but with masking applied. During training, teacher forcing feeds the ground-truth target sequence into the decoder, whereas at test time the output of the previous step is used as input. If the full target sequence were visible to multi-head attention, each position could reference future words when making its prediction, which would be inconsistent with how the model is used at test time. Hence, masking sets the attention scores for future positions to a very large negative value just before the softmax, so their weights become effectively zero (as sketched below).
[Figures: 1. encoder-decoder attention, 2. self-attention, 3. masking]
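 A small sketch of the causal (look-ahead) mask from point 3: it builds a lower-triangular boolean mask that can be passed to an attention function such as the one sketched earlier. The -1e9 fill value is a common stand-in for −∞, not a value specified in the paper.
```python
import numpy as np

def causal_mask(n):
    """True where attention is allowed: position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

scores = np.random.randn(5, 5)                   # raw decoder self-attention scores
masked = np.where(causal_mask(5), scores, -1e9)  # future positions get ~ -inf
print(masked.round(1))                           # upper triangle is blocked
```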
8
Position-wise Feed-Forward Networks
Position-wise Feed-Forward Networks
 Each layer of both the encoder and decoder contains a feed-forward network of the same form, applied to every position separately and identically; the parameters differ from layer to layer. It consists of two linear transformations with a ReLU activation in between (see the sketch below).
𝐹𝐹𝑁(𝑥) = max(0, 𝑥𝑊1 + 𝑏1)𝑊2 + 𝑏2
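 A minimal NumPy sketch of this two-layer network, using the paper's dimensions 𝑑𝑚𝑜𝑑𝑒𝑙 = 512 and 𝑑𝑓𝑓 = 2048 (the random initialization is only for illustration):
```python
import numpy as np

d_model, d_ff = 512, 2048
W1 = np.random.randn(d_model, d_ff) * 0.01; b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.01; b2 = np.zeros(d_model)

def position_wise_ffn(x):
    # Applied independently at every position: linear -> ReLU -> linear.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(10, d_model)      # 10 positions, d_model features each
print(position_wise_ffn(x).shape)     # (10, 512)
```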
9
Positional Encoding
Positional Encoding
 The attention mechanism alone cannot incorporate the positional information of each token, because dot products do not inherently reflect position. Therefore, the paper adds a positional encoding to the input embeddings to provide positional information, defined as follows (a short code sketch follows the formulas):
𝑃𝐸(𝑝𝑜𝑠, 2𝑖) = sin(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑𝑚𝑜𝑑𝑒𝑙))
𝑃𝐸(𝑝𝑜𝑠, 2𝑖+1) = cos(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑𝑚𝑜𝑑𝑒𝑙))
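 A short NumPy sketch of these sinusoids; the (max_len, d_model) shape convention is my own, the paper only specifies the formulas.
```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Returns an array of shape (max_len, d_model) with sinusoidal encodings."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices 2i
    angle = pos / np.power(10000.0, i / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

print(positional_encoding(50, 512).shape)   # (50, 512)
```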
10
Training
Training
 Dataset: WMT 2014 English-German dataset / WMT 2014 English-French dataset
 Optimizer: Adam with 𝛽1 = 0.9, 𝛽2 = 0.98, and 𝜖 = 10⁻⁹. Instead of a fixed learning rate, they vary the learning rate over the course of training according to the following schedule, often called the "Noam" scheduler (a minimal implementation is sketched after this list):
𝑙𝑟𝑎𝑡𝑒 = 𝑑𝑚𝑜𝑑𝑒𝑙^(−0.5) ∙ min(𝑠𝑡𝑒𝑝_𝑛𝑢𝑚^(−0.5), 𝑠𝑡𝑒𝑝_𝑛𝑢𝑚 ∙ 𝑤𝑎𝑟𝑚𝑢𝑝_𝑠𝑡𝑒𝑝𝑠^(−1.5))
 In this paper they used 𝑤𝑎𝑟𝑚𝑢𝑝_𝑠𝑡𝑒𝑝𝑠 = 4000.
 Warmup: the learning rate increases linearly for the first 𝑤𝑎𝑟𝑚𝑢𝑝_𝑠𝑡𝑒𝑝𝑠 training steps.
 Decay: afterwards it decreases proportionally to the inverse square root of the step number.
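 A minimal implementation of this schedule, assuming the paper's 𝑑𝑚𝑜𝑑𝑒𝑙 = 512 and 𝑤𝑎𝑟𝑚𝑢𝑝_𝑠𝑡𝑒𝑝𝑠 = 4000 as defaults:
```python
def noam_lrate(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps, then inverse-square-root decay.
    step = max(step, 1)   # avoid step == 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 4000, 100_000):
    print(s, round(noam_lrate(s), 6))   # rises, peaks around warmup_steps, then decays
```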
11
Results
Results
 The table above demonstrates that the Transformer achieves better
performance with fewer parameters compared to other models.
12
Results
Results
13
Results
Results
14
Results
Results
 (A) : Experimenting with different
values of h while keeping ℎ × 𝑑𝑘 =
512.
15
Results
Results
 (B) : Experimenting by reducing only
𝑑𝑘.
16
Results
Results
 (C) : Increasing the number of
parameters improves performance.
17
Results
Results
 (D) : Varying the dropout rate (to prevent overfitting) and the label smoothing value.
18
Results
Results
 (E) : Replacing the sinusoidal positional encoding with learned positional embeddings.
19
Conclusion
Conclusion
 This study introduces the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-head self-attention.
 For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers, and it achieved state-of-the-art results on the WMT 2014 English-German and English-French translation tasks.
 The authors are optimistic about the future of attention-based models and plan to apply them to other tasks. They aim to extend the Transformer to problems whose inputs and outputs are not text, such as images, audio, and video, using local, restricted attention mechanisms to handle large inputs and outputs efficiently. Making the generation process less sequential is another of their research goals.
20
Q & A
Q / A

