240122_Attention Is All You Need (2017 NIPS)2.pptx
1. Min-Seo Kim
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: kms39273@naver.com
2. Background
• The Transformer is a model that takes an input sentence and
generates an output sentence.
• The Transformer is broadly divided into two parts: the Encoder and
the Decoder.
Model of Transformer
3. Background
• The Encoder is a function that takes a sentence as input and generates a
vector.
• The vector created through Encoding is referred to as the context, which,
as the name implies, is a vector that encapsulates the 'context' of the
sentence.
• The Encoder is trained with the goal of properly creating this context
(compressing the information in the sentence without omitting any
details).
Model of Transformer-Encoder
4. Background
• The Decoder is the opposite of the Encoder. It takes the context as input
and generates a sentence as output.
• The Decoder does not only receive the context as input but also a right-
shifted version of the sentence it is generating as output.
• For now, let's simply understand it as the concept of receiving some
sentence as an additional input.
Model of Transformer
5. Previous work
• In a Recurrent Network, to compute the hidden state h_i at time i, it is necessary to have h_(i−1). As the
calculation proceeds sequentially from the beginning to obtain h_0, h_1, ..., h_n, parallel processing is not
possible.
• On the other hand, in Self-Attention, assuming there are n tokens in a sentence, it performs n×n operations
to directly compute the relationships between all tokens.
• Since it establishes direct relationships without going through other intermediate tokens, it can capture
relationships more clearly compared to Recurrent Networks.
RNN vs Self-Attention
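The contrast above can be sketched in PyTorch (toy sizes; the recurrence and weight matrix are illustrative, not the slides' code):

```python
import torch

torch.manual_seed(0)
n, d = 5, 8                      # n tokens, hidden size d (toy values)
x = torch.randn(n, d)            # token representations

# Recurrent network: h_i needs h_(i-1), so the loop runs sequentially.
W = torch.randn(d, d) * 0.1
h = torch.zeros(d)
for i in range(n):               # cannot be parallelized over i
    h = torch.tanh(x[i] + h @ W)

# Self-Attention: all n x n pairwise relationships come from a single
# matrix product, computed in parallel and without intermediate tokens.
scores = x @ x.T                 # shape (n, n)
```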
6. Model
• We implement a simple Transformer model using PyTorch.
• Assuming that the encoder and decoder are already completed, we
receive them as arguments in the class constructor.
Model of Transformer
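A minimal sketch of such a class, assuming `encoder` and `decoder` are already-built modules (the names and signatures here are illustrative, not the slides' actual code):

```python
import torch.nn as nn

class Transformer(nn.Module):
    """Top-level model: encode the source sentence, then decode with the context."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder   # assumed to map src -> context
        self.decoder = decoder   # assumed to map (tgt, context) -> output

    def forward(self, src, tgt):
        context = self.encoder(src)       # context for the input sentence
        out = self.decoder(tgt, context)  # tgt is the right-shifted output sentence
        return out
```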
7. Model
• The Encoder consists of N stacked Encoder Blocks. In the paper, N=6
is used.
• When N Encoder Blocks are stacked to form the Encoder, the input to
the first Encoder Block is the sentence embedding that enters as the
input of the entire Encoder.
• Once the first block generates an output, this is used as the input for
the second block, and so on. The output of the last, Nth block
becomes the output of the entire Encoder, that is, the context.
Encoder
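The stacking described above can be sketched as follows, assuming each block maps its input to an output of the same shape (`block` here is any such module; the class is illustrative):

```python
import copy
import torch.nn as nn

class Encoder(nn.Module):
    """Stack of N identical Encoder Blocks; the paper uses N=6."""
    def __init__(self, block, N=6):
        super().__init__()
        # N independent copies of the same block architecture
        self.blocks = nn.ModuleList(copy.deepcopy(block) for _ in range(N))

    def forward(self, x):
        # The output of block i is the input of block i+1;
        # the last block's output is the context.
        for block in self.blocks:
            x = block(x)
        return x
```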
8. Model
• An Encoder Block receives an input and returns an output of the same shape, which is why N of them can be stacked one after another to form the Encoder.
Encoder Block
9. Methodology
• The input token embedding vector is placed into a fully connected layer to generate three vectors.
• Query: Represents the current token.
• Key: Represents the target token for which attention is being calculated.
• Value: Also represents the target token for which attention is being calculated (same as the Key token).
• For example, in the sentence 'The animal didn’t cross the street, because it was too tired,' when trying to
determine what 'it' refers to, the Query is fixed as 'it', and Key and Value are exactly the same token,
representing any one of all the tokens from the beginning to the end of the sentence.
• If Key and Value point to 'The', it means calculating the attention between 'it' and 'The'; if they point to the last
'tired', it means calculating the attention between 'it' and 'tired'.
• To find the token that matches the Query best (the one with the highest Attention), Key and Value are explored
from the beginning to the end of the sentence.
• The actual values of Key and Value are different due to the applied weights, but semantically, they still
represent the same token.
• Key and Value are then used separately in the subsequent Attention calculation process.
Query, Key, Value
10. Methodology
• These are the fully connected (FC) layers that produce Q (Query), K (Key), and V (Value).
• Each is obtained through a different FC layer. The input to these FC layers is word embedding vectors, and the outputs
are Q, K, and V, respectively.
• If the dimension of word embedding is d_embed, then the input shape is n×d_embed, and the output shape is n×d_k.
• As each FC layer has a different weight matrix (d_embed×d_k), even though the shapes of the outputs are the same,
the actual values of Q, K, and V are all different.
Query, Key, Value
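The three FC layers can be sketched with toy sizes (the d_embed and d_k values are illustrative; n = 11 matches the example sentence):

```python
import torch
import torch.nn as nn

n, d_embed, d_k = 11, 16, 8       # toy sizes; the sentence has 11 tokens
x = torch.randn(n, d_embed)       # word embeddings, shape (n, d_embed)

# Three separate FC layers, each with its own (d_embed x d_k) weight matrix
w_q = nn.Linear(d_embed, d_k)
w_k = nn.Linear(d_embed, d_k)
w_v = nn.Linear(d_embed, d_k)

# Same output shape (n, d_k), but different values, because the weights differ
Q, K, V = w_q(x), w_k(x), w_v(x)
```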
11. Methodology
• Although Query, Key, and Value are produced by separate FC layers and therefore hold different values, the three
layers share the same output shape, so Q, K, and V are vectors of identical shape.
• The Attention for a Query is calculated with the following formula:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Scaled Dot-Product Attention
12. Methodology
• Q represents the current token, and K and V represent the target tokens for which Attention is to be computed.
• Consider calculating the Attention between 'it' and 'animal' in the sentence 'The animal didn’t cross the street,
because it was too tired.' If d_k = 3, the shapes are as follows.
Scaled Dot-Product Attention
• When these are multiplied (precisely, after transposing K and then multiplying, i.e., the inner product of the two
vectors), the result will be some scalar value.
• This value is called the Attention Score. It is then scaled, by dividing by the square root of d_k, to keep it from
becoming too large.
• Scaling is needed because very large scores push the softmax into regions where its gradients become vanishingly small.
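The 1:1 score with d_k = 3 can be sketched as follows (the vector values are made up for illustration):

```python
import math
import torch

d_k = 3
q = torch.tensor([1.0, 0.0, 2.0])   # Query for 'it' (illustrative values)
k = torch.tensor([0.5, 1.0, 1.0])   # Key for 'animal' (illustrative values)

score = q @ k                        # inner product -> a scalar Attention Score
# 1*0.5 + 0*1.0 + 2*1.0 = 2.5
scaled = score / math.sqrt(d_k)      # divide by sqrt(d_k) to keep it moderate
```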
13. Methodology
• Having calculated 1:1 Attention, let's expand it to 1:N Attention.
• When Attention is calculated for one Q, the K and V computations are repeated over the n tokens of the sentence.
• Thus, for a single Q vector, K and V each become n vectors.
Scaled Dot-Product Attention
14. Methodology
• The result is a single vector with the same dimension (d_k) as the original Q, K, and V.
• Although only one Q vector is received as input, the final output of the operation has the same shape as that input;
in other words, Self-Attention preserves the shape of its input.
Scaled Dot-Product Attention
15. Methodology
• This pertains to the Attention for a single token, 'it'.
• Expanded to a matrix over all tokens, it looks like the following:
Scaled Dot-Product Attention
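In matrix form, the whole computation can be sketched as one function (a generic implementation of the paper's formula, not the slides' code):

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n, n) score matrix
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V                                 # (n, d_k), same shape as Q

n, d_k = 11, 8                                         # toy sizes
Q, K, V = (torch.randn(n, d_k) for _ in range(3))
out = attention(Q, K, V)                               # one vector per token
```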
16. Methodology
• In the sentence 'The animal didn’t cross the street, because it was too tired,' if we
tokenize the sentence into words, the total number of tokens will be 11.
• If the embedding dimension of a token is d_embed, then the embedding matrix of
the entire sentence will be (11×d_embed).
• During model training, processing is not done sentence by sentence but in mini-
batches of multiple sentences.
• However, if the lengths of each sentence differ, it is not possible to form a batch.
• If we assume the sequence length (seq_len) is 20, then there would be 9 empty
tokens in the above sentence.
• However, attention should not be assigned to these empty pad tokens.
• Pad masking is the process of ensuring that no attention is assigned to them.
Pad Masking
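A sketch of pad masking with the numbers from the slide (11 real tokens, seq_len = 20; the token ids are made up): scores toward pad positions are set to -inf before the softmax, so their attention weights come out as zero.

```python
import torch

seq_len, pad_id = 20, 0
# 11 real token ids followed by 9 pad tokens (ids are illustrative)
tokens = torch.tensor([[3, 7, 2, 9, 4, 8, 5, 6, 2, 3, 4] + [pad_id] * 9])

pad_mask = (tokens != pad_id)                 # True where a real token sits
scores = torch.randn(1, seq_len, seq_len)     # raw attention scores
# -inf at pad key positions -> softmax assigns them zero weight
scores = scores.masked_fill(~pad_mask.unsqueeze(1), float('-inf'))
weights = torch.softmax(scores, dim=-1)       # pad columns are exactly 0
```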