Min-Seo Kim
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: kms39273@naver.com
1
Background
• The Transformer is a model that takes an input sentence and
generates an output sentence.
• The Transformer is broadly divided into two parts: the Encoder and
the Decoder.
Model of Transformer
2
Background
• The Encoder is a function that takes a sentence as input and generates a
vector.
• The vector created through Encoding is referred to as the context, which,
as the name implies, is a vector that encapsulates the 'context' of the
sentence.
• The Encoder is trained with the goal of properly creating this context
(compressing the information in the sentence without omitting any
details).
Model of Transformer-Encoder
3
Background
• The Decoder is the opposite of the Encoder. It takes the context as input
and generates a sentence as output.
• The Decoder receives not only the context as input but also a right-shifted
version of the output sentence it is generating.
• For now, it is enough to think of this as the Decoder receiving an additional
sentence as input; a small sketch of the shift appears below.
Model of Transformer
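As a rough illustration of what "shifted right" means for the decoder input, here is a toy example with assumed <sos>/<eos> markers (not taken from the slides):

```python
# Toy illustration of the decoder's "shifted right" input (assumed start/end tokens).
# The decoder predicts token t while only seeing tokens before t of the target sentence.
target        = ["I", "am", "a", "student", "<eos>"]
decoder_input = ["<sos>", "I", "am", "a", "student"]  # target shifted right by one

for inp, out in zip(decoder_input, target):
    print(f"given '{inp}' (and the context), predict '{out}'")
```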
4
Previous work
• In a Recurrent Network, computing the hidden state h_i at time i requires h_(i−1). Because h_0, h_1, ..., h_n
must be computed sequentially from the beginning, the computation cannot be parallelized across time steps.
• In Self-Attention, by contrast, for a sentence of n tokens the n×n pairwise relationships between all tokens are
computed directly.
• Since each pair of tokens is related directly, without passing through intermediate tokens, relationships can be
captured more clearly than in Recurrent Networks (see the sketch below).
RNN vs Self-Attention
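A minimal sketch of the contrast, using toy tensors. The weight names (W_x, W_h) are illustrative, and the self-attention part uses the embeddings directly as Q, K, and V for simplicity rather than learned projections:

```python
import torch

n, d = 5, 8                      # sentence length, hidden size (toy values)
x = torch.randn(n, d)            # token embeddings

# Recurrent network: h_i depends on h_(i-1), so the loop is inherently sequential.
W_x, W_h = torch.randn(d, d), torch.randn(d, d)
h = torch.zeros(d)
hidden = []
for i in range(n):
    h = torch.tanh(x[i] @ W_x + h @ W_h)
    hidden.append(h)

# Self-Attention: all n x n token-to-token scores come from one matrix product,
# so every pair is related directly and the whole computation runs in parallel.
scores  = x @ x.t() / d ** 0.5   # shape (n, n)
weights = torch.softmax(scores, dim=-1)
out = weights @ x                # shape (n, d)
```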
5
Model
• We implement a simple Transformer model using PyTorch.
• Assuming that the encoder and decoder are already built, we receive them as
arguments in the class constructor; a possible skeleton is sketched below.
Model of Transformer
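One possible skeleton matching this description; the argument names (src, tgt) and the decoder call signature are assumptions, not the slides' actual code:

```python
import torch.nn as nn

class Transformer(nn.Module):
    """Skeleton only: the encoder and decoder are assumed to be built elsewhere
    and are simply passed into the constructor."""

    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def encode(self, src):
        return self.encoder(src)            # source sentence -> context

    def decode(self, context, tgt):
        return self.decoder(tgt, context)   # context + shifted target -> output (assumed signature)

    def forward(self, src, tgt):
        context = self.encode(src)
        return self.decode(context, tgt)
```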
6
Model
• The Encoder consists of N stacked Encoder Blocks. In the paper, N=6
is used.
• When N Encoder Blocks are stacked to form the Encoder, the input to
the first Encoder Block is the sentence embedding that enters as the
input of the entire Encoder.
• Once the first block generates an output, this is used as the input for
the second block, and so on. The output of the last, Nth block
becomes the output of the entire Encoder, that is, the context.
Encoder
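A sketch of the Encoder as N stacked blocks, under the assumption that a single encoder block module is passed in and copied:

```python
import copy
import torch.nn as nn

class Encoder(nn.Module):
    """N stacked Encoder Blocks: the output of block i is the input of block i+1,
    and the output of the last block is the context."""

    def __init__(self, encoder_block, n_layer=6):   # the paper uses N = 6
        super().__init__()
        self.layers = nn.ModuleList(
            [copy.deepcopy(encoder_block) for _ in range(n_layer)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)        # the first block receives the sentence embedding
        return x                # context
```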
7
Model
• The Encoder is composed of N stacked Encoder Blocks. In the paper,
N=6 is used.
• When N Encoder Blocks are stacked to form the Encoder, the input to
the first Encoder Block becomes the sentence embedding that is the
input to the entire Encoder.
• Once the first block produces an output, it is used as the input for
the second block, and this process continues.
• The output of the last, Nth block becomes the output of the entire
Encoder, which is the context.
Encoder Block
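For reference, in the original paper each Encoder Block consists of multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection with layer normalization. A sketch under those assumptions (the exact norm placement and sub-module signatures here are illustrative):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Encoder Block: self-attention + position-wise feed-forward,
    each with a residual connection and layer normalization."""

    def __init__(self, self_attention, feed_forward, d_model):
        super().__init__()
        self.self_attention = self_attention
        self.feed_forward = feed_forward
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.self_attention(x, x, x, mask))  # residual 1
        x = self.norm2(x + self.feed_forward(x))                 # residual 2
        return x
```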
8
Methodology
• The input token embedding vector is placed into a fully connected layer to generate three vectors.
• Query: Represents the current token.
• Key: Represents the target token for which attention is being calculated.
• Value: Also represents the target token for which attention is being calculated (same as the Key token).
• For example, in the sentence 'The animal didn’t cross the street, because it was too tired,' when trying to
determine what 'it' refers to, the Query is fixed as 'it', and Key and Value are exactly the same token,
representing any one of all the tokens from the beginning to the end of the sentence.
• If Key and Value point to 'The', it means calculating the attention between 'it' and 'The'; if they point to the last
'tired', it means calculating the attention between 'it' and 'tired'.
• To find the token that matches the Query best (the one with the highest Attention), Key and Value are explored
from the beginning to the end of the sentence.
• The actual values of Key and Value are different due to the applied weights, but semantically, they still
represent the same token.
• Key and Value are then used separately in the subsequent Attention calculation process.
Query, Key, Value
9
Methodology
• Q (Query), K (Key), and V (Value) are each produced by a separate fully connected (FC) layer.
• The input to these FC layers is the word embedding vectors, and the outputs are Q, K, and V, respectively.
• If the word embedding dimension is d_embed, the input shape is n×d_embed and the output shape is n×d_k.
• Because each FC layer has a different weight matrix (d_embed×d_k), the outputs have the same shape but the actual
values of Q, K, and V all differ (see the sketch below).
Query, Key, Value
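A small sketch of the three projections; the dimensions used here (d_embed=512, d_k=64) are only example values:

```python
import torch
import torch.nn as nn

n, d_embed, d_k = 11, 512, 64          # sentence length and dimensions (example values)
x = torch.randn(n, d_embed)            # word embeddings of one sentence, shape (n, d_embed)

# Three separate FC layers, each with its own d_embed -> d_k weight matrix.
w_q = nn.Linear(d_embed, d_k)
w_k = nn.Linear(d_embed, d_k)
w_v = nn.Linear(d_embed, d_k)

Q, K, V = w_q(x), w_k(x), w_v(x)       # each has shape (n, d_k): same shape, different values
```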
10
Methodology
• Because the three FC layers have the same output shape, the Query, Key, and
Value vectors obtained from them have identical shapes even though their
values differ.
• The Attention for a Query is calculated using the following formula:
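$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$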
Scaled Dot-Product Attention
11
Methodology
• Q represents the current token, and K and V represent the target tokens for which Attention is to be computed.
• Let's consider calculating the Attention between 'it' and 'animal' in the sentence 'The animal didn’t cross the street,
because it was too tired.' If d_k=3, the shapes would be as follows.
Scaled Dot-Product Attention
• When these are multiplied (precisely, K is transposed first, so this is the inner product of the two vectors), the
result is a single scalar value.
• This value is called the Attention Score. It is then scaled by dividing it by the square root of d_k to keep it from
becoming too large.
• This scaling is needed because very large scores push the softmax into regions with extremely small gradients,
causing gradient vanishing (a toy calculation follows below).
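A toy 1:1 score calculation with made-up numbers, assuming d_k=3 as above:

```python
import torch

d_k = 3
q_it     = torch.tensor([1.0, 0.5, -0.2])   # Query for 'it'   (toy values)
k_animal = torch.tensor([0.9, 0.4,  0.1])   # Key for 'animal' (toy values)

score  = q_it @ k_animal                    # dot product -> a single scalar (the Attention Score)
scaled = score / d_k ** 0.5                 # divide by sqrt(d_k) to keep the score small
```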
12
Methodology
• Having calculated 1:1 Attention, let's expand this to 1:N Attention.
• For a single Q, the Attention operation is repeated against every token in the sentence, so for a sentence of length
n, K and V each consist of n vectors (see the sketch below).
Scaled Dot-Product Attention
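A sketch of 1:N attention for one Query vector against n Keys and Values (random toy tensors):

```python
import torch

n, d_k = 11, 3
q = torch.randn(d_k)                  # one Query vector (e.g. 'it')
K = torch.randn(n, d_k)               # Keys for all n tokens
V = torch.randn(n, d_k)               # Values for all n tokens

scores  = K @ q / d_k ** 0.5          # n attention scores, one per token
weights = torch.softmax(scores, dim=0)
out = weights @ V                     # weighted sum of Values -> shape (d_k,), same as q
```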
13
Methodology
• The result is a single vector with the same dimension (d_k) as the original Q, K, and V vectors.
• Although only one Q vector is received as input, the final output of the operation has the same shape as that input;
in other words, the Self-Attention operation is shape-preserving.
Scaled Dot-Product Attention
14
Methodology
• This covers the Attention for a single token, 'it'.
• When expanded to a matrix over all tokens, it looks like the following (a matrix-form sketch appears below):
Scaled Dot-Product Attention
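The same operation in matrix form, with one Query per token (toy tensors again):

```python
import torch

n, d_k = 11, 3
Q = torch.randn(n, d_k)               # one Query per token
K = torch.randn(n, d_k)
V = torch.randn(n, d_k)

scores  = Q @ K.t() / d_k ** 0.5      # (n, n) score matrix: every token against every token
weights = torch.softmax(scores, dim=-1)
out = weights @ V                     # (n, d_k): same shape as Q
```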
15
Methodology
• In the sentence 'The animal didn’t cross the street, because it was too tired,' if we
tokenize the sentence into words, the total number of tokens will be 11.
• If the embedding dimension of a token is d_embed, then the embedding matrix of
the entire sentence will be (11×d_embed).
• During model training, processing is not done sentence by sentence but in mini-
batches of multiple sentences.
• However, sentences of different lengths cannot be stacked into a single batch, so shorter
sentences are padded to a fixed sequence length.
• If we assume the sequence length (seq_len) is 20, the sentence above would contain 9 empty
pad tokens.
• Attention should not be assigned to these empty pad tokens.
• Pad masking is the process of ensuring that no attention is assigned to such pad tokens
(see the sketch below).
Pad Masking
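A minimal pad-masking sketch for the example above (11 real tokens padded to seq_len=20); the pad id of 0 and the mask shape are assumptions:

```python
import torch

seq_len, pad_idx = 20, 0
# Toy token ids: 11 real tokens followed by 9 pad tokens (assumed pad id = 0).
tokens = torch.tensor([list(range(1, 12)) + [pad_idx] * 9])     # shape (1, 20)

pad_mask = (tokens != pad_idx).unsqueeze(1)                     # (1, 1, 20), False at pad positions

scores  = torch.randn(1, seq_len, seq_len)                      # raw attention scores
scores  = scores.masked_fill(pad_mask == 0, float('-inf'))      # block attention to pad positions
weights = torch.softmax(scores, dim=-1)                         # pad tokens receive ~0 attention
```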

Editor's Notes

  1. I just created the skeleton of the code