Jin-Woo Jeong
Network Science Lab
Dept. of Mathematics
The Catholic University of Korea
E-mail: zeus0208b@gmail.com
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
οƒ˜ Introduction
οƒ˜ Model Architecture
β€’ Encoder and Decoder Stacks
β€’ Attention
β€’ Scaled Dot-Product Attention
β€’ Multi-Head Attention
β€’ Applications of Attention in our Model
β€’ Position-wise Feed-Forward Networks
β€’ Embeddings and Softmax
β€’ Positional Encoding
οƒ˜ Training
β€’ Training Data
β€’ Optimizer
β€’ Regularization
οƒ˜ Results
οƒ˜ Conclusion
οƒ˜ Q/A
Introduction
οƒ˜ Until this paper appeared, RNN-based encoder-decoder architectures were the state of the art for sequence modeling and transduction tasks such as language modeling and machine translation. However, RNN-based models are inherently sequential: computation cannot be parallelized within a training example, and memory constraints severely limit batching across examples, which becomes especially costly as sequence lengths grow.
οƒ˜ Attention mechanisms have become an essential component of strong sequence modeling and transduction models across many tasks, since they allow dependencies to be modeled regardless of their distance in the input or output sequence. Until now, however, attention has mostly been used in conjunction with RNNs.
οƒ˜ This paper presents the Transformer, a model that dispenses with recurrence entirely and relies solely on attention to draw dependencies between input and output. The Transformer is explicitly designed to allow much more parallelization and sets a new state of the art in translation quality.
Model Architecture
οƒ˜ Like other competitive neural sequence transduction models, the Transformer uses an encoder-decoder architecture. The encoder maps an input sequence of symbol representations x = (x₁, …, xβ‚™) to a sequence of continuous representations z = (z₁, …, zβ‚™). Given z, the decoder generates an output sequence y = (y₁, …, yβ‚˜) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
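A minimal sketch of this auto-regressive generation loop is shown below; encode and decode_step are hypothetical callables standing in for the encoder and decoder, and greedy selection is used purely for illustration.

```python
def greedy_decode(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    """Toy auto-regressive decoding loop (encode/decode_step are hypothetical callables)."""
    z = encode(src_tokens)              # continuous representations z1..zn
    ys = [bos_id]                       # start-of-sequence symbol
    for _ in range(max_len):
        next_id = decode_step(z, ys)    # most probable next symbol given z and y1..yt
        ys.append(next_id)
        if next_id == eos_id:           # stop once end-of-sequence is produced
            break
    return ys
```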
Model Architecture
Encoder and Decoder Stacks
οƒ˜ Encoder :
οƒ˜ The encoder is a stack of N = 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization.
οƒ˜ Decoder :
οƒ˜ The decoder is also a stack of N = 6 identical layers. In addition to the two sub-layers of the encoder, each decoder layer inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection is applied around each sub-layer, followed by layer normalization. The self-attention sub-layer in the decoder is masked so that a position cannot attend to subsequent positions when predicting the next word, which prevents information from future tokens from leaking in during training.
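A minimal NumPy sketch of the residual-plus-layer-norm wrapper applied around every sub-layer, i.e. LayerNorm(x + Sublayer(x)); the learned gain and bias of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the feature (last) dimension; learned gain/bias omitted for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer_fn):
    """LayerNorm(x + Sublayer(x)), the wrapper used around attention and feed-forward blocks."""
    return layer_norm(x + sublayer_fn(x))
```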
Attention
οƒ˜ An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is determined by the compatibility of the query with the corresponding key.
Scaled Dot-Product Attention
οƒ˜ Queries and keys have dimension d_k, and values have dimension d_v. The attention function proceeds as follows: first, compute the dot products of the query with all keys and divide each by √d_k; next, apply a softmax to obtain weights reflecting the similarity between the query and each key; finally, take the weighted sum of the values to produce the output. Each row of the 𝑄, 𝐾, 𝑉 matrices holds the query, key, and value vector for one token, so the whole computation, known as scaled dot-product attention, is carried out with matrix operations. The scaling by 1/√d_k prevents the dot products from growing so large that the softmax is pushed into regions with extremely small gradients, which would otherwise aggravate the vanishing gradient problem.
π΄π‘‘π‘‘π‘’π‘›π‘‘π‘–π‘œπ‘› 𝑄, 𝐾, 𝑉 = π‘ π‘œπ‘“π‘‘π‘šπ‘Žπ‘₯
𝑄𝐾𝑇
π‘‘π‘˜
𝑉
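A minimal NumPy sketch of scaled dot-product attention under assumed batch-first shapes (Q: (batch, len_q, d_k), K: (batch, len_k, d_k), V: (batch, len_k, d_v)); the optional boolean mask marks positions that may be attended to.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q Kᵀ / sqrt(d_k)) V, with optional masking of disallowed positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (batch, len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)               # blocked positions get a large negative score
    scores -= scores.max(axis=-1, keepdims=True)            # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                             # output and attention weights
```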
Attention
Multi-Head Attention
οƒ˜ Instead of a single attention function, the model uses h = 8 heads: queries, keys, and values are each linearly projected h times with different learned projections, attention is computed in parallel for every head, and the head outputs are concatenated and projected once more. This was found to perform better than a single attention head.
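A toy NumPy sketch of the multi-head computation, reusing the scaled_dot_product_attention function above; the projection matrices are random placeholders here, whereas in the real model they are learned parameters.

```python
import numpy as np

def multi_head_attention(X_q, X_kv, num_heads=8, d_model=512, seed=0):
    """Project into num_heads subspaces, attend per head, concatenate, project back."""
    rng = np.random.default_rng(seed)
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        W_q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)   # placeholder for learned W_i^Q
        W_k = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)   # placeholder for learned W_i^K
        W_v = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)   # placeholder for learned W_i^V
        head_out, _ = scaled_dot_product_attention(X_q @ W_q, X_kv @ W_k, X_kv @ W_v)
        heads.append(head_out)
    W_o = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)   # placeholder for learned W^O
    return np.concatenate(heads, axis=-1) @ W_o
```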
Attention
Applications of Attention in Model
οƒ˜ In this paper, multi-head attention is employed in three different ways:
1. In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, while the keys and values come from the output of the encoder stack.
2. The encoder contains self-attention layers, where the queries, keys, and values all come from the same place: the output of the previous encoder layer.
3. The decoder also contains self-attention layers, but with masking applied. During training, teacher forcing feeds the ground-truth target sequence to the decoder, whereas at inference time the output of the previous step is used as input. If the decoder could freely attend over the full target sequence, it would be able to reference future words when making a prediction, which defeats the purpose. Masking therefore sets the attention scores of all illegal (future) positions to a very large negative value just before the softmax, so their weights become effectively zero.
(Figure: 1. encoder-decoder attention, 2. self-attention, 3. masking)
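A small sketch of the causal (look-ahead) mask that can be passed to the scaled_dot_product_attention function above; True marks the key positions each query position is allowed to attend to.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# For a 4-token decoder input the mask looks like:
# [[ True False False False]
#  [ True  True False False]
#  [ True  True  True False]
#  [ True  True  True  True]]
print(causal_mask(4))
```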
Position-wise Feed-Forward Networks
οƒ˜ Every layer of both the encoder and the decoder contains a position-wise feed-forward network of the same form, applied to each position separately and identically; the parameters differ from layer to layer. It consists of two linear transformations with a ReLU activation in between:
FFN(x) = max(0, xW₁ + b₁) Wβ‚‚ + bβ‚‚
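A minimal NumPy sketch of the position-wise feed-forward network; in the paper the model dimension is d_model = 512 and the inner dimension is d_ff = 2048, so W₁ is (512, 2048) and Wβ‚‚ is (2048, 512).

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Example shapes: x is (batch, seq_len, 512), W1 is (512, 2048), W2 is (2048, 512).
```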
Positional Encoding
οƒ˜ Since the model contains no recurrence or convolution, the attention mechanism by itself carries no information about the position of each token; the dot products are invariant to token order. The paper therefore adds a positional encoding to the input embeddings to inject positional information, defined as follows.
𝑃𝐸(π‘π‘œπ‘ ,2𝑖) = sin(π‘π‘œπ‘ /100002𝑖/π‘‘π‘šπ‘œπ‘‘π‘’π‘™)
𝑃𝐸(π‘π‘œπ‘ ,2𝑖+1) = co𝑠(π‘π‘œπ‘ /100002𝑖/π‘‘π‘šπ‘œπ‘‘π‘’π‘™)
Training
οƒ˜ Dataset: WMT 2014 English-German dataset / WMT 2014 English-French dataset
οƒ˜ Optimizer: Adam with β₁ = 0.9, Ξ²β‚‚ = 0.98 and Ξ΅ = 10⁻⁹. Instead of a fixed learning rate, they vary the learning rate over training according to the schedule below (often called the Noam schedule).
lπ‘Ÿπ‘Žπ‘‘π‘’ = π‘‘π‘šπ‘œπ‘‘π‘’π‘™
βˆ’0.5
βˆ™ min(π‘ π‘‘π‘’π‘π‘›π‘’π‘š
βˆ’0.5, 𝑠𝑑𝑒𝑝_π‘›π‘’π‘š βˆ™ π‘€π‘Žπ‘Ÿπ‘šπ‘’π‘_π‘ π‘‘π‘’π‘π‘ βˆ’1.5)
οƒ˜ In this paper they used warmup_steps = 4000.
οƒ˜ The learning rate increases linearly for the first warmup_steps training steps (warmup) and afterwards decreases proportionally to the inverse square root of the step number (decay).
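A small sketch of this learning-rate schedule as a standalone function, using the paper's d_model = 512 and warmup_steps = 4000 as defaults.

```python
def noam_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)."""
    step_num = max(step_num, 1)   # guard against step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# Peaks around step 4000 (~7e-4 with these defaults), then decays as step_num^-0.5.
```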
Results
οƒ˜ The table on this slide shows that the Transformer achieves better BLEU scores than previous state-of-the-art models on the WMT 2014 translation tasks, at a fraction of their training cost.
Results
οƒ˜ (A): Varying the number of heads h while keeping h × d_k = 512 fixed.
οƒ˜ (B): Reducing only the attention key size d_k.
οƒ˜ (C): Increasing the number of parameters improves performance.
οƒ˜ (D): Dropout helps prevent overfitting; the label smoothing value is also varied.
οƒ˜ (E): Replacing the sinusoidal positional encoding with learned positional embeddings (yielding nearly identical results).
Conclusion
οƒ˜ This study introduces the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-head self-attention.
οƒ˜ On translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers, and it achieved state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation tasks.
οƒ˜ They are optimistic about the future of attention-based models and plan to apply them to other tasks. They aim
to extend the Transformer to handle problems with inputs and outputs beyond text, such as images, audio,
and video, efficiently processing large inputs and outputs using local, restricted attention mechanisms. One of
their other research goals is to make the generation process less sequential.
Q & A