Introduction of Transformer
Lab Meeting Material
Yuta Niki
1st-year Master's student
Izumi Lab. UTokyo
This Material’s Objective
◼Transformer and its derived models (e.g., BERT) show high performance!
◼Experiments with those models are necessary in
NLP×Deep Learning research.
◼First Step (in this slide)
• Learn basic knowledge of Attention
• Understand the architecture of Transformer
◼Next Step (in the future)
• Fine-Tuning for Sentiment Analysis, etc.
• Learn BERT, etc.
※Reference materials are collected on the last slide; you should read them.
※This is written in English because an international student came to the Lab.
What is “Transformer”?
◼Paper
• “Attention Is All You Need”[1]
◼Motivation
• Build a model with sufficient representation power for a difficult task (← the translation task in the paper)
• Train a model efficiently in parallel (RNNs cannot be trained in parallel)
◼Methods and Results
• An architecture built on attention mechanisms, without any RNN
• Less time to train
• Achieves a strong BLEU score on the translation task
◼Application
• Use the Encoder, which has acquired strong representation power, for other tasks via fine-tuning.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Transformer’s structure
◼Encoder (Left)
• Stack of 6 layers
• self-attention + feed-forward network (FFN)
◼Decoder (Right)
• Stack of 6 layers
• self-attention + source-target attention + FFN
◼Components
• Positional Encoding
• Multi-Head Attention
• Position-wise Feed-Forward Network
• Residual Connection
◼Regularization
• Residual Dropout
• Label Smoothing
• Attention Dropout
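To make the composition of these components concrete, here is a minimal structural sketch of how one encoder layer and one decoder layer are assembled (this is not the paper's reference implementation; `self_attn`, `src_tgt_attn`, `ffn`, and `add_norm` are assumed placeholder callables for the sub-layers listed above).

```python
# A minimal structural sketch, assuming the following placeholder callables:
#   self_attn(q, k, v)     - multi-head self-attention
#   src_tgt_attn(q, k, v)  - multi-head source-target attention
#   ffn(x)                 - position-wise feed-forward network
#   add_norm(x, sub_out)   - residual connection + LayerNorm around a sub-layer

def encoder_layer(x, self_attn, ffn, add_norm):
    # sub-layer 1: self-attention (Q, K, V all come from x)
    x = add_norm(x, self_attn(x, x, x))
    # sub-layer 2: position-wise feed-forward network
    x = add_norm(x, ffn(x))
    return x

def decoder_layer(y, memory, self_attn, src_tgt_attn, ffn, add_norm):
    # sub-layer 1: (masked) self-attention over the decoder input
    y = add_norm(y, self_attn(y, y, y))
    # sub-layer 2: source-target attention (Q from decoder, K/V from encoder output)
    y = add_norm(y, src_tgt_attn(y, memory, memory))
    # sub-layer 3: position-wise feed-forward network
    y = add_norm(y, ffn(y))
    return y

def encoder(x, layers):
    # the encoder (and likewise the decoder) is a stack of 6 such layers
    for layer in layers:
        x = layer(x)
    return x
```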
Positional Encoding
◼Proposed in “End-To-End Memory Network”[1]
◼Motivation
• Add information about the position of the words in the sentence (← the Transformer contains no RNN or CNN)
$d_{model}$: the dim. of the word embedding

$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$
$PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$

where $pos$ is the position and $i$ is the dimension.
[1] Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
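As a sketch, the positional encoding above can be computed as follows (a minimal NumPy version; the function name and the `max_len` parameter are illustrative, not from the paper):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions: sin
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions: cos
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# The encoding is added to the word embeddings: x = embedding + pe[:seq_len]
```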
Scaled Dot-Product Attention
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

where
$Q \in \mathbb{R}^{n \times d_k}$: query matrix
$K \in \mathbb{R}^{n \times d_k}$: key matrix
$V \in \mathbb{R}^{n \times d_v}$: value matrix
$n$: length of the sentence
$d_k$: dim. of queries and keys
$d_v$: dim. of values
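A minimal NumPy sketch of the formula above (names are illustrative; real implementations also handle batching and masking):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k), V: (n, d_v); returns (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n) compatibility scores
    weights = softmax(scores, axis=-1)        # attention weights per query
    return weights @ V

# toy example: n = 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)   # (4, 8)
```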
2 Types of Attention
• Additive Attention [1]
  $\mathrm{Att}(H) = \mathrm{softmax}(WH + b)$
• Dot-Product Attention [2,3]
  $\mathrm{Att}(Q, K, V) = \mathrm{softmax}(QK^{T})\,V$
[1] Bahdanau, Dzmitry, et al. “Neural Machine Translation by Jointly Learning to Align and Translate.” ICLR, 2015.
[2] Miller, Alexander, et al. “Key-Value Memory Networks for Directly Reading Documents.” EMNLP, 2016.
[3] Daniluk, Michal, et al. “Frustratingly Short Attention Spans in Neural Language Modeling.” ICLR, 2017.
In the Transformer, Dot-Product Attention is used.
Why Use Scaled Dot-Product Attention?
◼Dot-Product Attention is faster and more
efficient than Additive Attention.
• Additive Attention uses a feed-forward network as the
compatibility function.
• Dot-Product Attention can be implemented using highly
optimized matrix multiplication code.
◼Use the scaling term $\frac{1}{\sqrt{d_k}}$ to keep Dot-Product Attention performing well for large $d_k$
• Without scaling, Additive Attention outperforms Dot-Product Attention for larger values of $d_k$ [1]
[1] Britz, Denny, et al. “Massive Exploration of Neural Machine Translation Architectures." EMNLP, 2017.
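A small numerical illustration (not from the slides) of why the scaling helps: for large $d_k$ the unscaled dot products have large magnitude, so the softmax saturates toward a near one-hot distribution, whereas the scaled version stays much flatter.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)            # one query vector
K = rng.normal(size=(10, d_k))      # 10 key vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

unscaled = softmax(K @ q)                  # peaked, close to one-hot
scaled = softmax(K @ q / np.sqrt(d_k))     # noticeably flatter distribution
print(unscaled.max(), scaled.max())
```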
Source-Target or Self Attention
◼2 types of Dot-Product Attention
• Source-Target Attention
➢Used in the 2nd Multi-Head Attention Layer of Transformer
Decoder Layer
• Self-Attention
➢Used in the Multi-Head Attention Layer of Transformer
Encoder Layer and the 1st one of Transformer Decoder Layer
◼What is the difference?
• It depends on where the query comes from.
➢Query from the Encoder → Self-Attention
➢Query from the Decoder → Source-Target Attention
[Figure: $K$ and $V$ come from the Encoder; a query from the Encoder → Self-Attention, a query from the Decoder → Source-Target Attention]
Multi-Head Attention
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \ldots, head_h)\,W^{O}$
where $head_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$

where $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$,
$W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$ and $W^{O} \in \mathbb{R}^{h d_v \times d_{model}}$.

$h$: # of parallel attention layers (heads)
$d_k = d_v = d_{model}/h$

⇒ Attention with Dropout
$\mathrm{Attention}(Q, K, V) = \mathrm{dropout}\left(\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)\right)V$
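Putting the formulas together, here is a minimal single-sentence NumPy sketch of Multi-Head Attention (random placeholder weights; no batching, masking, or dropout; names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention for one head
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, h):
    """Wq, Wk, Wv: lists of h projection matrices; Wo: (h*d_v, d_model)."""
    heads = [attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo    # (n, d_model)

# toy example: d_model = 64, h = 8, so d_k = d_v = 8
n, d_model, h = 5, 64, 8
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_model))                 # self-attention: Q = K = V = x
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wo = rng.normal(size=(h * d_v, d_model))
out = multi_head_attention(x, x, x, Wq, Wk, Wv, Wo, h)   # (5, 64)
```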
Why Multi-Head Attention?
Experiments (Table 3 (a) in the paper) show that the multi-head attention model outperforms single-head attention.
“Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions.”[1]
Multi-Head Attention can be seen as an ensemble of attention heads.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
What Multi-Head Attention Learns
◼Learns the importance of relationships between words regardless of their distance
• In the figure below, the relationship between “making” and “difficult” is strong in many attention heads.
Cited from http://deeplearning.hatenablog.com/entry/transformer
FFN and Residual Connection
◼Position-wise Feed-Forward Network
$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)\,W_2 + b_2$
where
$d_{ff} (= 2048)$: dim. of the inner layer
◼Residual Connection
$\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$
⇒ Residual Dropout
$\mathrm{LayerNorm}(x + \mathrm{Dropout}(\mathrm{Sublayer}(x), drop\_rate))$
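A minimal NumPy sketch of the position-wise FFN and the residual connection with LayerNorm described above (dropout and the learnable LayerNorm gain/bias are omitted; names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize over the feature dimension of each position
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, W1, b1, W2, b2):
    # applied to each position independently: d_model -> d_ff -> d_model
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU(x W1 + b1) W2 + b2

def sublayer(x, f):
    # residual connection around any sub-layer f, followed by LayerNorm
    return layer_norm(x + f(x))

# toy example: d_model = 512, d_ff = 2048
n, d_model, d_ff = 5, 512, 2048
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
out = sublayer(x, lambda x: ffn(x, W1, b1, W2, b2))   # (5, 512)
```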
Many Thanks to Great Predecessors
◼Summary blogs helped my understanding m(_ _)m
• 論文解説 Attention Is All You Need (Transformer)
➢Commentary including background knowledge necessary for
full understanding
• 論文読み "Attention Is All You Need"
➢Helps in understanding the flow of data in the Transformer
• The Annotated Transformer(harvardnlp)
➢PyTorch implementation and corresponding parts of the paper
are explained simply.
• 作って理解する Transformer / Attention
➢I could not understand from the paper alone how to compute $Q$, $K$, and $V$ in Dot-Product Attention; this page shows one solution.