Photo CC BY-SA 2.0, Gabriel Synnaeve
Google Research
Brain Team, Zürich
Transformers
Lucas Beyer lbeyer@google.com
The classic landscape:
One architecture per
"community"
Computer Vision
Convolutional NNs (+ResNets)
Natural Lang. Proc.
Recurrent NNs (+LSTMs)
Translation
Seq2Seq
Speech
Deep Belief Nets (+non-DL)
[Figure: one architecture per community — a CNN for vision, an LSTM cell, a seq2seq RNN translating "h e l l o" into "s a l u t", a stacked DBN (GRBM/RBM layers) for speech, and BC/GAIL for RL.]
[1] CNN image CC-BY-SA by Aphex34 for Wikipedia https://commons.wikimedia.org/wiki/File:Typical_cnn.png
[2] RNN image CC-BY-SA by GChe for Wikipedia https://commons.wikimedia.org/wiki/File:The_LSTM_Cell.svg
The Transformer's takeover:
One community at a time
Computer Vision Natural Lang. Proc.
Translation
Speech
Reinf. Learning
Graphs/Science
Transformer image source: "Attention Is All You Need" paper
The origins:
Translation, learned alignment
Neural Machine Translation by Jointly Learning to Align and Translate
2014, Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio
Attention Is All You Need
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Attention is a function similar to a "soft" key-value dictionary lookup, with a query q, keys k, and values v.
Historical side-note: "non-local NNs" in computer vision and "relational NNs" in RL appeared almost at the same time and contain the same core idea!
1. Attention weights a_1:N are query-key similarities: â_i = q · k_i
Normalized via softmax: a_i = e^(â_i) / ∑_j e^(â_j)
2. Output z is the attention-weighted average of the values v_1:N:
z = ∑_i a_i v_i = a · v
3. Usually, k and v are derived from the same input x: k = W_k · x, v = W_v · x
The query q can come from a separate input y: q = W_q · y
Or from the same input x! Then we call it "self attention": q = W_q · x
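To make the lookup concrete, here is a minimal single-query attention sketch in NumPy under the definitions above (the shapes, random inputs, and helper names are illustrative, not from the slides):

import numpy as np

def softmax(a):
    a = a - a.max()                     # numerical stability
    e = np.exp(a)
    return e / e.sum()

def attention(q_in, x, W_q, W_k, W_v):
    # 'Soft' key-value lookup over the N rows of x.
    q = q_in @ W_q                      # query
    k = x @ W_k                         # keys   k_1:N
    v = x @ W_v                         # values v_1:N
    a_hat = k @ q                       # similarities â_i = q · k_i
    a = softmax(a_hat)                  # normalized attention weights a_1:N
    return a @ v                        # output z = Σ_i a_i v_i

# Self-attention: the query is derived from the same input x (here, its first row).
N, d = 5, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
z = attention(x[0], x, W_q, W_k, W_v)   # shape (d,)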
But that's not quite it! There are a few more details:
Attention Is All You Need
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
1. We usually use many queries q_1:M, not just one.
Stacking them leads to the attention matrix A_1:N,1:M and subsequently to many outputs:
z_1:M = Attn(q_1:M, x) = [Attn(q_1, x) | Attn(q_2, x) | ... | Attn(q_M, x)]
2. We usually use "multi-head" attention: the operation is repeated K times (each head with its own weight matrices W) and the results are concatenated along the feature dimension:
z_i = [ Attn_1(q_i, x) ; Attn_2(q_i, x) ; ... ; Attn_K(q_i, x) ]
3. The most commonly seen formulation:
z = softmax(QKᵀ / √d_key) V
Note that the complexity is O(N²).
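A hedged NumPy sketch of this matrix formulation with K heads; for brevity it assumes Q, K, V are already linearly projected and slices them per head instead of using separate per-head projection matrices:

import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    # z = softmax(Q Kᵀ / sqrt(d_key)) V per head, concatenated along the feature dim.
    M, d = Q.shape
    d_head = d // num_heads
    outputs = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)              # this head's feature slice
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_head))   # (M, N) attention matrix
        outputs.append(A @ V[:, s])                          # (M, d_head)
    return np.concatenate(outputs, axis=-1)                  # (M, d)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(7, 64)), rng.normal(size=(5, 64)), rng.normal(size=(5, 64))
z = multi_head_attention(Q, K, V, num_heads=8)               # shape (7, 64)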
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Input text is first split into pieces; these can be characters, words, or subword "tokens":
"The detective investigated" -> [The_] [detective_] [invest] [igat] [ed_]
Tokens are indices into the "vocabulary":
[The_] [detective_] [invest] [igat] [ed_] -> [3 721 68 1337 42]
Each vocab entry corresponds to a learned d_model-dimensional vector:
[3 721 68 1337 42] -> [ [0.123, -5.234, ...], [...], [...], [...], [...] ]
Input (Tokenization and) Embedding
Positional Encoding
Remember attention is permutation invariant, but language is not!
Need to encode position of each word; just add something.
Think [The_] + 10 [detective_] + 20 [invest] + 30 ... but smarter.
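A minimal sketch of the embedding plus "just add something" positional step, using learned position vectors (the original paper uses fixed sinusoids); the vocabulary size and d_model below are made up:

import numpy as np

vocab_size, max_len, d_model = 32_000, 512, 256
rng = np.random.default_rng(0)
token_emb = rng.normal(scale=0.02, size=(vocab_size, d_model))  # one vector per vocab entry
pos_emb   = rng.normal(scale=0.02, size=(max_len,   d_model))   # one vector per position

def embed(token_ids):
    # [3, 721, 68, 1337, 42] -> (5, d_model) array: token embedding + position embedding.
    ids = np.asarray(token_ids)
    return token_emb[ids] + pos_emb[np.arange(len(ids))]

x = embed([3, 721, 68, 1337, 42])   # the "The detective investigated" example, shape (5, 256)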
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Multi-headed Self-Attention
Meaning the input sequence is used to create
queries, keys, and values!
Each token can "look around" the whole input,
and decide how to update its representation
based on what it sees.
[The_] [detective_] [invest] [igat] [ed_]
MHSA
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Point-wise MLP
Some people like to call it 1x1 convolution.
A simple MLP applied to each token individually:
z_i = W_2 · GeLU(W_1 · x_i + b_1) + b_2
Think of it as each token pondering for itself
about what it has observed previously.
There's some weak evidence this is where
"world knowledge" is stored, too.
It contains the bulk of the parameters. When
people make giant or sparse/MoE models, this
is the part that becomes giant.
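A minimal sketch of the point-wise MLP; the common 4× hidden widening and the tanh GeLU approximation are assumptions, not from the slides:

import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pointwise_mlp(x, W1, b1, W2, b2):
    # Applied to every token independently: z_i = W2 · GeLU(W1 · x_i + b1) + b2
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_hidden = 256, 4 * 256
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(scale=0.02, size=(d_hidden, d_model)), np.zeros(d_model)
z = pointwise_mlp(rng.normal(size=(5, d_model)), W1, b1, W2, b2)  # (5, 256), token-wise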
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Residual connections
Each module's output has the exact
same shape as its input.
Following ResNets, the module
computes a "residual" instead of a
new value:
z_i = Module(x_i) + x_i
This was shown to dramatically
improve trainability.
LayerNorm
Normalization also dramatically improves trainability.
There's post-norm (original) and pre-norm (modern):
post-norm: z_i = LN(Module(x_i) + x_i)        pre-norm: z_i = Module(LN(x_i)) + x_i
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Encoding / Encoder
Since input and output shapes are identical,
we can stack N such blocks.
Typically, N=6 ("base"), N=12 ("large") or more.
Encoder output is a "heavily processed"
(think: "high level, contextualized") version of
the input tokens, i.e. a sequence.
This has nothing to do with the requested
output yet (think: translation). That comes
with the decoder.
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
What we want to model: p(z|x)
for example, in translation: p(z | "the detective investigated") ∀z
Decoding / the Decoder (alternatively Generating / the Generator)
Seems impossible at first, but we can exactly decompose it token by token:
p(z|x) = p(z_1|x) · p(z_2|z_1, x) · p(z_3|z_2, z_1, x) · ...
Meaning, we can generate the answer one token at a time. Each p is a full pass through the model.
For generating p(z_3|z_2, z_1, x): x comes from the encoder; z_1, z_2 is what we have predicted so far and goes into the decoder.
Once we have p(z|x) we still need to actually sample a sentence such
as "le détective a enquêté". Many strategies: greedy, beam-search, ...
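A sketch of greedy decoding under this factorization, assuming a hypothetical model(encoder_output, prefix) callable that returns the next-token distribution; bos_id/eos_id mark sequence start and end:

import numpy as np

def greedy_decode(model, encoder_output, bos_id, eos_id, max_len=50):
    # Generate one token at a time; each step is a full pass through the decoder.
    prefix = [bos_id]
    for _ in range(max_len):
        probs = model(encoder_output, prefix)   # p(z_i | z_1..z_{i-1}, x), shape (vocab,)
        next_id = int(np.argmax(probs))         # greedy; beam search etc. are alternatives
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]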
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
This is regular self-attention as in the encoder, to process what's been
decoded so far, but with a trick...
Masked self-attention
If we had to train on a single p(z_3|z_2, z_1, x) at a time: SLOW!
Instead, train on all p(z_i|z_1:i-1, x) simultaneously.
How? In the attention weights for z_i, set all entries i:N to 0 (in practice, the corresponding logits are masked to −∞ before the softmax).
This way, each token only sees the already generated ones.
At generation time
There is no such trick: we need to generate one z_i at a time. This is why autoregressive decoding is extremely slow.
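A sketch of how the mask is typically realized: future positions get -inf logits before the softmax, so their attention weights come out as exactly 0:

import numpy as np

def causal_self_attention_weights(scores):
    # scores: (N, N) array of query-key similarities â for the decoded sequence.
    # Row i may only attend to positions <= i (the already generated tokens).
    N = scores.shape[0]
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)     # True above the diagonal = future
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)             # rows sum to 1, future weights are 0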
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Each decoded token can "look at" the encoder's output:
Attn(q = W_q · x_dec, k = W_k · x_enc, v = W_v · x_enc)
This is the same as in the 2014 paper.
This is where the |x in p(z_3|z_2, z_1, x) comes from.
"Cross" attention
Because self-attention is so widely used, people have started just
calling it "attention".
Hence, we now often need to explicitly call this "cross attention".
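A sketch of cross-attention along these lines; x_dec, x_enc and the weight names are illustrative:

import numpy as np

def cross_attention(x_dec, x_enc, W_q, W_k, W_v):
    # Each decoder token attends over the encoder output:
    # Attn(q = W_q·x_dec, k = W_k·x_enc, v = W_v·x_enc)
    Q = x_dec @ W_q                       # (M, d) queries from the decoder
    K = x_enc @ W_k                       # (N, d) keys   from the encoder
    V = x_enc @ W_v                       # (N, d) values from the encoder
    A = Q @ K.T / np.sqrt(Q.shape[-1])    # (M, N) decoder-to-encoder similarities
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                          # (M, d)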
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper ; heatmap image source: "Neural Machine Translation by Jointly Learning to Align and Translate" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Feedforward and stack layers.
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - The Transformer architecture
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Output layer
Assume we have already generated K tokens and want to generate the next one.
The decoder has gathered all the information necessary to predict a
probability distribution for the next token, over the whole vocab.
Simple: a linear projection of token K's representation, followed by SoftMax normalization.
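A sketch of that output layer: project the representation of token K to vocabulary logits, then normalize with a softmax:

import numpy as np

def next_token_distribution(decoder_output, W_out, b_out):
    # decoder_output: (K, d_model) states for the K tokens generated so far.
    # Returns p(next token) over the whole vocabulary.
    logits = decoder_output[-1] @ W_out + b_out          # (vocab_size,) from token K
    logits = logits - logits.max()
    e = np.exp(logits)
    return e / e.sum()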
Thanks to Basil Mustafa for slide inspiration
Transformer image source: "Attention Is All You Need" paper
Attention Is All You Need - Summary and results
2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Animation source: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Table source: "Attention Is All You Need" paper
The first (1.5th
) big takeover:
Language Modeling / NLP
Decoder-only
GPT
Encoder-only
BERT
Enc-Dec
T5
[The_] [cat_] [MASK] [on_] [MASK] [mat_]
[*] [*] [sat_] [*] [the_] [*]
[START] [The_] [cat_]
[sat_]
Translate EN-DE: This is good.
Summarize: state authorities dispatched…
Is this toxic: You look beautiful today!
Das ist gut.
A storm in Attala caused 6 victims.
This is not toxic.
Transformer image source: "Attention Is All You Need" paper
The second big takeover:
Computer Vision
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020, A Dosovitskiy, L Beyer, A Kolesnikov, D Weissenborn, X Zhai, T Unterthiner, M Dehghani, M Minderer, G Heigold, S Gelly, J Uszkoreit, N Houlsby
Vision Transformer
(ViT)
Many prior works attempted to introduce
self-attention at the pixel level.
For a 224px² image, that's a sequence length of ~50k: too much!
Thus, most works restrict attention to local pixel
neighborhoods, or use it as a high-level mechanism on
top of detections.
The key breakthrough in using the full
Transformer architecture, standalone, was to
"tokenize" the image by cutting it into
patches of 16px² and treating each patch as a
token, i.e. embedding it into input space.
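A sketch of this patch tokenization: reshape a 224×224×3 image into 196 patches of 16×16×3, flatten each, and embed it linearly (the embedding matrix here is random, for illustration only):

import numpy as np

def patchify(image, patch=16):
    # image: (H, W, C) array -> (num_patches, patch*patch*C) flattened patches.
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C)
    return patches.reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
tokens = patchify(image) @ rng.normal(scale=0.02, size=(16 * 16 * 3, 256))  # (196, 256)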
Side-note: MLP-Mixer
2020, I Tolstikhin, N Houlsby, A Kolesnikov, L Beyer, X Zhai, T Unterthiner, J Yung, A Steiner, D Keysers, J Uszkoreit, M Lucic, A Dosovitskiy
After ViT answered the question "Are convolutions really needed to process images?" with NO...
...we wondered whether self-attention is really needed either.
The role of self-attention is to "mix" information across tokens.
Another simple way to achieve this is to "transpose" the tokens and run them through an MLP (see the sketch below):
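A sketch of that token-mixing step, with made-up sizes (and ReLU instead of Mixer's GeLU, for brevity): the transpose makes the MLP act across tokens, one channel at a time:

import numpy as np

def token_mixing(x, W1, b1, W2, b2):
    # x: (num_tokens, d_model). Mix information across tokens, channel by channel.
    xt = x.T                                  # (d_model, num_tokens): channels as rows
    hidden = np.maximum(xt @ W1 + b1, 0.0)    # MLP applied along the token dimension
    return (hidden @ W2 + b2).T               # back to (num_tokens, d_model)

num_tokens, d_model, d_hidden = 196, 256, 384
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(num_tokens, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(scale=0.02, size=(d_hidden, num_tokens)), np.zeros(num_tokens)
y = token_mixing(rng.normal(size=(num_tokens, d_model)), W1, b1, W2, b2)  # (196, 256)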
The third big takeover:
Speech
Conformer: Convolution-augmented Transformer for Speech Recognition
2020, A Gulati, J Qin, C-C Chiu, N Parmar, Y Zhang, J Yu, W Han, S Wang, Z Zhang, Y Wu, R Pang
Largely the same story as in computer vision.
But with spectrograms instead of images.
[The_] [detective_] [invest] [igat]
Add a third type of block using convolutions, and slightly
reorder blocks, but overall very transformer-like.
Exists as encoder-decoder variant, or as
encoder-only variant with CTC loss.
Transformer image source: "Attention Is All You Need" paper
The fourth big takeover:
Reinforcement Learning
Cast the (supervised/offline) RL problem into a sequence ("language") modeling task:
Decision Transformer: Reinforcement Learning via Sequence Modeling
2021, L Chen, K Lu, A Rajeswaran, K Lee, A Grover, M Laskin, P Abbeel, A Srinivas, I Mordatch
Slide credit: Igor Mordatch
Can generate/decode sequences of actions with a desired return (e.g. expert-level skill)
The trick is prompting: "The following is a trajectory of an expert player: [obs] ..."
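A sketch of that casting, with illustrative field names: offline trajectories are interleaved into (return-to-go, state, action) token streams that a standard sequence model can be trained on:

def to_sequence(trajectory, target_return):
    # trajectory: list of (state, action, reward) tuples from offline RL data.
    # Produces the interleaved (return-to-go, state, action) token stream.
    tokens, to_go = [], target_return
    for state, action, reward in trajectory:
        tokens += [("return_to_go", to_go), ("state", state), ("action", action)]
        to_go -= reward                      # remaining return after taking this action
    return tokens

# At decode time, prompt with a high desired return (and/or an "expert trajectory" prefix)
# and let the model generate the action tokens.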
The Transformer's
Unification of communities
Tokenize different modalities each in their
own way (some kind of "patching"), and send
them all jointly into a Transformer...
Seems to just work...
Currently an explosion of works doing this!
Images from:
[1] LIMoE by B Mustafa, C Riquelme, J Puigcerver, R Jenatton, N Houlsby
[2] MERLOT Reserve by R Zellers, J Lu, X Lu, Y Yu, Y Zhao, M Salehi, A Kusupati, J Hessel, A Farhadi, Y Choi
[3] VATT by H Akbari, L Yuan, R Qian, W-H Chuang, S-F Chang, Y Cui, B Gong
Anything you can tokenize, you can feed to a Transformer
ca 2021 and onwards
[1]
[2]
[3]
A note on
Efficient Transformers
A note on Efficient Transformers
The self-attention operation complexity is O(N²) for sequence length N.
Based on "Efficient Transformers: A Survey" by Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
and "Long Range Arena: A Benchmark for Efficient Transformers" by Y Tay, M Dehghani, S Abnar, Y Shen, D Bahri, P Pham, J Rao, L Yang, S Ruder, D Metzler
We'd like to use large N:
Whole articles or books
Full video movies
High resolution images
Many O(N) approximations to the full self-attention
have been proposed in the past two years.
Unfortunately, none provides a clear improvement: they always trade off speed against quality.
Thanks for your... 🥁
Attention
Transformer image source: "Attention Is All You Need" paper
