Transformers In Vision: From Zero to Hero!
Davide Coccomini & Nicola Messina

Davide Coccomini: PhD Student at the University of Pisa and Research Associate at the Italian National Research Council
Nicola Messina: PostDoc Researcher at the Italian National Research Council
Outline
Transformers: From Language to Vision
• Some history: from RNNs to Transformers
• Transformers' attention and self-attention mechanisms
• The power of the Transformer Encoder
• From text to images: Vision Transformers and Swin Transformers
Vision Transformers: Challenges and Applications
• From images to videos
• Is Attention what we really need?
• Convolutional Neural Networks and Vision Transformers
• Some interesting real-world applications
History
• Text, 2017: Transformers introduced in NLP
• Images, 2020: Vision Transformers
• Videos, 2021: Transformers for video understanding
• Now: the Computer Vision revolution!
A step back: Recurrent Networks (RNNs)
[Figure: an encoder-decoder RNN translating "The cat jumps the wall" into "Il gatto salta il muro". The encoder (E) reads the source tokens one at a time, updating its hidden states h0…h4; the final hidden state is the sentence embedding. Starting from <s>, the decoder (D) combines this embedding with the previously generated token at each step (h5…h9) to emit "Il gatto salta il muro" and finally <end>.]
Problems
1. We forget tokens too far in the past
2. We need to wait for the previous token before computing the next hidden state
[Figure: the same encoder-decoder RNN; the hidden state is overwritten at every step, and decoding is strictly sequential.]
Solving problem 1: "We forget tokens too far in the past"
Solution: add an attention mechanism.
[Figure: all encoder hidden states h0…h4 are kept; at each decoding step they are weighted by attention scores and summed into a context vector that conditions the decoder, instead of relying only on the final hidden state.]
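A minimal sketch of this attention mechanism (PyTorch; function and variable names are illustrative, not the original implementation): the context vector is a softmax-weighted sum of all encoder hidden states.

```python
import torch
import torch.nn.functional as F

def attention_context(decoder_state, encoder_states):
    """Dot-product attention over the encoder hidden states.

    decoder_state:  (hidden_dim,)         current decoder hidden state
    encoder_states: (src_len, hidden_dim) all encoder hidden states h0..hN
    returns: context vector (hidden_dim,) and attention weights (src_len,)
    """
    scores = encoder_states @ decoder_state   # affinity of each encoder state with the decoder state
    weights = F.softmax(scores, dim=0)        # normalized attention weights
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

# toy usage: 5 source tokens, hidden size 8
enc = torch.randn(5, 8)
dec = torch.randn(8)
ctx, w = attention_context(dec, enc)
print(ctx.shape, w.sum())   # torch.Size([8]) tensor(1.)
```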
Solving problem 2: "We need to wait for the previous token before computing the next hidden state"
Solution: throw away the recurrent connections and keep only attention. This is what the 2017 paper "Attention Is All You Need" did, introducing the Transformer.
[Figure: the same encoder-decoder sketch with the recurrent connections removed; only the attention links between the decoder steps and the encoder states remain.]
Transformer's Attention Mechanism
Queries are the target tokens seen "from the point of view" of the source sequence.
[Figure: the target sequence ("Il gatto salta …") is projected by feed-forward networks (FFN) into Queries, while the source sequence ("The cat jumps the …") is projected into Keys and Values. Each query is combined with every key via a dot product, the scores go through normalization and a softmax, and the resulting weights are used to average the values into one output token per query.]
Transformer's Attention Mechanism, from a different perspective
[Figure: the source sequence ("The cat jumps the … wall") defines a lookup table of key-value pairs, with mnemonic keys such as "A4", "N9", "O7", "N2". A target token such as "gatto" produces a query (e.g. "N5") that soft-matches the keys; the corresponding values are combined with a weighted average, so the output "gatto" token is built by aggregating value vectors from the source dictionary.]
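A minimal single-head sketch of this query/key/value attention in PyTorch (class and layer names are illustrative; real Transformers use multiple heads and extra normalization):

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Single-head attention: queries from the target, keys/values from the source."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)   # FFN producing queries (target side)
        self.to_k = nn.Linear(dim, dim)   # FFN producing keys (source side)
        self.to_v = nn.Linear(dim, dim)   # FFN producing values (source side)
        self.scale = dim ** -0.5

    def forward(self, target, source):
        q = self.to_q(target)                             # (T, dim)
        k = self.to_k(source)                             # (S, dim)
        v = self.to_v(source)                             # (S, dim)
        scores = q @ k.transpose(-2, -1) * self.scale     # (T, S) dot products
        weights = scores.softmax(dim=-1)                  # soft-matching over the source
        return weights @ v                                # (T, dim) weighted average of the values

# toy usage: 3 target tokens attend over 5 source tokens, embedding size 16
attn = SimpleAttention(16)
out = attn(torch.randn(3, 16), torch.randn(5, 16))
print(out.shape)  # torch.Size([3, 16])
```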
Cross-Attention and Self-Attention
Self-Attention
• Source = Target
• Keys, Queries and Values obtained from the same sentence
• Captures intra-sequence dependencies (e.g. "I gave my dog Charlie some food": Who? To whom? What?)
Cross-Attention
• Source ≠ Target
• Queries from the Target; Keys and Values from the Source
• Captures inter-sequence dependencies (e.g. "I gave my dog Charlie some food" vs "Ho dato da mangiare al mio cane Charlie")
[Figure: two Multi-Head Attention blocks, one receiving V, K, Q from the same sequence (self-attention) and one receiving Q from the target and K, V from the source (cross-attention).]
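The same module can implement both flavours; a small PyTorch sketch (shapes and names are illustrative) only changes where queries, keys and values come from:

```python
import torch
import torch.nn as nn

dim, heads = 32, 4
mha = nn.MultiheadAttention(dim, heads, batch_first=True)

source = torch.randn(1, 7, dim)   # e.g. the English sentence (7 tokens, already embedded)
target = torch.randn(1, 5, dim)   # e.g. the partial Italian translation (5 tokens)

# Self-attention: queries, keys and values all come from the same sequence
self_out, _ = mha(query=source, key=source, value=source)

# Cross-attention: queries from the target, keys and values from the source
cross_out, _ = mha(query=target, key=source, value=source)

print(self_out.shape, cross_out.shape)  # torch.Size([1, 7, 32]) torch.Size([1, 5, 32])
```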
Full Transformer Architecture
[Figure: the encoder (Nx blocks of Multi-Head Attention, Add & Norm and Feed Forward) reads the input "The cat jumps the wall", with positional encodings added to the embeddings, and produces the memory. The decoder (Mx blocks) processes the partial output "Il gatto" with self-attention, cross-attends to the memory, and a final Linear + Softmax predicts the next token: "Salta" (90%) | "Odia" (9%) | "Perché" (1%).]
Self-attention: input tokens are contextualized with tokens from the same sentence; this is what allows understanding the meaning of the sentence.
Cross-attention: the elements of the memory constitute the dictionary (the lookup table built from the source sequence) used to contextualize every token during the decoding stage.
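For orientation, a hedged sketch of the full encoder-decoder stack using PyTorch's built-in nn.Transformer (hyperparameters are illustrative; token embeddings are faked with random tensors and positional encodings are omitted):

```python
import torch
import torch.nn as nn

# Encoder-decoder Transformer: N encoder layers produce the "memory",
# M decoder layers mix self-attention (on the target) and cross-attention (on the memory).
model = nn.Transformer(
    d_model=64, nhead=4,
    num_encoder_layers=2, num_decoder_layers=2,   # the Nx and Mx of the figure
    batch_first=True,
)

src = torch.randn(1, 5, 64)   # embedded source tokens ("The cat jumps the wall")
tgt = torch.randn(1, 2, 64)   # embedded target prefix ("Il gatto")

# A causal mask keeps each target position from peeking at future tokens
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

out = model(src, tgt, tgt_mask=tgt_mask)          # (1, 2, 64) decoder output
head = nn.Linear(64, 30000)                       # Linear + Softmax head over a toy vocabulary
probs = head(out[:, -1]).softmax(dim=-1)          # e.g. "salta" should get the highest probability
print(out.shape, probs.shape)                     # torch.Size([1, 2, 64]) torch.Size([1, 30000])
```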
Self-Attention
[Figure: every token of "I gave my dog Charlie some food" attends to every other token of the same sentence.]
The attention calculation is O(n²) in the sequence length.
The Power of the Transformer Encoder
• Many achievements have been obtained using only the Encoder
• BERT (Devlin et al., 2018)
[Figure: the input "<CLS> I gave my dog Charlie some food <SEP> He ate it" goes through an embedding layer and positional encoding into a Transformer Encoder (N layers). The output memory feeds the pre-training tasks: Next Sentence Prediction ({0, 1}, from the <CLS> token) and Masked Language Modelling (predicting the masked word «ate»).]
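A minimal encoder-only sketch in the spirit of BERT (all sizes and the token layout are illustrative; a real BERT uses WordPiece tokenization, many more layers and much larger dimensions):

```python
import torch
import torch.nn as nn

vocab, dim, max_len = 1000, 64, 32
embed = nn.Embedding(vocab, dim)
pos = nn.Embedding(max_len, dim)                       # learned positional encoding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
mlm_head = nn.Linear(dim, vocab)                       # predicts the masked token
nsp_head = nn.Linear(dim, 2)                           # next-sentence prediction from <CLS>

tokens = torch.randint(0, vocab, (1, 10))              # "<CLS> I gave my dog [MASK] some food <SEP> ..."
x = embed(tokens) + pos(torch.arange(tokens.size(1)))
memory = encoder(x)                                    # contextualized tokens (the "memory")

nsp_logits = nsp_head(memory[:, 0])                    # <CLS> token -> {0, 1}
mlm_logits = mlm_head(memory[:, 5])                    # masked position -> vocabulary ("ate")
print(memory.shape, nsp_logits.shape, mlm_logits.shape)
```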
Transformers in Computer Vision
Can we use the self-attention mechanism in images?
• The transformer works on a set of tokens: what are the tokens in an image?
• Single pixels cannot be the tokens: a 256px × 256px image has on the order of 62,500 pixels, and pixel-level self-attention would require about 3,906,250,000 pairwise calculations. Impossible!
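The back-of-the-envelope arithmetic behind the slide (using the slide's rounded pixel count):

```python
# Self-attention compares every token with every other token: O(n^2) in the sequence length.
pixels = 62_500                 # the slide's (rounded) pixel count for a ~256x256 image
print(pixels ** 2)              # 3906250000 pairwise comparisons -> impractical at pixel level

patches = (256 // 16) ** 2      # 256 tokens if we use 16x16 patches instead (see ViT below)
print(patches ** 2)             # 65536 comparisons -> easily manageable
```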
Transformers in Computer Vision
• Tokens as the features from an object detector: ROI Pooling over the detected regions produces one token per object.
Vision Transformers (ViTs)
"An image is worth 16x16 words": the 256px image is split into 16px × 16px patches, and each flattened patch is linearly projected into a token.
"An Image is Worth 16x16 Words" | Dosovitskiy et al., 2020
Vision Transformers (ViTs)
[Figure: the patches 1…9 go through a Linear Projection of Flattened Patches; a learnable class token (position 0, marked *) is prepended, positional embeddings are added, and the sequence is processed by the Transformer Encoder. An MLP Head on the class token outputs the CLASS prediction.]
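A compact ViT-style sketch in PyTorch (sizes are illustrative; the patch projection is implemented with a strided convolution, which is equivalent to flattening 16×16 patches and applying a linear layer):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: image -> 16x16 patches -> Transformer Encoder -> CLS -> MLP head."""

    def __init__(self, img=256, patch=16, dim=64, classes=10):
        super().__init__()
        n = (img // patch) ** 2                                     # number of patch tokens
        self.to_patches = nn.Conv2d(3, dim, patch, stride=patch)    # linear projection of flattened patches
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))             # learnable class token (the "0 *")
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))         # learnable positional embeddings
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, classes)                         # MLP head on the class token

    def forward(self, x):                                           # x: (B, 3, 256, 256)
        p = self.to_patches(x).flatten(2).transpose(1, 2)           # (B, 256, dim) patch tokens
        tok = torch.cat([self.cls.expand(x.size(0), -1, -1), p], dim=1) + self.pos
        return self.head(self.encoder(tok)[:, 0])                   # classify from the CLS token

print(TinyViT()(torch.randn(2, 3, 256, 256)).shape)                 # torch.Size([2, 10])
```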
Image Classification on ImageNet
[Figure: benchmark results showing large ViTs overtaking large CNNs on ImageNet classification.]
Swin Transformers
[Figure: the Vision Transformer keeps a single 16x patch resolution throughout, while the Swin Transformer builds a hierarchy of feature maps at 4x, 8x and 16x resolution, from low-level (pixels) to high-level features, using Shifted Window based Self-Attention.]
Swin Transformers
[Figure: architecture overview. The H × W × 3 image goes through Patch Partition (H/4 × W/4 × 48) and a Linear Embedding (H/4 × W/4 × C), then through four stages of Swin Transformer Blocks (×2, ×2, ×6, ×2); Patch Merging between the stages produces H/8 × W/8 × 2C, H/16 × W/16 × 4C and H/32 × W/32 × 8C feature maps.]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | Ze Liu et al., 2021
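A hedged sketch of the Patch Merging step between stages (shapes follow the figure; the exact concatenation order in the official implementation may differ):

```python
import torch
import torch.nn as nn

def patch_merging(x, reduction):
    """Swin-style patch merging: halves H and W, doubles the channel dimension.

    x: (B, H, W, C) map of patch tokens.
    reduction: nn.Linear(4 * C, 2 * C), shared across positions.
    """
    x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                   x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # group each 2x2 neighbourhood: (B, H/2, W/2, 4C)
    return reduction(x)                                           # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)                                    # Stage-1 output: H/4 x W/4 x C
print(patch_merging(x, nn.Linear(4 * 96, 2 * 96)).shape)          # (1, 28, 28, 192): H/8 x W/8 x 2C
```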
Swin Transformer Block
Two cascaded Transformer Encoder blocks:
• Window Self-Attention (W-MSA): Multi-Head Attention with Add & Norm and Feed Forward, computed inside non-overlapping local windows
• Shifted Window Self-Attention (SW-MSA): the same block, applied to windows shifted by half a window size so that information can cross window borders
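A minimal sketch of window partitioning and the half-window shift (PyTorch; window size and shapes are illustrative, and the real SW-MSA also needs an attention mask for the wrapped-around windows):

```python
import torch

def windows(x, w):
    """Split a (B, H, W, C) token map into non-overlapping w x w windows for local self-attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, w * w, C)           # (num_windows * B, w*w, C): attention runs inside each window

x = torch.randn(1, 56, 56, 96)               # Stage-1 token map, window size 7
print(windows(x, 7).shape)                   # (64, 49, 96): 64 windows of 49 tokens each

# SW-MSA: the second block of the pair shifts the map by half a window before partitioning,
# so that information can flow across the window borders used by the first block.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
print(windows(shifted, 7).shape)             # same shape, different window content
```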
What about video?
ViViT
[Figure: the ViViT video transformer architecture.]
TimeSformers
Combine space and time attention with Divided Space-Time Attention: each patch attends over time to the patches at the same spatial position in the neighbouring frames (frame t−δ, frame t, frame t+δ), and over space to the other patches of its own frame.
Is Space-Time Attention All You Need for Video Understanding? | Gedas Bertasius et al., 2021
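A hedged sketch of divided space-time attention (PyTorch; shapes are illustrative, and residuals, norms and the classification token are omitted):

```python
import torch
import torch.nn as nn

# Divided space-time attention: first attend over time (same patch position across frames),
# then over space (patches within the same frame).
B, T, P, dim = 2, 8, 196, 64          # batch, frames, patches per frame, embedding size
x = torch.randn(B, T, P, dim)

time_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
space_attn = nn.MultiheadAttention(dim, 4, batch_first=True)

# Time attention: fold the patch axis into the batch, sequence length = T
xt = x.permute(0, 2, 1, 3).reshape(B * P, T, dim)
xt, _ = time_attn(xt, xt, xt)
x = xt.reshape(B, P, T, dim).permute(0, 2, 1, 3)

# Space attention: fold the time axis into the batch, sequence length = P
xs = x.reshape(B * T, P, dim)
xs, _ = space_attn(xs, xs, xs)
x = xs.reshape(B, T, P, dim)
print(x.shape)                        # torch.Size([2, 8, 196, 64])
```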
TimeSformers
Up to several minutes of video can be analysed at once!
Video Swin Transformers
[Figure: the Video Swin Transformer architecture, which extends shifted-window attention to the temporal dimension.]
Video Swin Transformer | Ze Liu et al., 2022
Action Classification on Kinetics-400
[Figure: leaderboard with transformer-based models such as Video Swin, CoVer, MTV-H, CoCa and EVL among the top entries.]
Can we do without Self-Attention?
What essentially is the attention mechanism? It is "just" a transformation!
[Figure: the Attention Mechanism seen as a black-box transformation applied to the input.]
Fourier Network
[Figure: a Transformer-encoder-style block in which the Attention Calculation is replaced by a Fourier Transformation: Input → Embeddings → Fourier Transformation → Add & Normalize → Feed Forward → Add & Normalize → Dense → Output Prediction.]
Why Fourier? It's just a transformation!
[Figure: illustration of the Fourier Transform; image from mriquestion.com.]
Fourier Network: what does it transform?
[Figure: the input vectors are transformed twice, with one Fourier Transform over the hidden domain and one over the sequence domain.]
FNet: Mixing Tokens with Fourier Transforms | James Lee-Thorp et al.
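A minimal FNet-style mixing sketch (PyTorch; this replaces only the attention sub-layer, everything else in the encoder block stays the same):

```python
import torch
import torch.nn as nn

class FourierMixing(nn.Module):
    """FNet-style mixing sketch: self-attention is replaced by a Fourier transform applied
    over the sequence (token) dimension and the hidden dimension; only the real part is kept."""

    def forward(self, x):                       # x: (batch, seq_len, hidden)
        return torch.fft.fft2(x, dim=(-2, -1)).real

x = torch.randn(2, 10, 64)
print(FourierMixing()(x).shape)                 # torch.Size([2, 10, 64]); the mixing has no learned parameters
# In the full block this output goes through Add & Normalize and a feed-forward layer,
# exactly like the attention output in a standard Transformer encoder block.
```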
MLP Mixer
[Figure: the image patches 1…9 go through a Per-patch Fully Connected layer, then N Mixer Layers, Global Average Pooling and a Fully-Connected head that outputs the CLASS.]
MLP-Mixer: An all-MLP Architecture for Vision | Ilya Tolstikhin et al.
Mixer Layer
[Figure: each Mixer Layer applies a Layer Norm followed by MLPs that transform over the sequence domain (mixing across patches), then a Layer Norm followed by MLPs that transform over the hidden domain (mixing across channels).]
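A hedged sketch of one Mixer Layer (PyTorch; hidden sizes are illustrative):

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    """MLP-Mixer layer sketch: one MLP mixes over the sequence (token) dimension,
    a second MLP mixes over the hidden (channel) dimension, each with LayerNorm and a skip connection."""

    def __init__(self, tokens, dim, hidden=128):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(tokens, hidden), nn.GELU(), nn.Linear(hidden, tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                                                        # x: (batch, tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)    # mix across patches
        x = x + self.channel_mlp(self.norm2(x))                                  # mix across channels
        return x

x = torch.randn(2, 9, 64)                                    # e.g. 9 patches as in the figure
print(MixerLayer(tokens=9, dim=64)(x).shape)                 # torch.Size([2, 9, 64])
```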
What happens during training?
[Figure: learned knowledge over training. At some point the Convolutional Neural Network is not improving anymore, while the Vision Transformer is still improving.]
Why are they different?
Convolutional Neural Network
• Learns inductive biases: locality sensitive, translation invariant
• Lacks global understanding
Vision Transformers
• Able to find long-term dependencies
• Need very large datasets for training
A different point of view: ViTs are both local and global!
With a low amount of data, the ViT learns only global information: the attention heads focus on farther patches.
[Figure: example attention weights concentrated on distant patches.]
A different point of view: ViTs are both local and global!
With more data, the ViT also learns local information: heads in the higher layers still focus on farther patches, while heads in the lower layers focus on both farther and closer patches.
[Figure: example attention weights for lower- and higher-layer heads.]
A different point of view: they learn different representations!
[Figure: layer-wise representation similarity. ViTs show similar representations throughout the layers, while CNNs show clearly different representations between lower and higher layers.]
Do Vision Transformers See Like Convolutional Neural Networks? | Maithra Raghu et al.
A different point of view: Vision Transformers are very robust!
They hold up well under occlusion, distribution shift, adversarial perturbation and patch permutation.
Intriguing Properties of Vision Transformers | Muzammal Naseer et al.
Can we obtain the best of the two architectures?
What happens in CNNs? The feature maps shrink through the layers (28 × 28 → 24 × 24 → 12 × 12 → 8 × 8 → 4 × 4). Hey! They are patches!
Hybrids: a possible configuration!
[Figure: a Convolutional Neural Network extracts a feature map whose spatial positions are fed as tokens to a Transformer Encoder; an MLP on top outputs the CLASS.]
Combining EfficientNet and Vision Transformers for Video Deepfake Detection | Coccomini et al.
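A hedged sketch of such a hybrid (PyTorch/torchvision; a ResNet-18 backbone is used here as a stand-in for the EfficientNet of the paper, and all sizes are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18   # torchvision >= 0.13 for the `weights` argument

class HybridClassifier(nn.Module):
    """Hybrid sketch: a CNN backbone produces a small feature map whose spatial positions
    are treated as patch tokens for a Transformer Encoder, followed by an MLP head."""

    def __init__(self, dim=256, classes=2):
        super().__init__()
        cnn = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])   # keep conv stages, drop pool + fc
        self.proj = nn.Linear(512, dim)                              # project CNN channels to the token size
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, classes)                          # MLP head -> CLASS

    def forward(self, x):                                            # x: (B, 3, 224, 224)
        f = self.backbone(x)                                         # (B, 512, 7, 7) feature map
        tokens = self.proj(f.flatten(2).transpose(1, 2))             # (B, 49, dim): "they are patches!"
        tokens = torch.cat([self.cls.expand(x.size(0), -1, -1), tokens], dim=1)
        return self.head(self.encoder(tokens)[:, 0])

print(HybridClassifier()(torch.randn(2, 3, 224, 224)).shape)         # torch.Size([2, 2])
```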
Recap: how can we use Transformers in Vision?
• Use a pure Vision Transformer
• Improve the internal attention mechanism
• Use an alternative transformation
• Combine CNNs with Vision Transformers
ImageNet Ranking (top-1 accuracy)
1° CoCa: 91.0%, Vision Transformer, pretrained on JFT, 2100M parameters
2° Model soups: 90.98%, Hybrid, pretrained on JFT, 2440M parameters
3° CoAtNet-7: 90.88%, Hybrid, pretrained on JFT, 2440M parameters
APPLICATIONS!!!
[Figure: attention/segmentation maps on video frames, comparing supervised learning with the self-supervised DINO, whose maps isolate the objects much more cleanly.]
Source: Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training – Facebook AI
Source: Paint Transformer: Feed Forward Neural Painting with Stroke Prediction
Source: Image GPT – By OpenAI
DALL-E 2
"Avocados dancing, drinking, singing and partying at a Hawaiian luau"
"Teddy bears working on new AI research underwater with 1990s technology"
Useful links
To build your own Transformers or try pre-trained ones
Lucidrains: https://github.com/lucidrains
Huggingface: https://huggingface.co/
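For example, a pre-trained Vision Transformer can be tried in a few lines with the Hugging Face transformers library (the checkpoint name is just one publicly available option; recent library versions expose ViTImageProcessor, older ones use ViTFeatureExtractor):

```python
import torch
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = torch.randint(0, 256, (224, 224, 3), dtype=torch.uint8).numpy()   # stand-in for a real image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])                    # predicted ImageNet class
```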
Thank You for the Attention!
Any questions?

Editor's Notes

  • #2 Thanks for the kind introduction, and thanks to AICamp for helping us organize this webinar. Of course, thank you all for attending and for being so numerous. I hope this will be an interesting journey through one of the most popular and attractive deep learning architectures today: transformers. Bill already introduced us. We would like to say a few more words about ourselves. I'm Nicola Messina, a PhD student at the University of Pisa, collaborating with the Institute of Information Science and Technologies at the National Research Council. I have worked with transformers for 2 years now on cross-modal tasks. Davide Coccomini recently completed his master's degree and has now also joined the PhD programme in computer engineering in Pisa. His master's thesis was on Transformers for video deepfake detection, so he has a solid and recent background on Transformers for Computer Vision. He also wrote many articles on Medium. Check them out. Questions are welcome. Let's start.
  • #4 - A pretty accurate meme (circulated on deep learning groups on Facebook) depicting the transformer’s evolution over the years, at least for a computer scientist or a computer engineer. - As you can imagine, in this webinar the transformer will be the one in the rightmost picture.
  • #5 This is the Transformer originally used in the field of NLP for language translation. We will break this down in sufficient detail to understand why and how it could be applied to image processing.
  • #6 - A brief outline. Although this talk has vision transformers as the main characters of the story, we would like to start from zero, talking about Transformers in NLP, the field where they were first introduced. I apologize in advance to those who are already familiar with them. - Divided in 2 parts.
  • #7 In 2017: transformers first introduced in NLP. In 2020: transformers fully replace convolutional architectures in image processing; the Vision Transformer is out. In 2021: the same idea is used to process videos. Now: computer vision revolution.
  • #8 - Before diving directly into Transformers, we need to take a step back to natural language and talk about recurrent architectures for language translation, because this is where Transformers first obtained great success. - Encoder-decoder. - Each token is transformed into a vector and used to update the internal state (hidden state) of the encoder. - The final hidden state is used as a vector encoding the whole input sentence. - Then, the decoder uses this sentence embedding to drive the decoding process. - Previous hidden state + previous inferred token -> next token. - <end> token.
  • #9 1. We forget tokens too far in the past. Why? The hidden state is updated at each encoding/decoding step, so some important information from the past could be overwritten. 2. We need to wait for the previous token to compute the next hidden state. Computational problem: we need to wait for the previous result to compute the next one -> cannot be parallelized.
  • #10 We run the encoder. Use an attention mechanism: we save the hidden states produced during the encoding phase and, instead of solely using the final hidden state, we use all the hidden states during the translation. For each decoding step, we compute an attention vector. By summing all the contributions, we create a context vector.
  • #11 Solving problem 2: a trivial way would be to completely remove the recurrent connections. This is actually what was done in the groundbreaking 2017 paper "Attention Is All You Need". The Transformers are officially out.
  • #12 We have a target and a source sequence. Sticking to the translation problem, target -> translated sentence; source -> sentence to translate We derive queries from the target sequence and keys and values from the source sequence, using fully connected networks that share the weights among the tokens. Given a query token, we compute the affinity that each key has with the given query. After a normalization and softmax, these are normalized relevance scores, encoding the relevance of the query with each one of the keys. These are used to weight the values, which are summed to produce the first output token relative to the first target token. The process is repeated for each target token, to obtain the output. Output sequence has the same cardinality as the target.
  • #13 Let's see it from a different perspective. This is a simplistic view, just for understanding what is going on. The source sequence defines a lookup table. Given a token from the target sequence, we produce a query, which is used to soft-match the keys in this table. The keys and queries used in this example are mnemonic strings, not vectors, for ease of understanding. The keys most similar to the query are extracted and their values are combined with a weighted average. This is the analogy with a lookup table data structure, and it explains why they are called key, query and value.
  • #15 - Encoder and Decoder modules. Actually, N layers of encoder and M layers of decoder. - Compact representation: arrows carry sets of vectors. The memory acts as the hidden state. We do not spend too much time on language translation, as it is not our goal today. The decoder uses the already partially translated sentence and the memory to infer the next token. Inference is still recurrent, but the training process is not. I'm not going into details about this. Multi-head attention: what is this?
  • #16 An important thing to note is that many interesting results have been obtained using only the Encoder of the Transformer, which, we recall, is characterized exclusively by the self-attention mechanism.
  • #17 - Attention map. - In the next part of the webinar, Davide will tackle performance issues in more detail, together with some recent proposals for solving them.
  • #18 Many achievements have been obtained using only the Encoder. BERT is a powerful language model. It starts from the assumption that the transformer encoder, thanks to the self-attention mechanism, can discover long-term dependencies between words in the text. Differently from the translation case, the memory is used as the output. BERT is trained on some meta-tasks, namely … - Bridge between language and images.
  • #19 - With this background, we can finally introduce Transformers in Computer Vision. - Transformers -> Transformer Encoders. - Why would we want to use the self-attention mechanism? To automatically understand which image patches are relevant to a given query patch (for example, if we use the dog's nose as the query, we could automatically discover that it is related to the ears and/or eyes, and possibly not to the background). Also, the non-local nature of self-attention (all tokens are compared with all the others) enables discovering relationships between spatially distant image patches. This is difficult in CNNs, since filters analyze only pixel neighborhoods.
  • #21 - Tokens as the features from an object detector. This implies the use of object detectors as an upstream visual processing module. Pro: Consider visual-textual matching architectures. This has just the right level of abstraction, one visual concept -> one token
  • #22 Recently, convolutions were taken out of the loop: ViTs introduced. Transformers work directly on image patches, where the pixels in every patch are flattened and linearly projected to a fixed-size vector (that works as a token embedding).
  • #23 - Positional encoding
  • #24 We can see that large ViTs defeat large CNNs on image classification on ImageNet. This sounds promising, and opens the door to the second part of the webinar, in which Davide will talk in more detail about successive evolutions of the ViT, and about efficiency concerns too. I leave the floor to Davide.
  • #46 - They plot CKA similarities between all pairs of layers across different model architectures. We observe that ViTs have relatively uniform layer similarity structure, with a clear grid-like pattern and large similarity between lower and higher layers. By contrast, the ResNet models show clear stages in similarity structure, with smaller similarity scores between lower and higher layers.