Transformers In Vision: From Zero to Hero!
Davide Coccomini & Nicola Messina

Davide Coccomini: PhD Student at the University of Pisa and Research Associate at the Italian National Research Council
Nicola Messina: PostDoc Researcher at the Italian National Research Council
Outline
Transformers: From Language to Vision
• Some history: from RNNs to Transformers
• Transformers' attention and self-attention mechanisms
• The power of the Transformer Encoder
• From text to images: Vision Transformers and Swin Transformers
Vision Transformers: Challenges and Applications
• From images to videos
• Is Attention what we really need?
• Convolutional Neural Networks and Vision Transformers
• Some interesting real-world applications
History
• Text, 2017: Transformers introduced in NLP
• Images, 2020: Vision Transformers
• Videos, 2021: Transformers for video understanding
• Now: the Computer Vision revolution!
A step back: Recurrent Networks (RNNs)
[Figure: an encoder-decoder RNN translating "The cat jumps the wall" into "Il gatto salta il muro". The encoder (E) reads the source tokens one at a time, updating its hidden states h0…h4; the final hidden state is the sentence embedding. Starting from <s>, the decoder (D) combines this embedding with the previously generated token at each step (h5…h9) to emit "Il gatto salta il muro" and finally <end>.]
Problems
1. We forget tokens too far in the past
2. We need to wait for the previous token before computing the next hidden state
[Figure: the same encoder-decoder RNN; the hidden state is overwritten at every step, and decoding is strictly sequential.]
Solving problem 1: "We forget tokens too far in the past"
Solution: add an attention mechanism.
[Figure: all encoder hidden states h0…h4 are kept; at each decoding step they are weighted by attention scores and summed into a context vector that conditions the decoder, instead of relying only on the final hidden state.]
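A minimal sketch of this attention mechanism (PyTorch; function and variable names are illustrative, not the original implementation): the context vector is a softmax-weighted sum of all encoder hidden states.

```python
import torch
import torch.nn.functional as F

def attention_context(decoder_state, encoder_states):
    """Dot-product attention over the encoder hidden states.

    decoder_state:  (hidden_dim,)         current decoder hidden state
    encoder_states: (src_len, hidden_dim) all encoder hidden states h0..hN
    returns: context vector (hidden_dim,) and attention weights (src_len,)
    """
    scores = encoder_states @ decoder_state   # affinity of each encoder state with the decoder state
    weights = F.softmax(scores, dim=0)        # normalized attention weights
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

# toy usage: 5 source tokens, hidden size 8
enc = torch.randn(5, 8)
dec = torch.randn(8)
ctx, w = attention_context(dec, enc)
print(ctx.shape, w.sum())   # torch.Size([8]) tensor(1.)
```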
Solving problem 2: "We need to wait for the previous token before computing the next hidden state"
Solution: throw away the recurrent connections and keep only attention. This is what the 2017 paper "Attention Is All You Need" did, introducing the Transformer.
[Figure: the same encoder-decoder sketch with the recurrent connections removed; only the attention links between the decoder steps and the encoder states remain.]
Transformer's Attention Mechanism
Queries are the target tokens seen "from the point of view" of the source sequence.
[Figure: the target sequence ("Il gatto salta …") is projected by feed-forward networks (FFN) into Queries, while the source sequence ("The cat jumps the …") is projected into Keys and Values. Each query is combined with every key via a dot product, the scores go through normalization and a softmax, and the resulting weights are used to average the values into one output token per query.]
Transformer's Attention Mechanism, from a different perspective
[Figure: the source sequence ("The cat jumps the … wall") defines a lookup table of key-value pairs, with mnemonic keys such as "A4", "N9", "O7", "N2". A target token such as "gatto" produces a query (e.g. "N5") that soft-matches the keys; the corresponding values are combined with a weighted average, so the output "gatto" token is built by aggregating value vectors from the source dictionary.]
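A minimal single-head sketch of this query/key/value attention in PyTorch (class and layer names are illustrative; real Transformers use multiple heads and extra normalization):

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Single-head attention: queries from the target, keys/values from the source."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)   # FFN producing queries (target side)
        self.to_k = nn.Linear(dim, dim)   # FFN producing keys (source side)
        self.to_v = nn.Linear(dim, dim)   # FFN producing values (source side)
        self.scale = dim ** -0.5

    def forward(self, target, source):
        q = self.to_q(target)                             # (T, dim)
        k = self.to_k(source)                             # (S, dim)
        v = self.to_v(source)                             # (S, dim)
        scores = q @ k.transpose(-2, -1) * self.scale     # (T, S) dot products
        weights = scores.softmax(dim=-1)                  # soft-matching over the source
        return weights @ v                                # (T, dim) weighted average of the values

# toy usage: 3 target tokens attend over 5 source tokens, embedding size 16
attn = SimpleAttention(16)
out = attn(torch.randn(3, 16), torch.randn(5, 16))
print(out.shape)  # torch.Size([3, 16])
```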
Cross-Attention and Self-Attention
Self-Attention
• Source = Target
• Keys, Queries and Values obtained from the same sentence
• Captures intra-sequence dependencies (e.g. "I gave my dog Charlie some food": Who? To whom? What?)
Cross-Attention
• Source ≠ Target
• Queries from the Target; Keys and Values from the Source
• Captures inter-sequence dependencies (e.g. "I gave my dog Charlie some food" vs "Ho dato da mangiare al mio cane Charlie")
[Figure: two Multi-Head Attention blocks, one receiving V, K, Q from the same sequence (self-attention) and one receiving Q from the target and K, V from the source (cross-attention).]
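The same module can implement both flavours; a small PyTorch sketch (shapes and names are illustrative) only changes where queries, keys and values come from:

```python
import torch
import torch.nn as nn

dim, heads = 32, 4
mha = nn.MultiheadAttention(dim, heads, batch_first=True)

source = torch.randn(1, 7, dim)   # e.g. the English sentence (7 tokens, already embedded)
target = torch.randn(1, 5, dim)   # e.g. the partial Italian translation (5 tokens)

# Self-attention: queries, keys and values all come from the same sequence
self_out, _ = mha(query=source, key=source, value=source)

# Cross-attention: queries from the target, keys and values from the source
cross_out, _ = mha(query=target, key=source, value=source)

print(self_out.shape, cross_out.shape)  # torch.Size([1, 7, 32]) torch.Size([1, 5, 32])
```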
Full Transformer Architecture
[Figure: the encoder (Nx blocks of Multi-Head Attention, Add & Norm and Feed Forward) reads the input "The cat jumps the wall", with positional encodings added to the embeddings, and produces the memory. The decoder (Mx blocks) processes the partial output "Il gatto" with self-attention, cross-attends to the memory, and a final Linear + Softmax predicts the next token: "Salta" (90%) | "Odia" (9%) | "Perché" (1%).]
Self-attention: input tokens are contextualized with tokens from the same sentence; this is what allows understanding the meaning of the sentence.
Cross-attention: the elements of the memory constitute the dictionary (the lookup table built from the source sequence) used to contextualize every token during the decoding stage.
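For orientation, a hedged sketch of the full encoder-decoder stack using PyTorch's built-in nn.Transformer (hyperparameters are illustrative; token embeddings are faked with random tensors and positional encodings are omitted):

```python
import torch
import torch.nn as nn

# Encoder-decoder Transformer: N encoder layers produce the "memory",
# M decoder layers mix self-attention (on the target) and cross-attention (on the memory).
model = nn.Transformer(
    d_model=64, nhead=4,
    num_encoder_layers=2, num_decoder_layers=2,   # the Nx and Mx of the figure
    batch_first=True,
)

src = torch.randn(1, 5, 64)   # embedded source tokens ("The cat jumps the wall")
tgt = torch.randn(1, 2, 64)   # embedded target prefix ("Il gatto")

# A causal mask keeps each target position from peeking at future tokens
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

out = model(src, tgt, tgt_mask=tgt_mask)          # (1, 2, 64) decoder output
head = nn.Linear(64, 30000)                       # Linear + Softmax head over a toy vocabulary
probs = head(out[:, -1]).softmax(dim=-1)          # e.g. "salta" should get the highest probability
print(out.shape, probs.shape)                     # torch.Size([1, 2, 64]) torch.Size([1, 30000])
```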
Self-Attention
[Figure: every token of "I gave my dog Charlie some food" attends to every other token of the same sentence.]
The attention calculation is O(n²) in the sequence length.
The Power of the Transformer Encoder
• Many achievements have been obtained using only the Encoder
• BERT (Devlin et al., 2018)
[Figure: the input "<CLS> I gave my dog Charlie some food <SEP> He ate it" goes through an embedding layer and positional encoding into a Transformer Encoder (N layers). The output memory feeds the pre-training tasks: Next Sentence Prediction ({0, 1}, from the <CLS> token) and Masked Language Modelling (predicting the masked word «ate»).]
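A minimal encoder-only sketch in the spirit of BERT (all sizes and the token layout are illustrative; a real BERT uses WordPiece tokenization, many more layers and much larger dimensions):

```python
import torch
import torch.nn as nn

vocab, dim, max_len = 1000, 64, 32
embed = nn.Embedding(vocab, dim)
pos = nn.Embedding(max_len, dim)                       # learned positional encoding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
mlm_head = nn.Linear(dim, vocab)                       # predicts the masked token
nsp_head = nn.Linear(dim, 2)                           # next-sentence prediction from <CLS>

tokens = torch.randint(0, vocab, (1, 10))              # "<CLS> I gave my dog [MASK] some food <SEP> ..."
x = embed(tokens) + pos(torch.arange(tokens.size(1)))
memory = encoder(x)                                    # contextualized tokens (the "memory")

nsp_logits = nsp_head(memory[:, 0])                    # <CLS> token -> {0, 1}
mlm_logits = mlm_head(memory[:, 5])                    # masked position -> vocabulary ("ate")
print(memory.shape, nsp_logits.shape, mlm_logits.shape)
```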
Transformers in Computer Vision
Can we use the self-attention mechanism in images?
• The transformer works on a set of tokens: what are the tokens in an image?
• Single pixels cannot be the tokens: a 256px × 256px image has on the order of 62,500 pixels, and pixel-level self-attention would require about 3,906,250,000 pairwise calculations. Impossible!
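The back-of-the-envelope arithmetic behind the slide (using the slide's rounded pixel count):

```python
# Self-attention compares every token with every other token: O(n^2) in the sequence length.
pixels = 62_500                 # the slide's (rounded) pixel count for a ~256x256 image
print(pixels ** 2)              # 3906250000 pairwise comparisons -> impractical at pixel level

patches = (256 // 16) ** 2      # 256 tokens if we use 16x16 patches instead (see ViT below)
print(patches ** 2)             # 65536 comparisons -> easily manageable
```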
Transformers in Computer Vision
• Tokens as the features from an object detector: ROI Pooling over the detected regions produces one token per object.
Vision Transformers (ViTs)
"An image is worth 16x16 words": the 256px image is split into 16px × 16px patches, and each flattened patch is linearly projected into a token.
"An Image is Worth 16x16 Words" | Dosovitskiy et al., 2020
Vision Transformers (ViTs)
[Figure: the patches 1…9 go through a Linear Projection of Flattened Patches; a learnable class token (position 0, marked *) is prepended, positional embeddings are added, and the sequence is processed by the Transformer Encoder. An MLP Head on the class token outputs the CLASS prediction.]
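A compact ViT-style sketch in PyTorch (sizes are illustrative; the patch projection is implemented with a strided convolution, which is equivalent to flattening 16×16 patches and applying a linear layer):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: image -> 16x16 patches -> Transformer Encoder -> CLS -> MLP head."""

    def __init__(self, img=256, patch=16, dim=64, classes=10):
        super().__init__()
        n = (img // patch) ** 2                                     # number of patch tokens
        self.to_patches = nn.Conv2d(3, dim, patch, stride=patch)    # linear projection of flattened patches
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))             # learnable class token (the "0 *")
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))         # learnable positional embeddings
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, classes)                         # MLP head on the class token

    def forward(self, x):                                           # x: (B, 3, 256, 256)
        p = self.to_patches(x).flatten(2).transpose(1, 2)           # (B, 256, dim) patch tokens
        tok = torch.cat([self.cls.expand(x.size(0), -1, -1), p], dim=1) + self.pos
        return self.head(self.encoder(tok)[:, 0])                   # classify from the CLS token

print(TinyViT()(torch.randn(2, 3, 256, 256)).shape)                 # torch.Size([2, 10])
```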
Image Classification on ImageNet
[Figure: benchmark results showing large ViTs overtaking large CNNs on ImageNet classification.]
Swin Transformers
[Figure: the Vision Transformer keeps a single 16x patch resolution throughout, while the Swin Transformer builds a hierarchy of feature maps at 4x, 8x and 16x resolution, from low-level (pixels) to high-level features, using Shifted Window based Self-Attention.]
Swin Transformers
[Figure: architecture overview. The H × W × 3 image goes through Patch Partition (H/4 × W/4 × 48) and a Linear Embedding (H/4 × W/4 × C), then through four stages of Swin Transformer Blocks (×2, ×2, ×6, ×2); Patch Merging between the stages produces H/8 × W/8 × 2C, H/16 × W/16 × 4C and H/32 × W/32 × 8C feature maps.]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | Ze Liu et al., 2021
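A hedged sketch of the Patch Merging step between stages (shapes follow the figure; the exact concatenation order in the official implementation may differ):

```python
import torch
import torch.nn as nn

def patch_merging(x, reduction):
    """Swin-style patch merging: halves H and W, doubles the channel dimension.

    x: (B, H, W, C) map of patch tokens.
    reduction: nn.Linear(4 * C, 2 * C), shared across positions.
    """
    x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                   x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # group each 2x2 neighbourhood: (B, H/2, W/2, 4C)
    return reduction(x)                                           # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)                                    # Stage-1 output: H/4 x W/4 x C
print(patch_merging(x, nn.Linear(4 * 96, 2 * 96)).shape)          # (1, 28, 28, 192): H/8 x W/8 x 2C
```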
Swin Transformer Block
Two cascaded Transformer Encoder blocks:
• Window Self-Attention (W-MSA): Multi-Head Attention with Add & Norm and Feed Forward, computed inside non-overlapping local windows
• Shifted Window Self-Attention (SW-MSA): the same block, applied to windows shifted by half a window size so that information can cross window borders
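A minimal sketch of window partitioning and the half-window shift (PyTorch; window size and shapes are illustrative, and the real SW-MSA also needs an attention mask for the wrapped-around windows):

```python
import torch

def windows(x, w):
    """Split a (B, H, W, C) token map into non-overlapping w x w windows for local self-attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, w * w, C)           # (num_windows * B, w*w, C): attention runs inside each window

x = torch.randn(1, 56, 56, 96)               # Stage-1 token map, window size 7
print(windows(x, 7).shape)                   # (64, 49, 96): 64 windows of 49 tokens each

# SW-MSA: the second block of the pair shifts the map by half a window before partitioning,
# so that information can flow across the window borders used by the first block.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
print(windows(shifted, 7).shape)             # same shape, different window content
```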
What about video?
ViViT
[Figure: the ViViT video transformer architecture.]
TimeSformers
Combine space and time attention with Divided Space-Time Attention: each patch attends over time to the patches at the same spatial position in the neighbouring frames (frame t−δ, frame t, frame t+δ), and over space to the other patches of its own frame.
Is Space-Time Attention All You Need for Video Understanding? | Gedas Bertasius et al., 2021
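A hedged sketch of divided space-time attention (PyTorch; shapes are illustrative, and residuals, norms and the classification token are omitted):

```python
import torch
import torch.nn as nn

# Divided space-time attention: first attend over time (same patch position across frames),
# then over space (patches within the same frame).
B, T, P, dim = 2, 8, 196, 64          # batch, frames, patches per frame, embedding size
x = torch.randn(B, T, P, dim)

time_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
space_attn = nn.MultiheadAttention(dim, 4, batch_first=True)

# Time attention: fold the patch axis into the batch, sequence length = T
xt = x.permute(0, 2, 1, 3).reshape(B * P, T, dim)
xt, _ = time_attn(xt, xt, xt)
x = xt.reshape(B, P, T, dim).permute(0, 2, 1, 3)

# Space attention: fold the time axis into the batch, sequence length = P
xs = x.reshape(B * T, P, dim)
xs, _ = space_attn(xs, xs, xs)
x = xs.reshape(B, T, P, dim)
print(x.shape)                        # torch.Size([2, 8, 196, 64])
```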
TimeSformers
Up to several minutes of video can be analysed at once!
Video Swin Transformers
[Figure: the Video Swin Transformer architecture, which extends shifted-window attention to the temporal dimension.]
Video Swin Transformer | Ze Liu et al., 2022
Action Classification on Kinetics-400
[Figure: leaderboard with transformer-based models such as Video Swin, CoVer, MTV-H, CoCa and EVL among the top entries.]
Can we do without Self-Attention?
What essentially is the attention mechanism? It is "just" a transformation!
[Figure: the Attention Mechanism seen as a black-box transformation applied to the input.]
Fourier Network
[Figure: a Transformer-encoder-style block in which the Attention Calculation is replaced by a Fourier Transformation: Input → Embeddings → Fourier Transformation → Add & Normalize → Feed Forward → Add & Normalize → Dense → Output Prediction.]
Why Fourier? It's just a transformation!
[Figure: illustration of the Fourier Transform; image from mriquestion.com.]
Fourier Network: what does it transform?
[Figure: the input vectors are transformed twice, with one Fourier Transform over the hidden domain and one over the sequence domain.]
FNet: Mixing Tokens with Fourier Transforms | James Lee-Thorp et al.
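A minimal FNet-style mixing sketch (PyTorch; this replaces only the attention sub-layer, everything else in the encoder block stays the same):

```python
import torch
import torch.nn as nn

class FourierMixing(nn.Module):
    """FNet-style mixing sketch: self-attention is replaced by a Fourier transform applied
    over the sequence (token) dimension and the hidden dimension; only the real part is kept."""

    def forward(self, x):                       # x: (batch, seq_len, hidden)
        return torch.fft.fft2(x, dim=(-2, -1)).real

x = torch.randn(2, 10, 64)
print(FourierMixing()(x).shape)                 # torch.Size([2, 10, 64]); the mixing has no learned parameters
# In the full block this output goes through Add & Normalize and a feed-forward layer,
# exactly like the attention output in a standard Transformer encoder block.
```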
MLP Mixer
[Figure: the image patches 1…9 go through a Per-patch Fully Connected layer, then N Mixer Layers, Global Average Pooling and a Fully-Connected head that outputs the CLASS.]
MLP-Mixer: An all-MLP Architecture for Vision | Ilya Tolstikhin et al.
Mixer Layer
[Figure: each Mixer Layer applies a Layer Norm followed by MLPs that transform over the sequence domain (mixing across patches), then a Layer Norm followed by MLPs that transform over the hidden domain (mixing across channels).]
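A hedged sketch of one Mixer Layer (PyTorch; hidden sizes are illustrative):

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    """MLP-Mixer layer sketch: one MLP mixes over the sequence (token) dimension,
    a second MLP mixes over the hidden (channel) dimension, each with LayerNorm and a skip connection."""

    def __init__(self, tokens, dim, hidden=128):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(tokens, hidden), nn.GELU(), nn.Linear(hidden, tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                                                        # x: (batch, tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)    # mix across patches
        x = x + self.channel_mlp(self.norm2(x))                                  # mix across channels
        return x

x = torch.randn(2, 9, 64)                                    # e.g. 9 patches as in the figure
print(MixerLayer(tokens=9, dim=64)(x).shape)                 # torch.Size([2, 9, 64])
```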
What happens during training?
[Figure: learned knowledge over training. At some point the Convolutional Neural Network is not improving anymore, while the Vision Transformer is still improving.]
Why are they different?
Convolutional Neural Network
• Learns inductive biases: locality sensitive, translation invariant
• Lacks global understanding
Vision Transformers
• Able to find long-term dependencies
• Need very large datasets for training
A different point of view: ViTs are both local and global!
With a low amount of data, the ViT learns only global information: the attention heads focus on farther patches.
[Figure: example attention weights concentrated on distant patches.]
A different point of view: ViTs are both local and global!
With more data, the ViT also learns local information: heads in the higher layers still focus on farther patches, while heads in the lower layers focus on both farther and closer patches.
[Figure: example attention weights for lower- and higher-layer heads.]
A different point of view: they learn different representations!
[Figure: layer-wise representation similarity. ViTs show similar representations throughout the layers, while CNNs show clearly different representations between lower and higher layers.]
Do Vision Transformers See Like Convolutional Neural Networks? | Maithra Raghu et al.
A different point of view: Vision Transformers are very robust!
They hold up well under occlusion, distribution shift, adversarial perturbation and patch permutation.
Intriguing Properties of Vision Transformers | Muzammal Naseer et al.
Can we obtain the best of the two architectures?
What happens in CNNs? The feature maps shrink through the layers (28 × 28 → 24 × 24 → 12 × 12 → 8 × 8 → 4 × 4). Hey! They are patches!
Hybrids: a possible configuration!
[Figure: a Convolutional Neural Network extracts a feature map whose spatial positions are fed as tokens to a Transformer Encoder; an MLP on top outputs the CLASS.]
Combining EfficientNet and Vision Transformers for Video Deepfake Detection | Coccomini et al.
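A hedged sketch of such a hybrid (PyTorch/torchvision; a ResNet-18 backbone is used here as a stand-in for the EfficientNet of the paper, and all sizes are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18   # torchvision >= 0.13 for the `weights` argument

class HybridClassifier(nn.Module):
    """Hybrid sketch: a CNN backbone produces a small feature map whose spatial positions
    are treated as patch tokens for a Transformer Encoder, followed by an MLP head."""

    def __init__(self, dim=256, classes=2):
        super().__init__()
        cnn = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])   # keep conv stages, drop pool + fc
        self.proj = nn.Linear(512, dim)                              # project CNN channels to the token size
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, classes)                          # MLP head -> CLASS

    def forward(self, x):                                            # x: (B, 3, 224, 224)
        f = self.backbone(x)                                         # (B, 512, 7, 7) feature map
        tokens = self.proj(f.flatten(2).transpose(1, 2))             # (B, 49, dim): "they are patches!"
        tokens = torch.cat([self.cls.expand(x.size(0), -1, -1), tokens], dim=1)
        return self.head(self.encoder(tokens)[:, 0])

print(HybridClassifier()(torch.randn(2, 3, 224, 224)).shape)         # torch.Size([2, 2])
```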
Recap: how can we use Transformers in Vision?
• Use a pure Vision Transformer
• Improve the internal attention mechanism
• Use an alternative transformation
• Combine CNNs with Vision Transformers
ImageNet Ranking (top-1 accuracy)
1° CoCa: 91.0%, Vision Transformer, pretrained on JFT, 2100M parameters
2° Model soups: 90.98%, Hybrid, pretrained on JFT, 2440M parameters
3° CoAtNet-7: 90.88%, Hybrid, pretrained on JFT, 2440M parameters
APPLICATIONS!!!
[Figure: attention/segmentation maps on video frames, comparing supervised learning with the self-supervised DINO, whose maps isolate the objects much more cleanly.]
Source: Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training – Facebook AI
Source: Paint Transformer: Feed Forward Neural Painting with Stroke Prediction
Source: Image GPT – By OpenAI
DALL-E 2
"Avocados dancing, drinking, singing and partying at a Hawaiian luau"
"Teddy bears working on new AI research underwater with 1990s technology"
Useful links
To build your own Transformers or try pre-trained ones
Lucidrains: https://github.com/lucidrains
Huggingface: https://huggingface.co/
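For example, a pre-trained Vision Transformer can be tried in a few lines with the Hugging Face transformers library (the checkpoint name is just one publicly available option; recent library versions expose ViTImageProcessor, older ones use ViTFeatureExtractor):

```python
import torch
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = torch.randint(0, 256, (224, 224, 3), dtype=torch.uint8).numpy()   # stand-in for a real image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])                    # predicted ImageNet class
```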
Thank You for the Attention!
Any questions?

Editor's Notes

  • #2 Thanks for the kind introduction, and thanks to AICamp for helping us organize this webinar. Of course, thank you all for attending and for being so numerous. I hope this will be an interesting journey through one of the most popular and attractive deep learning architectures today: transformers. Bill already introduced us. We would like to say a few more words about ourselves. I'm Nicola Messina, a PhD student at the University of Pisa, collaborating with the Institute of Information Science and Technologies at the National Research Council. I have worked with transformers for 2 years now on cross-modal tasks. Davide Coccomini recently completed his master's degree and has now also joined the PhD programme in computer engineering in Pisa. His master's thesis was on Transformers for video deepfake detection, so he has a solid and recent background on Transformers for Computer Vision. He also wrote many articles on Medium. Check them out. Questions are welcome. Let's start.
  • #4 - A pretty accurate meme (circulated on deep learning groups on Facebook) depicting the transformer’s evolution over the years, at least for a computer scientist or a computer engineer. - As you can imagine, in this webinar the transformer will be the one in the rightmost picture.
  • #5 This is the Transformer originally used in the field of NLP for language translation. We will break this down in sufficient detail to understand why and how it could be applied to image processing.
  • #6 - A brief outline. Although this talk has vision transformers as the main characters of the story, we would like to start from zero, talking about Transformers in NLP, the field where they were first introduced. I apologize in advance to those who are already familiar with them. - Divided in 2 parts.
  • #7 In 2017: transformers first introduced in NLP. In 2020: transformers fully replace convolutional architectures in image processing; the Vision Transformer is out. In 2021: the same idea is used to process videos. Now: computer vision revolution.
  • #8 - Before diving directly into Transformers, we need to take a step back to natural language and talk about recurrent architectures for language translation, because this is where Transformers first obtained great success. - Encoder-decoder. - Each token is transformed into a vector and used to update the internal state (hidden state) of the encoder. - The final hidden state is used as a vector encoding the whole input sentence. - Then, the decoder uses this sentence embedding to drive the decoding process. - Previous hidden state + previous inferred token -> next token. - <end> token.
  • #9 1. We forget tokens too far in the past. Why? The hidden state is updated at each encoding/decoding step, so some important information from the past could be overwritten. 2. We need to wait for the previous token to compute the next hidden state. Computational problem: we need to wait for the previous result to compute the next one -> cannot be parallelized.
  • #10 We run the encoder. Use an attention mechanism: we save the hidden states produced during the encoding phase and, instead of solely using the final hidden state, we use all the hidden states during the translation. For each decoding step, we compute an attention vector. By summing all the contributions, we create a context vector.
  • #11 Solving problem 2: a trivial way would be to completely remove the recurrent connections. This is actually what was done in the groundbreaking 2017 paper "Attention Is All You Need". The Transformers are officially out.
  • #12 We have a target and a source sequence. Sticking to the translation problem, target -> translated sentence; source -> sentence to translate We derive queries from the target sequence and keys and values from the source sequence, using fully connected networks that share the weights among the tokens. Given a query token, we compute the affinity that each key has with the given query. After a normalization and softmax, these are normalized relevance scores, encoding the relevance of the query with each one of the keys. These are used to weight the values, which are summed to produce the first output token relative to the first target token. The process is repeated for each target token, to obtain the output. Output sequence has the same cardinality as the target.
  • #13 Let's see it from a different perspective. This is a simplistic view, just for understanding what is going on. The source sequence defines a lookup table. Given a token from the target sequence, we produce a query, which is used to soft-match the keys in this table. The keys and queries used in this example are mnemonic strings, not vectors, for ease of understanding. The keys most similar to the query are extracted and their values are combined with a weighted average. This is the analogy with a lookup table data structure, and it explains why they are called key, query and value.
  • #15 - Encoder and Decoder modules. Actually, N layers of encoder and M layers of decoder. - Compact representation: arrows carry sets of vectors. The memory acts as the hidden state. We do not spend too much time on language translation, as it is not our goal today. The decoder uses the already partially translated sentence and the memory to infer the next token. Inference is still recurrent, but the training process is not. I'm not going into details about this. Multi-head attention: what is this?
  • #16 An important thing to note is that many interesting results have been obtained using only the Encoder of the Transformer, which, we recall, is characterized exclusively by the self-attention mechanism.
  • #17 - Attention map. - In the next part of the webinar, Davide will tackle performance issues in more detail, together with some recent proposals for solving them.
  • #18 Many achievements have been obtained using only the Encoder. BERT is a powerful language model. It starts from the assumption that the transformer encoder, thanks to the self-attention mechanism, can discover long-term dependencies between words in the text. Differently from the translation case, the memory is used as the output. BERT is trained on some meta-tasks, namely … - Bridge between language and images.
  • #19 - With this background, we can finally introduce Transformers in Computer Vision. - Transformers -> Transformer Encoders. - Why would we want to use the self-attention mechanism? To automatically understand which image patches are relevant to a given query patch (for example, if we use the dog's nose as the query, we could automatically discover that it is related to the ears and/or eyes, and possibly not to the background). Also, the non-local nature of self-attention (all tokens are compared with all the others) enables discovering relationships between spatially distant image patches. This is difficult in CNNs, since filters analyze only pixel neighborhoods.
  • #21 - Tokens as the features from an object detector. This implies the use of object detectors as an upstream visual processing module. Pro: Consider visual-textual matching architectures. This has just the right level of abstraction, one visual concept -> one token
  • #22 Recently, convolutions were taken out of the loop: ViTs introduced. Transformers work directly on image patches, where the pixels in every patch are flattened and linearly projected to a fixed-size vector (that works as a token embedding).
  • #23 - Positional encoding
  • #24 We can see that large ViTs defeat large CNNs on image classification on ImageNet. This sounds promising, and opens the door to the second part of the webinar, in which Davide will talk in more detail about successive evolutions of the ViT, and about efficiency concerns too. I leave the floor to Davide.
  • #46 - They plot CKA similarities between all pairs of layers across different model architectures. We observe that ViTs have relatively uniform layer similarity structure, with a clear grid-like pattern and large similarity between lower and higher layers. By contrast, the ResNet models show clear stages in similarity structure, with smaller similarity scores between lower and higher layers.