Transformer in Vision
Sangmin Woo
2020.10.29
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2 / 36
Contents
[2018 ICML] Image Transformer
Niki Parmar1 Ashish Vaswani1 Jakob Uszkoreit1 Łukasz Kaiser1 Noam Shazeer1 Alexander Ku2,3 Dustin Tran4
1Google Brain, Mountain View, USA
2Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
3Work done during an internship at Google Brain
4Google AI, Mountain View, USA.
[2019 CVPR] Video Action Transformer Network
Rohit Girdhar1 João Carreira2 Carl Doersch2 Andrew Zisserman2,3
1Carnegie Mellon University 2DeepMind 3University of Oxford
[2020 ECCV] End-to-End Object Detection with Transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko
Facebook AI
[2021 ICLR under review] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Google Research, Brain Team
3 / 36
Background
Attention Is All You Need [2017 NIPS]
• The main idea of the original architecture is to compute self-attention by comparing a feature to all other features in the sequence.
• Features are first mapped to query (Q) and memory (key and value, K & V) embeddings using linear projections.
• The output for a query is computed as an attention-weighted sum of the values (V), with the attention weights obtained from the product of the query (Q) with the keys (K).
• In practice, the query (Q) is the word being translated, and the keys (K) and values (V) are linear projections of the input sequence and of the output sequence generated so far.
• A positional encoding is also added to these representations to incorporate the positional information that is otherwise lost in this setup (a minimal sketch of the attention computation follows the reference below).
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
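As a concrete reference, here is a minimal sketch of the scaled dot-product attention described above (shapes and the helper name are illustrative, not code from the paper):

import torch

def scaled_dot_product_attention(q, k, v):
    # q: (n_query, d); k and v: (n_memory, d)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # query-key products
    weights = scores.softmax(dim=-1)              # attention weights
    return weights @ v                            # attention-weighted sum of values

# toy example: 4 queries attending over 6 memory positions of dimension 8
q, k, v = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 8)
out = scaled_dot_product_attention(q, k, v)       # shape (4, 8)

In the full model, q, k, and v come from learned linear projections, and the positional encoding is added to the inputs beforehand.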
4 / 36
Image Transformer [2018 ICML]
Generative models (Image Generation, Super-Resolution, Image Completion)
5 / 36
Image Transformer [2018 ICML]
Pixel-RNN / Pixel-CNN (van den Oord et al., 2016)
• Straightforward
• Tractable likelihood
• Simple and stable
[2] van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. ICML, 2016.
[3] van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. NIPS, 2016.
[Figures: Pixel-RNN, Pixel-CNN]
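Both models owe their tractable likelihood to the same autoregressive chain-rule factorization over (sub)pixels; in LaTeX notation:

p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})

Each factor is a categorical distribution over intensity values, predicted from the previously generated pixels by the RNN or the masked CNN.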
6 / 36
Image Transformer [2018 ICML]
Motivation
• Pixel-RNN and Pixel-CNN cast image generation as a sequence modeling problem, applying an RNN or CNN to predict each next pixel given all previously generated pixels.
• RNNs are computationally heavy.
• CNNs are parallelizable.
• CNNs have a limited receptive field → long-range dependency problem → stack more layers? → expensive.
• RNNs have a virtually unlimited receptive field.
• Self-attention can achieve a better balance in the trade-off between the virtually unlimited receptive field of the necessarily sequential PixelRNN and the limited receptive field of the much more parallelizable PixelCNN and its various extensions.
7 / 36
Image Transformer [2018 ICML]
Image Completion & Super-resolution
8 / 36
Image Transformer [2018 ICML]
Image Transformer
• 𝑞: a single channel of one pixel (query)
• 𝑚1, 𝑚2, 𝑚3: memory of previously generated pixels (keys)
• 𝑝𝑞, 𝑝1, 𝑝2, 𝑝3: position encodings
• 𝑐𝑚𝑝: first embed the query and key, then apply the dot product.
9 / 36
Image Transformer [2018 ICML]
Local Self-Attention
• The scalability issue lies in the self-attention mechanism.
• Restrict the positions in the memory matrix M to a local neighborhood around the query position (sketched below).
• 1D vs. 2D attention
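A rough sketch of the restriction (the window size, the causal ordering over a flattened pixel sequence, and the omission of learned projections are simplifications for illustration; the paper attends within 1D or 2D query blocks):

import torch

def local_masked_attention(x, window=8):
    # x: (n, d) features of the flattened, already-generated pixel sequence
    n, d = x.shape
    q, k, v = x, x, x                                  # linear projections omitted
    scores = q @ k.t() / d ** 0.5                      # (n, n) pairwise scores
    idx = torch.arange(n)
    # each position may only attend to itself and the `window` - 1 preceding positions
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    scores = scores.masked_fill(~allowed, float('-inf'))
    return scores.softmax(dim=-1) @ v                  # (n, d)

out = local_masked_attention(torch.randn(64, 16), window=8)

Note that this sketch still materializes the full score matrix; the actual benefit comes from slicing the memory to the local block, so the per-query cost stops growing with image size.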
10/ 36
Image Transformer [2018 ICML]
Image Generation
11 / 36
Image Transformer [2018 ICML]
Super-resolution
12/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition
[4] Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. ECCV, 2016.
13/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition & Localization
14/ 36
Video Action Transformer Network [2019 CVPR]
Action Transformer
• The Action Transformer unit takes as input the video feature representation and the box proposal from the RPN, and maps them into query (𝑄) and memory (𝐾 & 𝑉) features.
• Query (𝑄): the person being classified.
• Memory (𝐾 & 𝑉): the clip around the person.
• The unit processes the query (𝑄) and memory (𝐾 & 𝑉) to output an updated query vector (𝑄*).
• The intuition is that self-attention adds context from other people and objects in the clip to the query (𝑄) vector, to aid the subsequent classification.
• The unit can be stacked over multiple heads and layers by concatenating the outputs of the heads at a given layer and using the concatenated feature as the next query.
• The updated query (𝑄*) is then used to attend to context features again in the following layer (a rough sketch of the unit follows below).
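A hedged, single-head sketch of such a unit under the definitions above (the dimensions, the feed-forward block, and the layer placement are illustrative, not the paper's exact design):

import torch
import torch.nn as nn

class ActionTransformerUnit(nn.Module):
    # One attention unit: person feature -> query, clip features -> key/value.
    def __init__(self, d_person=1024, d_clip=1024, d=128):
        super().__init__()
        self.to_q = nn.Linear(d_person, d)
        self.to_k = nn.Linear(d_clip, d)
        self.to_v = nn.Linear(d_clip, d)
        self.ffn = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, person_feat, clip_feat):
        # person_feat: (d_person,) RoI-pooled feature of the person box
        # clip_feat: (T*H*W, d_clip) flattened spatio-temporal features of the clip
        q = self.to_q(person_feat)                      # query from the person
        k, v = self.to_k(clip_feat), self.to_v(clip_feat)
        w = (k @ q / q.numel() ** 0.5).softmax(dim=0)   # attention of the person over the clip
        updated = q + w @ v                             # residual update with clip context
        return updated + self.ffn(updated)              # Q*: updated query vector

unit = ActionTransformerUnit()
q_star = unit(torch.randn(1024), torch.randn(8 * 16 * 16, 1024))   # shape (128,)

Stacking several such heads and concatenating their outputs gives the next layer's query, as described above.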
15/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition
16/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition
17/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition
18/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Object Detection
[5] Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
19/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Faster R-CNN
[5] Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
20/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Non-Maximum Suppression (NMS)
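For reference, a minimal greedy NMS sketch (the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are just conventional defaults; in practice torchvision.ops.nms provides the same operation):

import torch
from torchvision.ops import box_iou

def nms(boxes, scores, iou_thresh=0.5):
    # keep the highest-scoring box, drop remaining boxes that overlap it too much, repeat
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thresh]
    return keep

kept = nms(torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]]),
           torch.tensor([0.9, 0.8, 0.7]))   # -> [0, 2]; box 1 is suppressed as a duplicate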
21/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Motivation
• Multi-stage pipeline.
• Too many hand-crafted components and heuristics (e.g., non-maximum suppression, anchor boxes) that explicitly encode our prior knowledge about the task.
• Let's simplify these pipelines with an end-to-end philosophy!
• Let's remove the need for heuristics with direct set prediction!
• Unique predictions are forced via a bipartite matching loss between predicted and ground-truth objects.
• Encoder-decoder architecture based on the Transformer.
• Transformers explicitly model all pairwise interactions between elements in a sequence, which makes them particularly suitable for the specific constraints of set prediction, such as removing duplicate predictions (a sketch of the matching step follows below).
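A hedged sketch of the bipartite matching step (the cost below uses only the class probability and an L1 box term with a made-up weight; DETR's full matching cost also includes a generalized-IoU term):

import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (N, C), pred_boxes: (N, 4); gt_labels: (M,), gt_boxes: (M, 4)
    prob = pred_logits.softmax(-1)                       # class probabilities
    cost_class = -prob[:, gt_labels]                     # (N, M): higher prob -> lower cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M): L1 box distance
    cost = cost_class + 5.0 * cost_bbox                  # weighting is illustrative
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())  # Hungarian algorithm
    return pred_idx, gt_idx                              # unique prediction per ground truth

# toy example: 100 predicted slots, 3 ground-truth objects, 92 classes
pi, gi = match(torch.randn(100, 92), torch.rand(100, 4), torch.tensor([1, 5, 7]), torch.rand(3, 4))

Each ground-truth object is matched to exactly one predicted slot; unmatched slots are trained to predict the "no object" class, which is what removes the need for NMS.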
22/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
DETR (DEtection TRansformer) at a high level
23/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
DETR in detail
24/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Encoder self-attention
25/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
NMS & OOD
26/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Decoder attention
27/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Decoder output slot
28/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
DETR PyTorch inference code: very simple
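The original slide shows the paper's short PyTorch listing; since it did not survive extraction, here is a condensed sketch in the same spirit (the positional encoding, hyper-parameters, and heads are simplified; this is not the official implementation):

import torch
from torch import nn
from torchvision.models import resnet50

class SimpleDETR(nn.Module):
    def __init__(self, num_classes=91, d=256, nheads=8, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.conv = nn.Conv2d(2048, d, 1)                     # project CNN features to d
        self.transformer = nn.Transformer(d, nheads, 6, 6)
        self.query_embed = nn.Parameter(torch.rand(num_queries, d))  # learned object queries
        self.pos_embed = nn.Parameter(torch.rand(2500, d))           # naive learned positions
        self.class_head = nn.Linear(d, num_classes + 1)       # +1 for the "no object" class
        self.bbox_head = nn.Linear(d, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))                       # (B, d, H, W)
        B, d, H, W = h.shape
        src = h.flatten(2).permute(2, 0, 1)                   # (H*W, B, d) encoder tokens
        src = src + self.pos_embed[: H * W].unsqueeze(1)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)   # (num_queries, B, d)
        out = self.transformer(src, tgt)                      # (num_queries, B, d)
        return self.class_head(out), self.bbox_head(out).sigmoid()

logits, boxes = SimpleDETR()(torch.randn(1, 3, 256, 256))     # logits: (100, 1, 92), boxes: (100, 1, 4)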
29/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Motivation
• Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task.
• Vision Transformer (ViT) yields modest results when trained on mid-sized datasets such as ImageNet, achieving accuracies a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
• However, the picture changes if ViT is trained on large datasets (14M–300M images), i.e., large-scale training trumps inductive bias. Transformers attain excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints.
30/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Vision Transformer (ViT)
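A hedged sketch of the idea in the title (the 16x16 patch size and 768-dimensional embedding follow the base configuration, but this is not the authors' code): split the image into patches, flatten and project each patch, prepend a class token, add position embeddings, and feed the resulting token sequence to a standard Transformer encoder.

import torch
from torch import nn

class PatchEmbed(nn.Module):
    # "An image is worth 16x16 words": turn a (3, 224, 224) image into 196 patch tokens.
    def __init__(self, img_size=224, patch=16, d=768):
        super().__init__()
        self.n = (img_size // patch) ** 2
        self.proj = nn.Conv2d(3, d, kernel_size=patch, stride=patch)  # flatten + project per patch
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        self.pos = nn.Parameter(torch.zeros(1, self.n + 1, d))

    def forward(self, x):                                    # x: (B, 3, 224, 224)
        t = self.proj(x).flatten(2).transpose(1, 2)          # (B, 196, d) patch tokens
        t = torch.cat([self.cls.expand(x.size(0), -1, -1), t], dim=1)
        return t + self.pos                                  # (B, 197, d) -> Transformer encoder

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))           # torch.Size([2, 197, 768])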
31/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
ViT vs. BiT (Alexander Kolesnikov et al. Big Transfer (BiT): General visual representation learning. In ECCV, 2020.)
32/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Performance vs. pre-training samples
33/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Performance vs. cost
34/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Image Classification
35/ 36
Concluding Remarks
• Transformers are competent at modeling inter-relationships between elements (pixels, video clips, image patches, …).
• If a Transformer is pre-trained on a sufficient amount of data, it can replace the CNN and still performs well.
• The Transformer is an even more generic architecture than the MLP (I think…).
• Not only in NLP: the Transformer also shows astonishing results in vision!
• But the Transformer is known to have quadratic complexity.
• Further reading that reduces the quadratic complexity to linear complexity: "Rethinking Attention with Performers" (ICLR 2021 under review).
Thank You
shmwoo9395@{gist.ac.kr, gmail.com}

Editor's Notes

• #12: We adjust the concentration of the distribution we sample from with a temperature τ > 0, by which we divide the logits for the channel intensities.
• #21: Remove duplicates.
• #31: Due to the scalability of the attention mechanism, a naive application of self-attention to images would require each pixel to attend to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes.