Transformer in Vision
Sangmin Woo
2020.10.29
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2 / 36
Contents
[2018 ICML] Image Transformer
Niki Parmar1 Ashish Vaswani1 Jakob Uszkoreit1 Łukasz Kaiser1 Noam Shazeer1 Alexander Ku2,3 Dustin Tran4
1Google Brain, Mountain View, USA
2Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
3Work done during an internship at Google Brain
4Google AI, Mountain View, USA.
[2019 CVPR] Video Action Transformer Network
Rohit Girdhar1 João Carreira2 Carl Doersch2 Andrew Zisserman2,3
1Carnegie Mellon University 2DeepMind 3University of Oxford
[2020 ECCV] End-to-End Object Detection with Transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko
Facebook AI
[2021 ICLR under review] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Google Research, Brain Team
3 / 36
Background
Attention Is All You Need [2017 NIPS]
• The main idea of the original architecture is to compute self-attention by comparing a feature to all other features in the sequence.
• Features are first mapped to query (Q) and memory (key and value, K & V) embeddings using linear projections.
• The output for a query is computed as an attention-weighted sum of the values (V), with the attention weights obtained from the product of the query (Q) with the keys (K).
• In practice, the query (Q) is the word being translated, and the keys (K) and values (V) are linear projections of the input sequence and of the output sequence generated so far.
• A positional encoding is also added to these representations to incorporate the positional information that is otherwise lost in this setup (a minimal sketch of the attention computation follows the reference below).
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
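As a concrete reference, here is a minimal sketch of the scaled dot-product attention described above (shapes and the helper name are illustrative, not code from the paper):

import torch

def scaled_dot_product_attention(q, k, v):
    # q: (n_query, d); k and v: (n_memory, d)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # query-key products
    weights = scores.softmax(dim=-1)              # attention weights
    return weights @ v                            # attention-weighted sum of values

# toy example: 4 queries attending over 6 memory positions of dimension 8
q, k, v = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 8)
out = scaled_dot_product_attention(q, k, v)       # shape (4, 8)

In the full model, q, k, and v come from learned linear projections, and the positional encoding is added to the inputs beforehand.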
4 / 36
Image Transformer [2018 ICML]
Generative models (Image Generation, Super-Resolution, Image Completion)
5 / 36
Image Transformer [2018 ICML]
Pixel-RNN / Pixel-CNN (van den Oord et al., 2016)
• Straightforward
• Tractable likelihood
• Simple and stable
[2] van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. ICML, 2016.
[3] van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. NIPS, 2016.
[Figures: Pixel-RNN, Pixel-CNN]
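Both models owe their tractable likelihood to the same autoregressive chain-rule factorization over (sub)pixels; in LaTeX notation:

p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})

Each factor is a categorical distribution over intensity values, predicted from the previously generated pixels by the RNN or the masked CNN.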
6 / 36
Image Transformer [2018 ICML]
Motivation
• Pixel-RNN and Pixel-CNN cast image generation as a sequence modeling problem, applying an RNN or CNN to predict each next pixel given all previously generated pixels.
• RNNs are computationally heavy.
• CNNs are parallelizable.
• CNNs have a limited receptive field → long-range dependency problem → stack more layers? → expensive.
• RNNs have a virtually unlimited receptive field.
• Self-attention can achieve a better balance in the trade-off between the virtually unlimited receptive field of the necessarily sequential PixelRNN and the limited receptive field of the much more parallelizable PixelCNN and its various extensions.
7 / 36
Image Transformer [2018 ICML]
Image Completion & Super-resolution
8 / 36
Image Transformer [2018 ICML]
Image Transformer
• 𝑞: a single channel of one pixel (query)
• 𝑚1, 𝑚2, 𝑚3: memory of previously generated pixels (keys)
• 𝑝𝑞, 𝑝1, 𝑝2, 𝑝3: position encodings
• 𝑐𝑚𝑝: first embed the query and key, then apply the dot product.
9 / 36
Image Transformer [2018 ICML]
Local Self-Attention
• The scalability issue lies in the self-attention mechanism.
• Restrict the positions in the memory matrix M to a local neighborhood around the query position (sketched below).
• 1D vs. 2D attention
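A rough sketch of the restriction (the window size, the causal ordering over a flattened pixel sequence, and the omission of learned projections are simplifications for illustration; the paper attends within 1D or 2D query blocks):

import torch

def local_masked_attention(x, window=8):
    # x: (n, d) features of the flattened, already-generated pixel sequence
    n, d = x.shape
    q, k, v = x, x, x                                  # linear projections omitted
    scores = q @ k.t() / d ** 0.5                      # (n, n) pairwise scores
    idx = torch.arange(n)
    # each position may only attend to itself and the `window` - 1 preceding positions
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    scores = scores.masked_fill(~allowed, float('-inf'))
    return scores.softmax(dim=-1) @ v                  # (n, d)

out = local_masked_attention(torch.randn(64, 16), window=8)

Note that this sketch still materializes the full score matrix; the actual benefit comes from slicing the memory to the local block, so the per-query cost stops growing with image size.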
10/ 36
Image Transformer [2018 ICML]
Image Generation
11 / 36
Image Transformer [2018 ICML]
Super-resolution
12/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition
[4] Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. ECCV, 2016.
13/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition & Localization
14/ 36
Video Action Transformer Network [2019 CVPR]
Action Transformer
• The Action Transformer unit takes as input the video feature representation and the box proposal from the RPN, and maps them into query (𝑄) and memory (𝐾 & 𝑉) features.
• Query (𝑄): the person being classified.
• Memory (𝐾 & 𝑉): the clip around the person.
• The unit processes the query (𝑄) and memory (𝐾 & 𝑉) to output an updated query vector (𝑄*).
• The intuition is that self-attention adds context from other people and objects in the clip to the query (𝑄) vector, to aid the subsequent classification.
• The unit can be stacked over multiple heads and layers by concatenating the outputs of the heads at a given layer and using the concatenated feature as the next query.
• The updated query (𝑄*) is then used to attend to context features again in the following layer (a rough sketch of the unit follows below).
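A hedged, single-head sketch of such a unit under the definitions above (the dimensions, the feed-forward block, and the layer placement are illustrative, not the paper's exact design):

import torch
import torch.nn as nn

class ActionTransformerUnit(nn.Module):
    # One attention unit: person feature -> query, clip features -> key/value.
    def __init__(self, d_person=1024, d_clip=1024, d=128):
        super().__init__()
        self.to_q = nn.Linear(d_person, d)
        self.to_k = nn.Linear(d_clip, d)
        self.to_v = nn.Linear(d_clip, d)
        self.ffn = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, person_feat, clip_feat):
        # person_feat: (d_person,) RoI-pooled feature of the person box
        # clip_feat: (T*H*W, d_clip) flattened spatio-temporal features of the clip
        q = self.to_q(person_feat)                      # query from the person
        k, v = self.to_k(clip_feat), self.to_v(clip_feat)
        w = (k @ q / q.numel() ** 0.5).softmax(dim=0)   # attention of the person over the clip
        updated = q + w @ v                             # residual update with clip context
        return updated + self.ffn(updated)              # Q*: updated query vector

unit = ActionTransformerUnit()
q_star = unit(torch.randn(1024), torch.randn(8 * 16 * 16, 1024))   # shape (128,)

Stacking several such heads and concatenating their outputs gives the next layer's query, as described above.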
15/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition
16/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition
17/ 36
Video Action Transformer Network [2019 CVPR]
Action Recognition
18/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Object Detection
[5] Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
19/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Faster R-CNN
[5] Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
20/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Non-Maximum Suppression (NMS)
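For reference, a minimal greedy NMS sketch (the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are just conventional defaults; in practice torchvision.ops.nms provides the same operation):

import torch
from torchvision.ops import box_iou

def nms(boxes, scores, iou_thresh=0.5):
    # keep the highest-scoring box, drop remaining boxes that overlap it too much, repeat
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thresh]
    return keep

kept = nms(torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]]),
           torch.tensor([0.9, 0.8, 0.7]))   # -> [0, 2]; box 1 is suppressed as a duplicate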
21/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Motivation
• Multi-stage pipeline.
• Too many hand-crafted components and heuristics (e.g., non-maximum suppression, anchor boxes) that explicitly encode our prior knowledge about the task.
• Let's simplify these pipelines with an end-to-end philosophy!
• Let's remove the need for heuristics with direct set prediction!
• Unique predictions are forced via a bipartite matching loss between predicted and ground-truth objects.
• Encoder-decoder architecture based on the Transformer.
• Transformers explicitly model all pairwise interactions between elements in a sequence, which makes them particularly suitable for the specific constraints of set prediction, such as removing duplicate predictions (a sketch of the matching step follows below).
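A hedged sketch of the bipartite matching step (the cost below uses only the class probability and an L1 box term with a made-up weight; DETR's full matching cost also includes a generalized-IoU term):

import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (N, C), pred_boxes: (N, 4); gt_labels: (M,), gt_boxes: (M, 4)
    prob = pred_logits.softmax(-1)                       # class probabilities
    cost_class = -prob[:, gt_labels]                     # (N, M): higher prob -> lower cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M): L1 box distance
    cost = cost_class + 5.0 * cost_bbox                  # weighting is illustrative
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())  # Hungarian algorithm
    return pred_idx, gt_idx                              # unique prediction per ground truth

# toy example: 100 predicted slots, 3 ground-truth objects, 92 classes
pi, gi = match(torch.randn(100, 92), torch.rand(100, 4), torch.tensor([1, 5, 7]), torch.rand(3, 4))

Each ground-truth object is matched to exactly one predicted slot; unmatched slots are trained to predict the "no object" class, which is what removes the need for NMS.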
22/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
DETR (DEtection TRansformer) at a high level
23/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
DETR in detail
24/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Encoder self-attention
25/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
NMS & OOD
26/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Decoder attention
27/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
Decoder output slot
28/ 36
End-to-End Object Detection with Transformers [2020 ECCV]
DETR PyTorch inference code: very simple
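The original slide shows the paper's short PyTorch listing; since it did not survive extraction, here is a condensed sketch in the same spirit (the positional encoding, hyper-parameters, and heads are simplified; this is not the official implementation):

import torch
from torch import nn
from torchvision.models import resnet50

class SimpleDETR(nn.Module):
    def __init__(self, num_classes=91, d=256, nheads=8, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.conv = nn.Conv2d(2048, d, 1)                     # project CNN features to d
        self.transformer = nn.Transformer(d, nheads, 6, 6)
        self.query_embed = nn.Parameter(torch.rand(num_queries, d))  # learned object queries
        self.pos_embed = nn.Parameter(torch.rand(2500, d))           # naive learned positions
        self.class_head = nn.Linear(d, num_classes + 1)       # +1 for the "no object" class
        self.bbox_head = nn.Linear(d, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))                       # (B, d, H, W)
        B, d, H, W = h.shape
        src = h.flatten(2).permute(2, 0, 1)                   # (H*W, B, d) encoder tokens
        src = src + self.pos_embed[: H * W].unsqueeze(1)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)   # (num_queries, B, d)
        out = self.transformer(src, tgt)                      # (num_queries, B, d)
        return self.class_head(out), self.bbox_head(out).sigmoid()

logits, boxes = SimpleDETR()(torch.randn(1, 3, 256, 256))     # logits: (100, 1, 92), boxes: (100, 1, 4)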
29/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Motivation
• Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task.
• Vision Transformer (ViT) yields modest results when trained on mid-sized datasets such as ImageNet, achieving accuracies a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
• However, the picture changes if ViT is trained on large datasets (14M–300M images), i.e., large-scale training trumps inductive bias. Transformers attain excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints.
30/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Vision Transformer (ViT)
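A hedged sketch of the idea in the title (the 16x16 patch size and 768-dimensional embedding follow the base configuration, but this is not the authors' code): split the image into patches, flatten and project each patch, prepend a class token, add position embeddings, and feed the resulting token sequence to a standard Transformer encoder.

import torch
from torch import nn

class PatchEmbed(nn.Module):
    # "An image is worth 16x16 words": turn a (3, 224, 224) image into 196 patch tokens.
    def __init__(self, img_size=224, patch=16, d=768):
        super().__init__()
        self.n = (img_size // patch) ** 2
        self.proj = nn.Conv2d(3, d, kernel_size=patch, stride=patch)  # flatten + project per patch
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        self.pos = nn.Parameter(torch.zeros(1, self.n + 1, d))

    def forward(self, x):                                    # x: (B, 3, 224, 224)
        t = self.proj(x).flatten(2).transpose(1, 2)          # (B, 196, d) patch tokens
        t = torch.cat([self.cls.expand(x.size(0), -1, -1), t], dim=1)
        return t + self.pos                                  # (B, 197, d) -> Transformer encoder

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))           # torch.Size([2, 197, 768])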
31/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
ViT vs. BiT (Alexander Kolesnikov et al. Big Transfer (BiT): General visual representation learning. In ECCV, 2020.)
32/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Performance vs. pre-training samples
33/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Performance vs. cost
34/ 36
An Image is Worth 16x16 Words [2021 ICLR under review]
Image Classification
35/ 36
Concluding Remarks
• Transformers are competent at modeling inter-relationships between elements (pixels, video clips, image patches, …).
• If a Transformer is pre-trained on a sufficient amount of data, it can replace the CNN and still performs well.
• The Transformer is an even more generic architecture than the MLP (I think…).
• Not only in NLP: the Transformer also shows astonishing results in vision!
• But the Transformer is known to have quadratic complexity.
• Further reading that reduces the quadratic complexity to linear complexity: "Rethinking Attention with Performers" (ICLR 2021 under review).
Thank You
shmwoo9395@{gist.ac.kr, gmail.com}

Editor's Notes

• #12: We adjust the concentration of the distribution we sample from with a temperature τ > 0, by which we divide the logits for the channel intensities.
• #21: Remove duplicates.
• #31: Due to the scalability of the attention mechanism, a naive application of self-attention to images would require each pixel to attend to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes.