Transformer in Vision
Sangmin Woo
2020.10.29
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Contents
[2018 ICML] Image Transformer
Niki Parmar1, Ashish Vaswani1, Jakob Uszkoreit1, Łukasz Kaiser1, Noam Shazeer1, Alexander Ku2,3, Dustin Tran4
1Google Brain, Mountain View, USA
2Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
3Work done during an internship at Google Brain
4Google AI, Mountain View, USA.
[2019 CVPR] Video Action Transformer Network
Rohit Girdhar1, João Carreira2, Carl Doersch2, Andrew Zisserman2,3
1Carnegie Mellon University 2DeepMind 3University of Oxford
[2020 ECCV] End-to-End Object Detection with Transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko
Facebook AI
[2021 ICLR under review] An Image is Worth 16x16 Words: Transformers for Image
Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Google Research, Brain Team
Background
 Attention is all you need [2017 NIPS]
• The main idea of the original architecture is to compute self-attention by
comparing a feature to all other features in the sequence.
• Features are first mapped to query (Q) and memory (key and value, K & V) embeddings using linear projections.
• The output for the query is computed as an attention weighted sum of
values (V), with the attention weights obtained from the product of the
query (Q) with keys (K).
• In practice, query (Q) is the word being translated, and keys (K) and
values (V) are linear projections of the input sequence and the output
sequence generated so far.
• A positional encoding is also added to these representations in order to
incorporate positional information which is lost in this setup.
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
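A minimal PyTorch sketch of this mechanism; the shapes, dimensions, and layer names are illustrative assumptions, not from the paper:

```python
import math

import torch
import torch.nn as nn

d_model, n_tokens = 64, 10
x = torch.randn(n_tokens, d_model)   # input sequence features

# Map features to query / key / value embeddings with linear projections
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
q, k, v = W_q(x), W_k(x), W_v(x)

# Attention weights from scaled query-key dot products; the output is an
# attention-weighted sum of the values
attn = torch.softmax(q @ k.T / math.sqrt(d_model), dim=-1)  # (n_tokens, n_tokens)
out = attn @ v
```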
Image Transformer [2018 ICML]
 Generative models (Image Generation, Super-Resolution, Image
Completion)
Image Transformer [2018 ICML]
 Pixel-RNN / Pixel-CNN (van den Oord et al., 2016)
• Straightforward
• Tractable likelihood
• Simple and stable
[2] van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. ICML, 2016.
[3] van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. NIPS, 2016.
(Figure: Pixel-RNN vs. Pixel-CNN)
Image Transformer [2018 ICML]
 Motivation
• Pixel-RNN and Pixel-CNN turn image generation into a sequence modeling problem by applying an RNN or CNN to predict each next pixel given all previously generated pixels.
• RNNs are computationally heavy.
• CNNs are parallelizable.
• CNNs have a limited receptive field → long-range dependency problem → stacking more layers helps, but is expensive.
• RNNs have a virtually unlimited receptive field.
• Self-attention can achieve a better balance in the trade-off between the virtually unlimited receptive field of the necessarily sequential Pixel-RNN and the limited receptive field of the much more parallelizable Pixel-CNN and its various extensions.
Image Transformer [2018 ICML]
 Image Completion & Super-resolution
Image Transformer [2018 ICML]
 Image Transformer
• 𝑞: a single channel of one pixel (the query)
• 𝑚1, 𝑚2, 𝑚3: memory of previously generated pixels (the keys)
• 𝑝𝑞, 𝑝1, 𝑝2, 𝑝3: position encodings
• 𝑐𝑚𝑝: first embed the query and keys, then apply a dot product (see the sketch below)
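A tiny single-head sketch of this per-pixel step in PyTorch; the dimensions and projection layers are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

d = 32
q = torch.randn(d)        # embedding of one channel of the query pixel
m = torch.randn(3, d)     # m1..m3: previously generated pixels (memory)
p_q = torch.randn(d)      # position encoding of the query
p = torch.randn(3, d)     # p1..p3: position encodings of the memory

W_q = nn.Linear(d, d, bias=False)
W_k = nn.Linear(d, d, bias=False)
W_v = nn.Linear(d, d, bias=False)

query = W_q(q + p_q)                    # embed the query (with its position)
keys, values = W_k(m + p), W_v(m + p)   # embed the memory as keys and values

w = torch.softmax(query @ keys.T / d ** 0.5, dim=-1)  # cmp: dot product + softmax
q_new = w @ values                      # attention-weighted update of q
```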
Image Transformer [2018 ICML]
 Local Self-Attention
• The scalability issue lies in the self-attention mechanism.
• Restrict the positions in the memory matrix M to a local neighborhood around the query position (see the sketch below).
• 1d vs. 2d Attention
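A minimal sketch of the 1-D local restriction, assuming a symmetric window; note a real implementation would avoid materializing the full n × n score matrix:

```python
import torch

n, d, radius = 16, 32, 2
x = torch.randn(n, d)                    # one row of pixel features (queries = keys)
scores = x @ x.T / d ** 0.5              # all pairwise attention scores (n, n)

# Keep only memory positions within `radius` of each query position
idx = torch.arange(n)
local = (idx[None, :] - idx[:, None]).abs() <= radius
scores = scores.masked_fill(~local, float("-inf"))

attn = torch.softmax(scores, dim=-1)     # each query attends to its neighborhood only
out = attn @ x
```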
Image Transformer [2018 ICML]
 Image Generation
Image Transformer [2018 ICML]
 Super-resolution
Video Action Transformer Network [2019 CVPR]
 Action Recognition
[4] Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. ECCV, 2016.
Video Action Transformer Network [2019 CVPR]
 Action Recognition & Localization
Video Action Transformer Network [2019 CVPR]
 Action Transformer
• The Action Transformer unit takes as input the video feature representation and the box proposal from the RPN, and maps them into query (𝑄) and memory (𝐾 & 𝑉) features.
• Query (𝑄): the person being classified.
• Memory (𝐾 & 𝑉): the clip around the person.
• The unit processes the query (𝑄) and memory (𝐾 & 𝑉) to output an updated query vector (𝑄∗).
• The intuition is that self-attention will add context from other people and objects in the clip to the query (𝑄) vector, to aid the subsequent classification.
• The unit can be stacked in multiple heads and layers by concatenating the outputs of the multiple heads at a given layer and using the concatenated feature as the next query.
• The updated query (𝑄∗) is then used to attend to context features again in the following layer (see the sketch below).
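A hedged single-head sketch of the unit described above; the feature shapes and the residual update are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

d = 128
person_feat = torch.randn(d)       # RoI feature of the person box proposal (from the RPN)
clip_feat = torch.randn(64, d)     # spatio-temporal features of the surrounding clip

W_q = nn.Linear(d, d, bias=False)  # query projection
W_k = nn.Linear(d, d, bias=False)  # key projection
W_v = nn.Linear(d, d, bias=False)  # value projection

Q = W_q(person_feat)                    # query: the person being classified
K, V = W_k(clip_feat), W_v(clip_feat)   # memory: context around the person

w = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)  # attention over clip locations
Q_star = Q + w @ V                 # updated query Q* (residual add is an assumption)
# Stacked layers would use Q_star as the next query; the final Q* feeds the classifier.
```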
Video Action Transformer Network [2019 CVPR]
 Action Recognition
Video Action Transformer Network [2019 CVPR]
 Action Recognition
Video Action Transformer Network [2019 CVPR]
 Action Recognition
End-to-End Object Detection with Transformers [2020 ECCV]
 Object Detection
[5] Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
End-to-End Object Detection with Transformers [2020 ECCV]
 Faster R-CNN
[5] Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
End-to-End Object Detection with Transformers [2020 ECCV]
 Non-Maximum Suppression (NMS)
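For reference, a minimal sketch of the NMS heuristic itself (the hand-crafted step that DETR later removes); torchvision.ops.nms offers a tested implementation:

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) tensor as (x1, y1, x2, y2); returns indices of kept boxes."""
    order = scores.argsort(descending=True)   # process highest-scoring boxes first
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        top, rest = boxes[i], boxes[order[1:]]
        # Intersection-over-union of the kept box against the remaining boxes
        lt = torch.maximum(top[:2], rest[:, :2])
        rb = torch.minimum(top[2:], rest[:, 2:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        areas = (top[2:] - top[:2]).prod() + (rest[:, 2:] - rest[:, :2]).prod(dim=1)
        iou = inter / (areas - inter)
        order = order[1:][iou <= iou_thresh]  # suppress near-duplicate boxes
    return torch.tensor(keep)
```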
End-to-End Object Detection with Transformers [2020 ECCV]
 Motivation
• Multi-stage pipeline.
• Too many hand-crafted components and heuristics (e.g., non-maximum suppression, anchor boxes) that explicitly encode our prior knowledge about the task.
• Let's simplify these pipelines with an end-to-end philosophy!
• Let's remove the need for heuristics with direct set prediction!
• Forces unique predictions via a bipartite matching loss between predicted and ground-truth objects (see the matching sketch below).
• Encoder-decoder architecture based on the Transformer.
• Transformers explicitly model all pairwise interactions between elements in a sequence, which is particularly suitable for the specific constraints of set prediction, such as removing duplicate predictions.
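A small sketch of the bipartite matching step using the Hungarian algorithm from SciPy; the random cost matrix here is a stand-in for DETR's combined classification and box costs:

```python
import torch
from scipy.optimize import linear_sum_assignment

n_queries, n_gt = 5, 3
cost = torch.rand(n_queries, n_gt)   # pairwise prediction-to-ground-truth cost

rows, cols = linear_sum_assignment(cost.numpy())  # optimal one-to-one assignment
# Prediction rows[i] is matched to ground-truth cols[i]; predictions left
# unmatched are supervised to output the "no object" class.
```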
End-to-End Object Detection with Transformers [2020 ECCV]
 DETR (DEtection TRansformer) at a high level
End-to-End Object Detection with Transformers [2020 ECCV]
 DETR in detail
End-to-End Object Detection with Transformers [2020 ECCV]
 Encoder self-attention
End-to-End Object Detection with Transformers [2020 ECCV]
 NMS & OOD
End-to-End Object Detection with Transformers [2020 ECCV]
 Decoder attention
End-to-End Object Detection with Transformers [2020 ECCV]
 Decoder output slot
End-to-End Object Detection with Transformers [2020 ECCV]
 DETR PyTorch inference code: Very simple
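The slide shows this code as an image; below is a hedged reconstruction in the spirit of the paper's short PyTorch demo (layer choices are approximate, batch size 1 for simplicity):

```python
import torch
from torch import nn
from torchvision.models import resnet50

class SimpleDETR(nn.Module):
    """Hedged reconstruction of a minimal DETR-style model (batch size 1)."""
    def __init__(self, num_classes, d=256, nheads=8, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # CNN features
        self.conv = nn.Conv2d(2048, d, 1)                    # reduce channel dimension
        self.transformer = nn.Transformer(d, nheads, 6, 6)   # encoder-decoder
        self.query_embed = nn.Parameter(torch.rand(num_queries, d))  # object queries
        self.row_embed = nn.Parameter(torch.rand(50, d // 2))        # learned 2-D
        self.col_embed = nn.Parameter(torch.rand(50, d // 2))        # positional encodings
        self.class_head = nn.Linear(d, num_classes + 1)      # +1 for "no object"
        self.bbox_head = nn.Linear(d, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))                      # (1, d, H, W)
        H, W = h.shape[-2:]
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)                # (H*W, 1, d)
        src = pos + h.flatten(2).permute(2, 0, 1)            # add positions to features
        out = self.transformer(src, self.query_embed.unsqueeze(1))  # (num_queries, 1, d)
        return self.class_head(out), self.bbox_head(out).sigmoid()

# Example: logits, boxes = SimpleDETR(num_classes=91)(torch.randn(1, 3, 800, 1066))
```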
An Image is Worth 16x16 Words [2021 ICLR under review]
 Motivation
• Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task.
• Vision Transformer (ViT) yields modest results when trained on mid-sized datasets such as ImageNet, achieving accuracies a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
• However, the picture changes if ViT is trained on large datasets (14M-300M images), i.e., large-scale training trumps inductive bias. Transformers attain excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints.
An Image is Worth 16x16 Words [2021 ICLR under review]
 Vision Transformer (ViT)
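A minimal sketch of ViT's input pipeline, the "16x16 words": split the image into patches, linearly embed them, prepend a class token, and add position embeddings. The zero tensors stand in for what are learned parameters in the real model:

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                  # one input image
patch, d = 16, 768
n_patches = (224 // patch) ** 2                    # 14 x 14 = 196 patches

# Split into non-overlapping 16x16 patches and flatten each one
patches = nn.Unfold(kernel_size=patch, stride=patch)(img)  # (1, 3*16*16, 196)
patches = patches.transpose(1, 2)                  # (1, 196, 768)

embed = nn.Linear(3 * patch * patch, d)            # linear patch embedding
x = embed(patches)                                 # (1, 196, d)

cls_token = torch.zeros(1, 1, d)                   # learnable [class] token (stand-in)
pos_embed = torch.zeros(1, n_patches + 1, d)       # learnable position emb. (stand-in)
x = torch.cat([cls_token, x], dim=1) + pos_embed   # (1, 197, d): encoder input
# The encoder output at the [class] token position feeds the classification head.
```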
An Image is Worth 16x16 Words [2021 ICLR under review]
 ViT vs. BiT (Alexander Kolesnikov et al. Big transfer (BiT): General visual representation learning. In ECCV, 2020.)
An Image is Worth 16x16 Words [2021 ICLR under review]
 Performance vs. pre-training samples
An Image is Worth 16x16 Words [2021 ICLR under review]
 Performance vs. cost
An Image is Worth 16x16 Words [2021 ICLR under review]
 Image Classification
Concluding Remarks
 Transformers are competent at modeling inter-relationships between pixels (video clips, image patches, …)
 If a Transformer is pre-trained on a sufficient amount of data, it can replace the CNN and still performs well
 The Transformer is an even more generic architecture than the MLP (I think…)
 Not only in NLP: the Transformer also shows astonishing results in vision!
 But the Transformer is known to have quadratic complexity
 Here is further reading that reduces the quadratic complexity to linear (a generic linear-attention sketch follows below):
 "Rethinking Attention with Performers" (ICLR 2021 under review)
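Not Performers itself, but a generic kernel-feature linear-attention sketch showing how the quadratic softmax(QKᵀ)V can be reordered so the cost is linear in sequence length; the feature map here is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))

phi = lambda x: F.elu(x) + 1     # a simple positive feature map (an assumption)

kv = phi(K).T @ V                # (d, d): cost O(n * d^2), no n x n matrix formed
z = phi(Q) @ phi(K).sum(dim=0)   # per-query normalizer, (n,)
out = (phi(Q) @ kv) / z.unsqueeze(-1)  # (n, d) attention output
```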
Thank You
shmwoo9395@{gist.ac.kr, gmail.com}
Editor's Notes
1. We adjust the concentration of the distribution we sample from with a temperature τ > 0, by which we divide the logits for the channel intensities.
2. Remove duplicates.
3. Naive application of self-attention to images would require that each pixel attend to every other pixel; with quadratic cost in the number of pixels, this does not scale to realistic input sizes.