ViT (Vision Transformer) Review [CDM]

An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
Anonymous (ICLR 2021 under review)
Yonsei University Severance Hospital CCIDS
Choi Dongmin

Abstract
• Transformer 
- standard architecture for NLP
• Convolutional Networks 
- in vision, attention is applied alongside CNNs or replaces some of their components, keeping their overall structure

• Transformer in Computer Vision 
- a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches 
- achieves S.O.T.A results with substantially less compute to train when pre-trained on large datasets

Introduction
• Self-attention-based architectures (e.g. Transformer, BERT)
- the dominant approach: pre-training on a large text corpus and then fine-tuning on a smaller task-specific dataset
Vaswani et al. Attention Is All You Need. NIPS 2017

Introduction
• Self-Attention in CV inspired by NLP
- DETR: Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
- Axial-DeepLab: Wang et al. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020
- however, in large-scale image recognition, classic ResNet-like architectures are still S.O.T.A

Introduction
• Applying a Transformer Directly to Images 
- with the fewest possible modifications 
- provide the sequence of linear embeddings of the patches as an input 
- image patches = tokens (words) in NLP
• Small Scale Training 
- achieves accuracy below ResNets of comparable size 
- Transformers lack some inductive biases inherent to CNNs (such as translation equivariance and locality)
• Large Scale Training 
- trumps (surpasses) inductive bias 
- excellent results when pre-trained at sufficient scale and transferred

Related Works
Transformer
- Standard model in NLP tasks
- Consists only of attention modules, without RNNs
- Encoder-decoder
- Requires a large-scale dataset and high computational cost
- Pre-training and fine-tuning approaches: BERT & GPT

Vaswani et al. Attention Is All You Need. NIPS 2017
Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL 2019
Radford et al. Improving language understanding with unsupervised learning. Technical Report 2018

Method
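• Image → a sequence of flattened 2D patches
$x \in \mathbb{R}^{H \times W \times C} \rightarrow x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$
• Trainable linear projection maps
$x_p \in \mathbb{R}^{N \times (P^2 \cdot C)} \rightarrow x_p E \in \mathbb{R}^{N \times D}$
* because the Transformer uses a constant width, the model dimension $D$, through all of its layers
• Learnable Position Embedding
$E_{pos} \in \mathbb{R}^{(N+1) \times D}$
* to retain positional information
[Figure labels: $z_0$, $L$]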
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L99-L111
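The patch-to-sequence step above can be sketched in plain Python. This is a minimal illustration, not the code from the linked repository: the function names (`patchify`, `linear_projection`, `build_input_sequence`) are hypothetical, images are nested lists rather than tensors, and it assumes the image sides are divisible by the patch size $P$, so that $N = HW / P^2$. The class token prepended in `build_input_sequence` is the reason $E_{pos}$ has $N + 1$ rows.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into N flattened patches.

    Returns N = (H / P) * (W / P) vectors, each of length P * P * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


def linear_projection(patches, weight):
    """Compute x_p E, where `weight` is E given as (P*P*C) rows of D values."""
    columns = list(zip(*weight))  # D columns, each of length P*P*C
    return [
        [sum(x * w for x, w in zip(patch, column)) for column in columns]
        for patch in patches
    ]


def build_input_sequence(embedded, class_token, pos_embedding):
    """Prepend the class token and add the (N + 1) x D position embedding."""
    tokens = [class_token] + embedded
    if len(tokens) != len(pos_embedding):
        raise ValueError("position embedding must have N + 1 rows")
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(tokens, pos_embedding)
    ]
```

For a 4 x 4 x 3 image with $P = 2$, `patchify` returns 4 patches of length 12, and the final sequence has 5 rows of width $D$.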

Method
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py
$z \in \mathbb{R}^{N \times D}$ : input sequence
Attention weight $A_{ij}$ : similarity between $q_i$ and $k_j$
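As a rough sketch of how the attention weights are formed, the snippet below implements single-head scaled dot-product self-attention in plain Python. It is an illustration under assumptions, not the linked implementation: the helper names and the separate projection matrices `w_q`, `w_k`, `w_v` are hypothetical, and multi-head splitting, the output projection, and layer normalization are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(z, w_q, w_k, w_v):
    """Return (A, A v) for an input sequence z of shape N x D.

    A[i][j] is the softmax-normalized, scaled dot product of q_i and k_j,
    i.e. how strongly token i attends to token j.
    """
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    scale = math.sqrt(len(q[0]))  # square root of the query/key dimension
    attention = [
        softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k])
        for q_i in q
    ]
    return attention, matmul(attention, v)
```

Each row of `A` sums to 1, so the output for token $i$ is a weighted average of the value vectors.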

Method
Hybrid Architecture
- uses the flattened intermediate feature maps of a ResNet as the input sequence, as in DETR

Method
Fine-tuning and Higher Resolution
Remove the pre-trained prediction head and attach a zero-initialized $D \times K$ feedforward layer ($K$ = the number of downstream classes)
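A small sketch of what zero initialization implies for the new head: every logit starts at 0, so the initial prediction is uniform over the $K$ downstream classes regardless of the pre-trained features. The function names are hypothetical, and the zero bias term is an assumption (the slide only specifies the $D \times K$ layer).

```python
import math


def make_zero_head(hidden_dim, num_classes):
    """Create a D x K weight matrix of zeros (plus an assumed zero bias)."""
    return {
        "weight": [[0.0] * num_classes for _ in range(hidden_dim)],
        "bias": [0.0] * num_classes,
    }


def apply_head(head, features):
    """Compute the K logits for one D-dimensional feature vector."""
    return [
        bias + sum(f * row[k] for f, row in zip(features, head["weight"]))
        for k, bias in enumerate(head["bias"])
    ]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With `hidden_dim=4` and `num_classes=10`, any feature vector yields ten probabilities of 0.1 before fine-tuning begins.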

Experiments
• Datasets 
< Pre-training > 
- ILSVRC-2012 ImageNet dataset : 1k classes / 1.3M images 
- ImageNet-21k : 21k classes / 14M images 
- JFT : 18k classes / 303M images 
< Downstream (Fine-tuning) > 
- ImageNet, ImageNet ReaL, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB
• Model Variants 
- e.g. ViT-L/16 = the “Large” variant with 16 × 16 input patch size
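The variant naming can be unpacked with a short helper. This is a sketch with hypothetical function names; the 224 × 224 input resolution used in the usage note is an assumption for illustration and is not stated on the slide. It also shows why a smaller patch size is more expensive: the sequence length grows with the inverse square of the patch size.

```python
SIZE_NAMES = {"B": "Base", "L": "Large", "H": "Huge"}


def parse_variant(name):
    """Split a name such as 'ViT-L/16' into (size name, patch size)."""
    model, patch = name.split("/")
    size_code = model.split("-")[1]
    return SIZE_NAMES[size_code], int(patch)


def sequence_length(image_size, patch_size):
    """Number of patches N for a square image (class token not included)."""
    if image_size % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (image_size // patch_size) ** 2
```

Under the assumed 224 × 224 input, `sequence_length(224, 16)` gives 196 patches, while `sequence_length(224, 32)` gives 49.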

Experiments
• Training & Fine-tuning 
< Pre-training> 
- Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$
- Batch size 4,096 
- Weight decay 0.1 (high weight decay is useful for transfer models) 
- Linear learning rate warmup and decay 

< Fine-tuning > 
- SGD with momentum, batch size 512

• Metrics 
- Few-shot accuracy (for fast on-the-fly evaluation) 
- Fine-tuning accuracy
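The pre-training learning rate schedule (linear warmup followed by decay) can be sketched as below. The function name is hypothetical, the decay is assumed to be linear down to zero, and the base learning rate, warmup length, and total step count in the usage note are illustrative values only, since the slide does not give them.

```python
def linear_warmup_linear_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: ramp up linearly, then decay linearly to 0."""
    if step < warmup_steps:
        # Warmup phase: 0 at step 0, reaching base_lr at step == warmup_steps.
        return base_lr * step / warmup_steps
    # Decay phase: base_lr at the end of warmup, 0 at total_steps and beyond.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)
```

For example, with `base_lr=1e-3`, `warmup_steps=1000`, and `total_steps=10000`, the rate peaks at step 1000 and is back at half the peak by step 5500.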

Experiments
• Comparison to State of the Art
* BiT-L : Big Transfer, which performs supervised transfer learning with large ResNets
  Kolesnikov et al. Big Transfer (BiT): General Visual Representation Learning. ECCV 2020
* Noisy Student : a large EfficientNet trained using semi-supervised learning
  Xie et al. Self-training with noisy student improves imagenet classification. CVPR 2020

Experiments
• Pre-training Data Requirements
[Figures: two plots, each annotated “Larger Dataset”]

Experiments
• Inspecting Vision Transformer
- the principal components of the learned embedding filters resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch
- attention distance is analogous to receptive field size in CNNs

Conclusion
• Application of Transformers to Image Recognition 
- no image-specific inductive biases in the architecture 
- interprets an image as a sequence of patches and processes it with a standard Transformer encoder 
- this simple yet scalable strategy works well when coupled with pre-training on large datasets 
- matches or exceeds the S.O.T.A on many image classification datasets while being relatively cheap to pre-train

• Many Challenges Remain 
- other computer vision tasks, such as detection and segmentation 
- further scaling ViT

Q&A
• ViT for Segmentation
• Fine-tuning on Grayscale Dataset
