Visual Transformers
Kwanghee Choi (Jonas)
Table of Contents
● Preliminary
○ Key, Value, Query, Attention
○ Pooling
○ Multi-head Attention
○ Unsupervised Representation Learning
○ Syntactic Knowledge
● State-of-the-art Papers
○ Generative Pretraining from Pixels (ICML 2020)
○ An Image is Worth 16x16 Words (ICLR 2021)
○ End-to-End Object Detection with Transformers (ECCV 2020)
○ Additional Works
Key, Value, Query, Attention
● Problem: Given a set of data points (xi, yi), find the unknown y for a new x.
● Simplest approach: average all the yi, ignoring the query x.
● A bit more sophisticated approach: the Watson-Nadaraya estimator (1964), which weights each yi by a kernel similarity between x and xi (a minimal sketch follows below).
● Key-value pairs (xi, yi)
● Query x
● Attention ⍺
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
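A minimal NumPy sketch of the estimator above, read as attention: the keys are the xi, the values are the yi, and a Gaussian kernel plus a softmax produces the attention weights ⍺ (the function name and toy data are illustrative):

```python
import numpy as np

def watson_nadaraya(x_query, x_keys, y_values, bandwidth=1.0):
    """Kernel-regression view of attention: the query attends to the keys,
    and the prediction is an attention-weighted average of the values."""
    # Gaussian similarity between the query and every key.
    scores = -0.5 * ((x_query - x_keys) / bandwidth) ** 2
    # Softmax turns the similarities into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Prediction: attention-weighted average of the values.
    return np.dot(weights, y_values)

# Toy key-value pairs: keys x_i with noisy values y_i = sin(x_i) + noise.
rng = np.random.default_rng(0)
x_keys = np.sort(rng.uniform(0, 5, 50))
y_values = np.sin(x_keys) + 0.1 * rng.standard_normal(50)
print(watson_nadaraya(2.0, x_keys, y_values))  # close to sin(2.0)
```

Transformers replace this fixed Gaussian kernel with a learned dot-product similarity between queries and keys.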
Pooling
● Nonlinearity ⍴, ɸ, learnable weight w
● Deep sets (Zaheer et al. 2017)
○ Permutation Invariant
● Word2Vec (Mikolov et al. 2013)
○ Embed each word in a sentence
● Attention Weighting (Wang et al. 2016)
○ The weighting ⍺ depends on the query x (the context)
● Iterative Attention Pooling (Yang et al. 2016)
○ Repeatedly update internal state qt
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
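As a concrete illustration of the Deep Sets form ⍴(Σi ɸ(xi)) listed above, here is a small NumPy sketch; ɸ and ⍴ are random linear/tanh stand-ins for learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.standard_normal((4, 8))   # element dim 4 -> hidden dim 8
W_rho = rng.standard_normal((8, 1))   # hidden dim 8 -> scalar output

def phi(x):
    # Per-element embedding (stand-in for a learned network).
    return np.tanh(x @ W_phi)

def rho(z):
    # Readout applied to the pooled representation (stand-in for a learned network).
    return z @ W_rho

def deep_set(X):
    """Permutation-invariant set function: rho(sum_i phi(x_i))."""
    return rho(phi(X).sum(axis=0))

X = rng.standard_normal((5, 4))                       # a set of 5 elements
perm = rng.permutation(5)
assert np.allclose(deep_set(X), deep_set(X[perm]))    # element order does not matter
print(deep_set(X))
```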
Multi-head Attention
● Attention module
○ Softmax acts as an attention function.
○ Dot product of Q and K acts as a similarity.
○ sqrt(dk): the standard deviation of the Q·K dot product when the entries of Q and K are ~ N(0, 1); dividing by it keeps the softmax inputs well-scaled.
● Multi-head Attention
○ A single head averages attention, which limits the ability to focus on different positions.
○ Multiple heads let the attention layer use different representation subspaces.
Attention Is All You Need (Vaswani et al. NeurIPS 2017)
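A minimal NumPy sketch of scaled dot-product attention and the multi-head variant described above (the head count, dimensions, and random projection matrices are illustrative, not the paper's configuration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V; sqrt(d_k) keeps the logits' variance near 1."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)           # (heads, n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over keys
    return weights @ V                                          # (heads, n, d_head)

def multi_head_attention(X, n_heads=4):
    """Split the model dimension into heads so each head attends in its own subspace."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(0)
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
    def split(W):  # project, then reshape (n, d_model) -> (heads, n, d_head)
        return (X @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(split(Wq), split(Wk), split(Wv))
    return heads.transpose(1, 0, 2).reshape(n, d_model) @ Wo    # concatenate and project

out = multi_head_attention(np.random.default_rng(1).standard_normal((10, 32)))
print(out.shape)  # (10, 32)
```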
Unsupervised Representation Learning
● Input sequence x = (x1, x2, …)
● Autoregressive (AR)
○ e.g., ELMo, GPT
○ No bidirectional context.
○ ELMo: forward and backward contexts must be trained separately.
● Autoencoding (AE)
○ Corrupted input x' = (x1, x2, …, [MASK], …)
○ e.g., BERT
○ Bidirectional self-attention
○ Input distribution differs between pre-training and fine-tuning due to the corruption ([MASK] never appears downstream).
Understanding XLNet https://www.borealisai.com/en/blog/understanding-xlnet
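A toy illustration of the two objectives, assuming a whitespace-tokenized sentence; the corruption here is simplified relative to BERT's actual 80/10/10 masking scheme:

```python
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Autoregressive (AR, GPT-style): predict x_t from x_<t only, left to right.
ar_examples = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]
# e.g. (['the', 'cat', 'sat'], 'on') -- no access to the right context.

# Autoencoding (AE, BERT-style): corrupt some positions with [MASK] and
# predict them from the full bidirectional context.
rng = np.random.default_rng(0)
mask_pos = rng.choice(len(tokens), size=2, replace=False)
corrupted = [("[MASK]" if i in mask_pos else tok) for i, tok in enumerate(tokens)]
ae_targets = {int(i): tokens[i] for i in mask_pos}

print(ar_examples[2])
print(corrupted, ae_targets)
```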
Syntactic Knowledge
● BERT representations are hierarchical rather
than linear.
○ Open Sesame: Getting Inside BERT’s Linguistic Knowledge
(Lin et al. ACLW 2019)
● BERT “naturally” learns some syntactic information, although it is not very similar to annotated linguistic resources.
○ Perturbed Masking: Parameter-free Probing for Analyzing
and Interpreting BERT (Wu et al. ACL 2020)
A Primer in BERTology: What we know about how BERT works (Rogers et al. TACL 2020)
Generative Pretraining from Pixels
ICML 2020, OpenAI
Towards a general “image” model
● Just as a general LM can generate coherent text, Image GPT can
generate coherent images.
● “Analysis by synthesis” suggests that a model that learns to generate images will also know about object categories.
● Generative sequence modeling is a universal unsupervised algorithm.
Image GPT (https://openai.com/blog/image-gpt/)
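A sketch of the sequence-modeling setup: flatten a (downsampled) image into a 1D sequence of pixel tokens in raster order and train next-token prediction exactly like a language model. iGPT additionally quantizes colors into a small palette, which is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3))   # a low-resolution RGB image

# Flatten in raster order: the image becomes a 1D sequence of pixel tokens.
sequence = image.reshape(-1, 3)                  # (1024, 3)

# Autoregressive training pair at position t: context = all earlier pixels,
# target = the pixel to predict, exactly as in next-word prediction.
t = 100
context, target = sequence[:t], sequence[t]
print(context.shape, target)                     # (100, 3) and one RGB pixel
```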
Approach
Generative Pretraining from Pixels (Chen et al. ICML 2020)
What representation works best?
● In supervised pre-training, representation quality tends to increase
monotonically with depth, but with generative pre-training, it is not
obvious whether a task like pixel prediction is relevant to image
classification.
● Representations first improve as a function of depth, and then,
starting around the middle layer, begin to deteriorate.
○ In the first phase, each position gathers information from its surrounding context in
order to build a more global image representation.
○ In the second phase, this contextualized input is used to solve the conditional next
pixel prediction task.
○ This could resemble the behavior of encoder-decoder architectures, but learned
within a monolithic architecture via a pre-training objective.
Generative Pretraining from Pixels (Chen et al. ICML 2020)
Performance on CIFAR dataset
● We find that both increasing the
scale of our models and training for
more iterations result in better
generative performance, which
directly translates into better
feature quality.
● Generative models produce much
better features than BERT models
after pre-training, but BERT
models catch up after fine-tuning.
Generative Pretraining from Pixels (Chen et al. ICML 2020)
An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
ICLR 2021, Google
When do Transformers work?
● When trained on mid-sized datasets (e.g., ImageNet), Transformers yield modest accuracies, a few percent below ResNets of comparable size.
● However, large-scale training (14M-300M images) trumps the inductive biases of CNNs, such as translation equivariance & locality.
● Naive application of self-attention to images would require that each
pixel attends to every other pixel. With quadratic cost in the number
of pixels, this does not scale to realistic input sizes.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Model overview
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
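A sketch of the tokenization step in the model overview: split the image into 16x16 patches, flatten each patch, project it linearly, prepend a classification token, and add position embeddings (all weights below are random stand-ins for learned parameters):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into N = (H/patch)*(W/patch) flattened patches."""
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches                                   # (N, patch*patch*C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patches = patchify(image)                            # (196, 768): the image as 196 "words"

d_model = 768
W_embed = rng.standard_normal((patches.shape[1], d_model)) / np.sqrt(patches.shape[1])
cls_token = rng.standard_normal((1, d_model))
pos_embed = rng.standard_normal((patches.shape[0] + 1, d_model))

tokens = np.concatenate([cls_token, patches @ W_embed], axis=0) + pos_embed
print(tokens.shape)   # (197, 768), fed to a standard Transformer encoder
```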
Performance
With self-supervised pre-training (masked patch prediction), the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant 2% improvement over training from scratch, but still 4% behind supervised pre-training.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Interpreting the Results
● Positional embeddings
○ We speculate that learning to represent the spatial relations in
this resolution (14 x 14) is equally easy for different strategies.
○ Closer patches tend to have more similar position embeddings.
○ Row-column structure & sinusoidal structure appear.
● Self-attention
○ “Attention distance” is analogous to “receptive field size”.
○ Highly localized attention may serve a similar function as early
convolutional layers in CNNs.
○ Model attends to image regions that are semantically relevant
for classification.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
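The position-embedding observation can be checked with a few lines: compute the cosine similarity of each patch's position embedding to all others and view it as a 14 x 14 grid. The embeddings below are random stand-ins, so the row-column structure only appears with a trained ViT checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
pos_embed = rng.standard_normal((14 * 14, 768))        # one embedding per patch location

# Cosine similarity between every pair of position embeddings.
normed = pos_embed / np.linalg.norm(pos_embed, axis=1, keepdims=True)
similarity = normed @ normed.T                         # (196, 196)

patch_idx = 7 * 14 + 7                                 # a central patch
sim_map = similarity[patch_idx].reshape(14, 14)        # its similarity to every location
print(sim_map.shape)
```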
End-to-End Object Detection
with Transformers
ECCV 2020, Facebook
End-to-end object detection
Object detection as a direct set prediction problem.
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Removing NMS
● Conventional CNN to learn a 2D representation + Positional encoding
● 100 learned positional embeddings as object queries
● Global reasoning using pairwise relations
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
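The direct set prediction view hinges on a one-to-one bipartite matching between the object queries and the ground-truth boxes, which is what makes NMS unnecessary. Below is a minimal sketch of that matching step using the Hungarian algorithm; the cost (L1 box distance minus class probability) is a simplification of DETR's full matching cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, pred_probs, gt_boxes, gt_labels):
    """One-to-one matching between predictions and ground truth (Hungarian algorithm)."""
    # Cost: L1 distance between boxes minus the probability of the correct class.
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cls_cost = -pred_probs[:, gt_labels]
    rows, cols = linear_sum_assignment(box_cost + cls_cost)
    return list(zip(rows, cols))                 # (query index, ground-truth index) pairs

rng = np.random.default_rng(0)
pred_boxes = rng.uniform(size=(100, 4))          # 100 object queries -> 100 box predictions
pred_probs = rng.dirichlet(np.ones(91), 100)     # class probabilities (no-object class omitted)
gt_boxes = rng.uniform(size=(3, 4))              # 3 ground-truth objects in this image
gt_labels = np.array([5, 17, 42])
print(match_predictions(pred_boxes, pred_probs, gt_boxes, gt_labels))
```

In DETR, the matched pairs receive box and class losses, while the unmatched queries are supervised toward the "no object" class.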
Encoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Decoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Object Detection
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Additional Works
Notable Extensions
● Training data-efficient image transformers & distillation through
attention (Touvron et al. Arxiv 2021)
○ Adds a distillation token to ViT; using only the classification token doesn’t help much.
○ Soft distillation (teacher model’s softmax output) and hard distillation (teacher model’s argmax with label smoothing).
○ Surpasses SOTA yet again.
● DALL·E: Creating Images from Text (Ramesh et al. 2021)
○ Decoder-only transformer that receives both the text and the image as a single
stream of tokens (Text: 256, Image: 1024) and models all of them autoregressively.
○ Creates images from text captions for a wide range of concepts expressible in natural
language.
Task-specific: Object Detection
● End-to-End Object Detection with Adaptive Clustering Transformer
(Zheng et al. Arxiv 2020)
○ ACT clusters the query features adaptively using Locality-Sensitive Hashing (LSH) and approximates the query-key interaction with a prototype-key interaction.
○ ACT can replace the original self-attention module in DETR without degrading the performance of the pre-trained DETR model.
● Deformable DETR: Deformable Transformers for End-to-End Object
Detection (Zhu et al. ICLR 2021)
○ Deformable DETR achieves better performance than DETR (especially on small objects) with 10× fewer training epochs.
○ Deformable attention module: attends to only a small set of prominent feature-map pixels and aggregates multi-scale features.
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Task-specific: Object Detection
● UP-DETR: Unsupervised Pre-training for Object Detection with
Transformers (Dai et al. Arxiv 2020)
○ Proposes a pretext task, random query patch detection, for unsupervised pre-training of DETR (UP-DETR) for object detection.
● Rethinking Transformer-based Set Prediction for Object Detection
(Sun et al. Arxiv 2020)
○ Encoder-only DETR significantly accelerates training for small-object detection, as it removes cross-attention.
○ Feature generation for transformer encoders with FCOS (Fully Convolutional
One-Stage object detector) or RCNN
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Task-specific: Segmentation
● MaX-DeepLab: End-to-End Panoptic Segmentation with Mask
Transformers (Wang et al. Arxiv 2020)
○ Infers masks and classes directly without hand-coded priors like object boxes.
○ Dual-path transformer enables CNNs to read and write a global memory at any layer.
● End-to-End Video Instance Segmentation with Transformers (Wang
et al. Arxiv 2020)
○ Three dimensional (temporal, horizontal and vertical) positional encoding
○ Instance sequence matching strategy: the loss is applied across different time steps
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Additional Tasks
● Learning Joint Spatial-Temporal Transformations for Video
Inpainting (Zeng et al. ECCV 2020)
● End-to-End Dense Video Captioning with Masked Transformer (Zhou
et al. CVPR 2018)
● Hand-Transformer: Non-Autoregressive Structured Modeling for 3D
Hand Pose Estimation (Huang et al. ECCV 2020)
● Taming Transformers for High-Resolution Image Synthesis (Esser et
al. Arxiv 2020)
● Pre-Trained Image Processing Transformer (Chen et al. Arxiv 2020)
○ ImageNet pre-training for image denoising/superresolution
A Survey on Visual Transformer (Han et al. Arxiv 2021)
