Visual Transformers
Kwanghee Choi (Jonas)
Table of Contents
● Preliminary
○ Key, Value, Query, Attention
○ Pooling
○ Multi-head Attention
○ Unsupervised Representation Learning
○ Syntactic Knowledge
● State-of-the-art Papers
○ Generative Pretraining from Pixels (ICML 2020)
○ An Image is Worth 16x16 Words (ICLR 2021)
○ End-to-End Object Detection with Transformers (ECCV 2020)
○ Additional Works
Key, Value, Query, Attention
● Problem: Given a set of data points (xi, yi), find the unknown y for a new x.
● Simplest approach: average all the yi, ignoring the query x.
● A bit more sophisticated approach: the Watson-Nadaraya estimator (1964), which weights each yi by a kernel similarity between x and xi (a minimal sketch follows below).
● Key-value pairs (xi, yi)
● Query x
● Attention ⍺
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
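A minimal NumPy sketch of the estimator above, read as attention: the keys are the xi, the values are the yi, and a Gaussian kernel plus a softmax produces the attention weights ⍺ (the function name and toy data are illustrative):

```python
import numpy as np

def watson_nadaraya(x_query, x_keys, y_values, bandwidth=1.0):
    """Kernel-regression view of attention: the query attends to the keys,
    and the prediction is an attention-weighted average of the values."""
    # Gaussian similarity between the query and every key.
    scores = -0.5 * ((x_query - x_keys) / bandwidth) ** 2
    # Softmax turns the similarities into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Prediction: attention-weighted average of the values.
    return np.dot(weights, y_values)

# Toy key-value pairs: keys x_i with noisy values y_i = sin(x_i) + noise.
rng = np.random.default_rng(0)
x_keys = np.sort(rng.uniform(0, 5, 50))
y_values = np.sin(x_keys) + 0.1 * rng.standard_normal(50)
print(watson_nadaraya(2.0, x_keys, y_values))  # close to sin(2.0)
```

Transformers replace this fixed Gaussian kernel with a learned dot-product similarity between queries and keys.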
Pooling
● Nonlinearity ⍴, ɸ, learnable weight w
● Deep sets (Zaheer et al. 2017)
○ Permutation Invariant
● Word2Vec (Mikolov et al. 2013)
○ Embed each word in a sentence
● Attention Weighting (Wang et al. 2016)
○ The weighting ⍺ depends on the query x (the context)
● Iterative Attention Pooling (Yang et al. 2016)
○ Repeatedly update internal state qt
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
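As a concrete illustration of the Deep Sets form ⍴(Σi ɸ(xi)) listed above, here is a small NumPy sketch; ɸ and ⍴ are random linear/tanh stand-ins for learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.standard_normal((4, 8))   # element dim 4 -> hidden dim 8
W_rho = rng.standard_normal((8, 1))   # hidden dim 8 -> scalar output

def phi(x):
    # Per-element embedding (stand-in for a learned network).
    return np.tanh(x @ W_phi)

def rho(z):
    # Readout applied to the pooled representation (stand-in for a learned network).
    return z @ W_rho

def deep_set(X):
    """Permutation-invariant set function: rho(sum_i phi(x_i))."""
    return rho(phi(X).sum(axis=0))

X = rng.standard_normal((5, 4))                       # a set of 5 elements
perm = rng.permutation(5)
assert np.allclose(deep_set(X), deep_set(X[perm]))    # element order does not matter
print(deep_set(X))
```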
Multi-head Attention
● Attention module
○ Softmax acts as an attention function.
○ Dot product of Q and K acts as a similarity.
○ sqrt(dk): the standard deviation of the Q·K dot product when the entries of Q and K are ~ N(0, 1); dividing by it keeps the softmax inputs well-scaled.
● Multi-head Attention
○ A single head averages attention, which limits the ability to focus on different positions.
○ Multiple heads let the attention layer use different representation subspaces.
Attention Is All You Need (Vaswani et al. NeurIPS 2017)
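A minimal NumPy sketch of scaled dot-product attention and the multi-head variant described above (the head count, dimensions, and random projection matrices are illustrative, not the paper's configuration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V; sqrt(d_k) keeps the logits' variance near 1."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)           # (heads, n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over keys
    return weights @ V                                          # (heads, n, d_head)

def multi_head_attention(X, n_heads=4):
    """Split the model dimension into heads so each head attends in its own subspace."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(0)
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
    def split(W):  # project, then reshape (n, d_model) -> (heads, n, d_head)
        return (X @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(split(Wq), split(Wk), split(Wv))
    return heads.transpose(1, 0, 2).reshape(n, d_model) @ Wo    # concatenate and project

out = multi_head_attention(np.random.default_rng(1).standard_normal((10, 32)))
print(out.shape)  # (10, 32)
```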
Unsupervised Representation Learning
● Input sequence x = (x1, x2, …)
● Autoregressive (AR)
○ e.g., ELMo, GPT
○ No bidirectional context.
○ ELMo: forward and backward contexts must be trained separately.
● Autoencoding (AE)
○ Corrupted input x' = (x1, x2, …, [MASK], …)
○ e.g., BERT
○ Bidirectional self-attention
○ Input distribution differs between pre-training and fine-tuning due to the corruption ([MASK] never appears downstream).
Understanding XLNet https://www.borealisai.com/en/blog/understanding-xlnet
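A toy illustration of the two objectives, assuming a whitespace-tokenized sentence; the corruption here is simplified relative to BERT's actual 80/10/10 masking scheme:

```python
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Autoregressive (AR, GPT-style): predict x_t from x_<t only, left to right.
ar_examples = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]
# e.g. (['the', 'cat', 'sat'], 'on') -- no access to the right context.

# Autoencoding (AE, BERT-style): corrupt some positions with [MASK] and
# predict them from the full bidirectional context.
rng = np.random.default_rng(0)
mask_pos = rng.choice(len(tokens), size=2, replace=False)
corrupted = [("[MASK]" if i in mask_pos else tok) for i, tok in enumerate(tokens)]
ae_targets = {int(i): tokens[i] for i in mask_pos}

print(ar_examples[2])
print(corrupted, ae_targets)
```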
Syntactic Knowledge
● BERT representations are hierarchical rather
than linear.
○ Open Sesame: Getting Inside BERT’s Linguistic Knowledge
(Lin et al. ACLW 2019)
● BERT “naturally” learns some syntactic information, although it is not very similar to annotated linguistic resources.
○ Perturbed Masking: Parameter-free Probing for Analyzing
and Interpreting BERT (Wu et al. ACL 2020)
A Primer in BERTology: What we know about how BERT works (Rogers et al. TACL 2020)
Generative Pretraining from Pixels
ICML 2020, OpenAI
Towards a general “image” model
● Just as a general LM can generate coherent text, Image GPT can
generate coherent images.
● “Analysis by synthesis” suggests that a model that learns to generate images will also know about object categories.
● Generative sequence modeling is a universal unsupervised algorithm.
Image GPT (https://openai.com/blog/image-gpt/)
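A sketch of the sequence-modeling setup: flatten a (downsampled) image into a 1D sequence of pixel tokens in raster order and train next-token prediction exactly like a language model. iGPT additionally quantizes colors into a small palette, which is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3))   # a low-resolution RGB image

# Flatten in raster order: the image becomes a 1D sequence of pixel tokens.
sequence = image.reshape(-1, 3)                  # (1024, 3)

# Autoregressive training pair at position t: context = all earlier pixels,
# target = the pixel to predict, exactly as in next-word prediction.
t = 100
context, target = sequence[:t], sequence[t]
print(context.shape, target)                     # (100, 3) and one RGB pixel
```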
Approach
Generative Pretraining from Pixels (Chen et al. ICML 2020)
What representation works best?
● In supervised pre-training, representation quality tends to increase
monotonically with depth, but with generative pre-training, it is not
obvious whether a task like pixel prediction is relevant to image
classification.
● Representations first improve as a function of depth, and then,
starting around the middle layer, begin to deteriorate.
○ In the first phase, each position gathers information from its surrounding context in
order to build a more global image representation.
○ In the second phase, this contextualized input is used to solve the conditional next
pixel prediction task.
○ This could resemble the behavior of encoder-decoder architectures, but learned
within a monolithic architecture via a pre-training objective.
Generative Pretraining from Pixels (Chen et al. ICML 2020)
Performance on CIFAR dataset
● We find that both increasing the
scale of our models and training for
more iterations result in better
generative performance, which
directly translates into better
feature quality.
● Generative models produce much
better features than BERT models
after pre-training, but BERT
models catch up after fine-tuning.
Generative Pretraining from Pixels (Chen et al. ICML 2020)
An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
ICLR 2021, Google
When do Transformers work?
● When trained on mid-sized datasets (e.g., ImageNet), Transformers yield modest accuracies, a few percent below ResNets of comparable size.
● However, large-scale training (14M-300M images) trumps the inductive biases of CNNs, such as translation equivariance & locality.
● Naive application of self-attention to images would require that each
pixel attends to every other pixel. With quadratic cost in the number
of pixels, this does not scale to realistic input sizes.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Model overview
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
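A sketch of the tokenization step in the model overview: split the image into 16x16 patches, flatten each patch, project it linearly, prepend a classification token, and add position embeddings (all weights below are random stand-ins for learned parameters):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into N = (H/patch)*(W/patch) flattened patches."""
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches                                   # (N, patch*patch*C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patches = patchify(image)                            # (196, 768): the image as 196 "words"

d_model = 768
W_embed = rng.standard_normal((patches.shape[1], d_model)) / np.sqrt(patches.shape[1])
cls_token = rng.standard_normal((1, d_model))
pos_embed = rng.standard_normal((patches.shape[0] + 1, d_model))

tokens = np.concatenate([cls_token, patches @ W_embed], axis=0) + pos_embed
print(tokens.shape)   # (197, 768), fed to a standard Transformer encoder
```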
Performance
With self-supervised pre-training (masked patch prediction), the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant 2% improvement over training from scratch, but still 4% behind supervised pre-training.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Interpreting the Results
● Positional embeddings
○ We speculate that learning to represent the spatial relations in
this resolution (14 x 14) is equally easy for different strategies.
○ Closer patches tend to have more similar position embeddings.
○ Row-column structure & sinusoidal structure appear.
● Self-attention
○ “Attention distance” is analogous to “receptive field size”.
○ Highly localized attention may serve a similar function as early
convolutional layers in CNNs.
○ Model attends to image regions that are semantically relevant
for classification.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
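The position-embedding observation can be checked with a few lines: compute the cosine similarity of each patch's position embedding to all others and view it as a 14 x 14 grid. The embeddings below are random stand-ins, so the row-column structure only appears with a trained ViT checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
pos_embed = rng.standard_normal((14 * 14, 768))        # one embedding per patch location

# Cosine similarity between every pair of position embeddings.
normed = pos_embed / np.linalg.norm(pos_embed, axis=1, keepdims=True)
similarity = normed @ normed.T                         # (196, 196)

patch_idx = 7 * 14 + 7                                 # a central patch
sim_map = similarity[patch_idx].reshape(14, 14)        # its similarity to every location
print(sim_map.shape)
```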
End-to-End Object Detection
with Transformers
ECCV 2020, Facebook
End-to-end object detection
Object detection as a direct set prediction problem.
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Removing NMS
● Conventional CNN to learn a 2D representation + Positional encoding
● 100 learned positional embeddings as object queries
● Global reasoning using pairwise relations
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
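The direct set prediction view hinges on a one-to-one bipartite matching between the object queries and the ground-truth boxes, which is what makes NMS unnecessary. Below is a minimal sketch of that matching step using the Hungarian algorithm; the cost (L1 box distance minus class probability) is a simplification of DETR's full matching cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, pred_probs, gt_boxes, gt_labels):
    """One-to-one matching between predictions and ground truth (Hungarian algorithm)."""
    # Cost: L1 distance between boxes minus the probability of the correct class.
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cls_cost = -pred_probs[:, gt_labels]
    rows, cols = linear_sum_assignment(box_cost + cls_cost)
    return list(zip(rows, cols))                 # (query index, ground-truth index) pairs

rng = np.random.default_rng(0)
pred_boxes = rng.uniform(size=(100, 4))          # 100 object queries -> 100 box predictions
pred_probs = rng.dirichlet(np.ones(91), 100)     # class probabilities (no-object class omitted)
gt_boxes = rng.uniform(size=(3, 4))              # 3 ground-truth objects in this image
gt_labels = np.array([5, 17, 42])
print(match_predictions(pred_boxes, pred_probs, gt_boxes, gt_labels))
```

In DETR, the matched pairs receive box and class losses, while the unmatched queries are supervised toward the "no object" class.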
Encoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Decoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Object Detection
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Additional Works
Notable Extensions
● Training data-efficient image transformers & distillation through
attention (Touvron et al. Arxiv 2021)
○ Adds a distillation token to ViT; using only the classification token doesn’t help much.
○ Soft distillation (teacher model’s softmax output) and hard distillation (teacher model’s argmax with label smoothing).
○ Surpasses SOTA yet again.
● DALL·E: Creating Images from Text (Ramesh et al. 2021)
○ Decoder-only transformer that receives both the text and the image as a single
stream of tokens (Text: 256, Image: 1024) and models all of them autoregressively.
○ Creates images from text captions for a wide range of concepts expressible in natural
language.
Task-specific: Object Detection
● End-to-End Object Detection with Adaptive Clustering Transformer
(Zheng et al. Arxiv 2020)
○ ACT clusters the query features adaptively using Locality-Sensitive Hashing (LSH) and approximates the query-key interaction with a prototype-key interaction.
○ ACT can replace the original self-attention module in DETR without degrading the performance of the pre-trained DETR model.
● Deformable DETR: Deformable Transformers for End-to-End Object
Detection (Zhu et al. ICLR 2021)
○ Deformable DETR achieves better performance than DETR (especially on small objects) with 10× fewer training epochs.
○ Deformable attention module: attends to only a small set of prominent feature-map pixels and aggregates multi-scale features.
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Task-specific: Object Detection
● UP-DETR: Unsupervised Pre-training for Object Detection with
Transformers (Dai et al. Arxiv 2020)
○ Proposes a pretext task, random query patch detection, for unsupervised pre-training of DETR (UP-DETR) for object detection.
● Rethinking Transformer-based Set Prediction for Object Detection
(Sun et al. Arxiv 2020)
○ Encoder-only DETR significantly accelerates training for small-object detection, as it removes cross-attention.
○ Feature generation for transformer encoders with FCOS (Fully Convolutional
One-Stage object detector) or RCNN
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Task-specific: Segmentation
● MaX-DeepLab: End-to-End Panoptic Segmentation with Mask
Transformers (Wang et al. Arxiv 2020)
○ Infers masks and classes directly without hand-coded priors like object boxes.
○ Dual-path transformer enables CNNs to read and write a global memory at any layer.
● End-to-End Video Instance Segmentation with Transformers (Wang
et al. Arxiv 2020)
○ Three dimensional (temporal, horizontal and vertical) positional encoding
○ Instance sequence matching strategy: the loss is applied across different time steps
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Additional Tasks
● Learning Joint Spatial-Temporal Transformations for Video
Inpainting (Zeng et al. ECCV 2020)
● End-to-End Dense Video Captioning with Masked Transformer (Zhou
et al. CVPR 2018)
● Hand-Transformer: Non-Autoregressive Structured Modeling for 3D
Hand Pose Estimation (Huang et al. ECCV 2020)
● Taming Transformers for High-Resolution Image Synthesis (Esser et
al. Arxiv 2020)
● Pre-Trained Image Processing Transformer (Chen et al. Arxiv 2020)
○ ImageNet pre-training for image denoising/superresolution
A Survey on Visual Transformer (Han et al. Arxiv 2021)
