236781
Lecture 9
Deep learning
Transformers for Vision Tasks
Dr. Chaim Baskin
Transformers for image modeling
Transformers for image modeling: Outline
• Architectures
• Self-supervised learning
Previous approaches
1. On pixels, but locally or factorized
Usually replaces 3x3 conv in ResNet:
Image credit: Stand-Alone Self-Attention in Vision Models by Ramachandran et al.
Image credit: Local Relation Networks for Image Recognition by Hu et al.
Examples:
Non-local NN (Wang et al. 2017)
SASANet (Stand-Alone Self-Attention in Vision Models)
HaloNet (Scaling Local Self-Attn for Parameter Efficient...)
LR-Net (Local Relation Networks for Image Recognition)
SANet (Exploring Self-attention for Image Recognition)
...
Results:
Are usually "meh", nothing to write home about
Do not justify the increased complexity
Do not justify the slowdown over convolutions
Many prior works attempted to introduce self-attention at the pixel level.
For a 224x224 image, that is a sequence length of ~50k pixels, too much!
Previous approaches
2. Globally, after/inside a full-blown CNN, or even detector/segmenter!
Cons:
the result is a highly complex, often multi-stage-trained architecture.
not trained from pixels, i.e., the transformer can't "learn to fix" the (often frozen!) CNN's mistakes.
Examples:
DETR (Carion, Massa et al. 2020)
Visual Transformers (Wu et al. 2020)
UNITER (Chen, Li, Yu et al. 2019)
ViLBERT (Lu et al. 2019)
VisualBERT (Li et al. 2019)
etc...
Image credit: UNITER: UNiversal Image-TExt Representation Learning by Chen et al.
Image credit: Visual Transformers: Token-based Image Representation and Processing for Computer Vision by Wu et al.
• Split an image into patches and feed the linearly projected patches into a standard transformer encoder
• With 14x14-pixel patches, you need 16x16 = 256 patches to represent a 224x224 image
Vision transformer (ViT) – Google
A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
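To make the patchify step concrete, here is a minimal PyTorch sketch (class name, embedding dimension, and batch size are illustrative, not the authors' code). A stride-p pxp convolution is a standard way to implement the linear projection of flattened patches.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Minimal sketch: split a 224x224 image into 14x14-pixel patches
    # (16x16 = 256 patches) and linearly project each one to an embedding vector.
    def __init__(self, img_size=224, patch_size=14, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 16 * 16 = 256
        # A stride-p pxp convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, embed_dim, 16, 16)
        return x.flatten(2).transpose(1, 2)  # (B, 256, embed_dim)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))   # (2, 256, 768)
# Prepend a learnable [CLS] token, add positional embeddings, and feed the
# resulting sequence to a standard transformer encoder.
```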
Vision transformer (ViT)
BiT: Big Transfer (ResNet)
ViT: Vision Transformer (Base/Large/Huge,
patch size of 14x14, 16x16, or 32x32)
Pre-trained on JFT-300M, an internal Google dataset (not public)
A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
• Trained in a supervised fashion, fine-tuned on ImageNet
Vision transformer (ViT)
A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
• Transformers are claimed to be more computationally
efficient than CNNs or hybrid architectures
Hierarchical transformer: Swin
Z. Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021
Hierarchical transformer: Swin
Z. Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021
Hierarchical transformer: Swin
Z. Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021
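A rough sketch of Swin's core trick, assuming a 56x56x96 stage-1 feature map and window size 7 (sizes are illustrative): self-attention is computed inside non-overlapping local windows, and alternating blocks cyclically shift the feature map by half a window so neighboring windows exchange information.

```python
import torch

def window_partition(x, window_size):
    # Split a (B, H, W, C) feature map into non-overlapping windows;
    # attention is then computed within each window, not over all tokens.
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 56, 56, 96)                      # assumed stage-1 feature map
windows = window_partition(x, window_size=7)        # (64, 49, 96): 64 windows of 49 tokens

# Shifted windows: roll the map by half a window before partitioning, so the
# next block's windows straddle the previous block's window boundaries.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size=7)
```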
Swin results
COCO detection and segmentation
Beyond transformers?
I. Tolstikhin et al. MLP-Mixer: An all-MLP Architecture for Vision. NeurIPS 2021
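For intuition about the "all-MLP" alternative, a minimal sketch of one Mixer block (token count and widths are assumed, not the paper's exact configuration): a token-mixing MLP acts across patches and a channel-mixing MLP acts across features, with no self-attention anywhere.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    # Sketch of an MLP-Mixer block: mix across tokens, then across channels.
    def __init__(self, num_tokens=256, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                                 # x: (B, num_tokens, dim)
        # Token mixing: transpose so the MLP runs over the patch dimension.
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing: a per-token MLP over the feature dimension.
        return x + self.channel_mlp(self.norm2(x))

y = MixerBlock()(torch.randn(2, 256, 512))                # (2, 256, 512)
```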
Hybrid of CNNs and transformers?
T. Xiao et al. Early convolutions help transformers see better. NeurIPS 2021
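Xiao et al.'s observation is that ViT's single large-kernel, large-stride "patchify" convolution is an atypical stem, and that replacing it with a few stacked stride-2 3x3 convolutions makes training markedly more stable. A hedged sketch of the two stems (channel widths are assumptions, not the paper's exact values):

```python
import torch.nn as nn

# ViT's "patchify" stem: one stride-16 16x16 convolution straight to the embedding dim.
patchify_stem = nn.Conv2d(3, 768, kernel_size=16, stride=16)

# Convolutional stem: stacked stride-2 3x3 convolutions reaching the same 16x
# downsampling, followed by a 1x1 convolution to match the embedding dimension.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 48, 3, stride=2, padding=1), nn.BatchNorm2d(48), nn.ReLU(),
    nn.Conv2d(48, 96, 3, stride=2, padding=1), nn.BatchNorm2d(96), nn.ReLU(),
    nn.Conv2d(96, 192, 3, stride=2, padding=1), nn.BatchNorm2d(192), nn.ReLU(),
    nn.Conv2d(192, 384, 3, stride=2, padding=1), nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 768, 1),
)
# Either stem maps a (B, 3, 224, 224) image to a (B, 768, 14, 14) token grid.
```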
Or completely back to CNNs?
Z. Liu et al. A ConvNet for the 2020s. CVPR 2022
Back to CNNs?
Z. Liu et al. A ConvNet for the 2020s. CVPR 2022
Back to CNNs?
Z. Liu et al. A ConvNet for the 2020s. CVPR 2022
Outline
• Architectures
• Self-supervised learning
DINO: Self-distillation with no labels
M. Caron et al. Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021
DINO
M. Caron et al. Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021
DINO
M. Caron et al. Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021
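A minimal sketch of the DINO objective (temperatures and names are assumed defaults): the student is trained to match the centered, sharpened softmax output of an EMA teacher evaluated on a different augmented view; no labels are used anywhere.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # Cross-entropy between the teacher's centered+sharpened distribution (one view)
    # and the student's distribution (another view). Gradients flow only to the student.
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

# The teacher is not trained by backprop; its weights are an exponential moving
# average of the student's: theta_t <- m * theta_t + (1 - m) * theta_s.
# The center is an EMA of teacher outputs, which helps prevent collapse.
K = 65536                                     # assumed output dimension
loss = dino_loss(torch.randn(8, K), torch.randn(8, K), center=torch.zeros(K))
```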
Masked autoencoders
K. He et al. Masked autoencoders are scalable vision learners. CVPR 2022
Masked autoencoders
K. He et al. Masked autoencoders are scalable vision learners. CVPR 2022
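A minimal sketch of MAE's masking step (the 75% mask ratio is the paper's default; the helper itself is illustrative): a random 25% of patch tokens are kept and encoded with a ViT, and a lightweight decoder reconstructs the pixels of the masked patches.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    # Keep a random subset of patch tokens; only these are fed to the encoder.
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                              # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]         # indices of visible patches
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx

visible, keep_idx = random_masking(torch.randn(2, 256, 768))   # visible: (2, 64, 768)
# The reconstruction loss is the mean squared error between predicted and original
# pixels, computed only on the masked patches.
```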
Masked autoencoders: Results
K. He et al. Masked autoencoders are scalable vision learners. CVPR 2022
Application: Visual prompting
A. Bar et al. Visual prompting via image inpainting. NeurIPS 2022
Application: Visual prompting
Application: Visual prompting
Application: Visual prompting


Editor's Notes

  • #6 DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class. (A minimal code sketch of this pipeline appears after these notes.)
  • #7 Abstract: In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77.
  • #11 Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
  • #12 Patches of size 14x14 = 16x16 patches to represent 224x224 images (sequence length = 256)
  • #19 Abstract: Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p = 16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ∼1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models in this regime as a more robust architectural choice compared to the original ViT model design.
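As referenced in note #6, here is a hedged sketch of the DETR pipeline (backbone choice, feature-map size, and head shapes are assumptions; this is not the authors' implementation):

```python
import torch
import torch.nn as nn
import torchvision

class DETRSketch(nn.Module):
    # Sketch of DETR: CNN features -> flatten + positional encoding -> transformer
    # encoder-decoder with learned object queries -> class and box heads.
    def __init__(self, num_classes=91, hidden_dim=256, num_queries=100):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.proj = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nhead=8, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)     # object queries
        self.pos_embed = nn.Parameter(torch.randn(49, hidden_dim))   # assumes a 7x7 map
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)     # +1 for "no object"
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):                        # images: (B, 3, 224, 224)
        feats = self.proj(self.backbone(images))      # (B, 256, 7, 7)
        src = feats.flatten(2).transpose(1, 2) + self.pos_embed      # (B, 49, 256)
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, queries)           # (B, 100, 256)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```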