Transformer in Computer Vision
Dongmin Choi
Deepnoid

Yonsei University Translational Artificial Intelligence Lab
Contents
1. Transformer
2. DETR[1]
3. ViT[2]
[1] Carion et al. End-to-End Object Detection with Transformers. ECCV 2020

[2] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
Transformer
[1] Vaswani et al. Attention is All You Need. NIPS 2017

[2] Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019

[3] Radford et al. Improving Language Understanding by Generative Pre-Training. Technical Report 2018
- Transformer[1]
- Standard architecture for NLP tasks
- Built entirely from attention modules, with no RNN
- Encoder-decoder structure
- Requires large-scale datasets and high computational cost
- The pre-training and fine-tuning approach is dominant
: BERT[2] and GPT[3]
Transformer
https://www.youtube.com/watch?v=mxGCEWOxfe8
RNN-based Encoder-Decoder
RNN-based Encoder-Decoder with attention
Transformer
https://www.youtube.com/watch?v=mxGCEWOxfe8
Still slow because of the RNN
And performance is still not perfect
Can we remove the RNN?
RNN-based Encoder-Decoder with attention
Transformer
https://www.youtube.com/watch?v=mxGCEWOxfe8
Yes, Attention is All We Need
Transformer
https://www.youtube.com/watch?v=mxGCEWOxfe8
Positional Encoding
Word order is important
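Below is a minimal PyTorch sketch of the sinusoidal positional encoding from Vaswani et al., which injects word-order information by adding position-dependent sine/cosine signals to the token embeddings; the max_len and d_model values in the example are chosen purely for illustration.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding from 'Attention is All You Need'.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)                   # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # added to the token embeddings before the first encoder layer

# e.g., encode word order for a 10-token sentence with 512-dim embeddings
pe = sinusoidal_positional_encoding(max_len=10, d_model=512)   # shape (10, 512)
```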
Transformer
Vaswani et al. Attention is All You Need. NIPS 2017
- Encoder: 6 layers with self-attention and FFN
- Decoder: 6 layers with self-attention,
encoder-decoder attention, and FFN
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
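A minimal PyTorch sketch of the scaled dot-product attention above; the tensor shapes in the usage example are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    weights = F.softmax(scores, dim=-1)                  # weights over keys sum to 1
    return weights @ V                                   # (..., len_q, d_v)

# e.g., one head: 5 query tokens attending over 7 key/value tokens, d_k = d_v = 64
Q, K, V = torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (5, 64)
```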
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
DETR (DEtection TRansformer)
- Eliminates the need for many hand-crafted components
(e.g., anchor generation, rule-based training target assignment, NMS)
- The first fully end-to-end object detector
- A simple architecture (CNN + Transformer encoder-decoder); a minimal sketch follows
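A simplified sketch of the DETR-style pipeline (CNN backbone, Transformer encoder-decoder, per-query class/box heads). Layer sizes, the number of object queries, and the omission of positional encodings and the Hungarian matching loss are simplifications for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class MiniDETR(nn.Module):
    """Simplified DETR-style detector: CNN features -> Transformer -> per-query class/box heads."""
    def __init__(self, num_classes, d_model=256, num_queries=100):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)             # 2048-ch ResNet-50 output
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)           # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)           # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                           # (cx, cy, w, h), normalized

    def forward(self, images):                                          # images: (B, 3, H, W)
        feat = self.proj(self.backbone(images))                         # (B, d, H/32, W/32)
        B, d, h, w = feat.shape
        src = feat.flatten(2).transpose(1, 2)                           # (B, h*w, d) token sequence
        tgt = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)    # (B, num_queries, d)
        hs = self.transformer(src, tgt)                                 # (B, num_queries, d)
        return self.class_head(hs), self.box_head(hs).sigmoid()

# e.g., 91 COCO classes, one 640x640 image
model = MiniDETR(num_classes=91)
logits, boxes = model(torch.randn(1, 3, 640, 640))   # (1, 100, 92), (1, 100, 4)
```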
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Comparison with Faster R-CNN on COCO val
* DC = Dilated Convolution
* R101 = ResNet-101 backbone
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
DETR for panoptic segmentation
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Visualizing Encoder Attention
The encoder is able to separate individual instances
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Visualizing Decoder Attention
The decoder typically attends to object extremities, such as legs and heads
DETR
[1] Carion et al. End-to-End Object Detection with Transformers. ECCV 2020

[2] Zhu et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv:2010.04159
DETR (DEtection TRansformer)[1]
Pros
- Eliminates the need for many hand-crafted components
(e.g., anchor generation, rule-based training target assignment, NMS)
- The first fully end-to-end object detector
- A simple architecture (CNN + Transformer encoder-decoder)
Cons[2]
- Requires a long training schedule (many epochs)
- Limited feature spatial resolution
: high-resolution feature maps lead to unacceptable complexity
Deformable DETR[1]
[1] Zhu et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv:2010.04159

[2] Dai et al. Deformable Convolutional Networks. ICCV 2017
- Applies the concept of deformable convolution[2] to mitigate the problems of DETR
- Introduces the deformable attention module, an efficient attention mechanism for
processing feature maps
- Requires only 50 training epochs (1/10 of the original DETR)
- Achieves S.O.T.A using a two-stage variant, analogous to R-CNN models
Deformable DETR[1]
[1] Lin et al. Feature Pyramid Networks for Object Detection. CVPR 2017

[2] Dai et al. Deformable Convolutional Networks. ICCV 2017
- Most modern object detectors follow the multi-scale design of FPN[1]
- Applying this concept to DETR was impractical because
of its computational complexity
- Deformable convolution[2] is a powerful and efficient
mechanism for attending to sparse spatial locations
- By applying the idea of deformable convolution, Deformable DETR can benefit from
multi-scale feature maps with relatively small model complexity (a simplified sketch follows)
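The sketch below is a heavily simplified, single-scale, single-head version of the deformable-attention idea (the paper uses multiple heads and multiple feature levels, and parameterizes offsets in pixel units): each query predicts a few sampling offsets around its reference point and attends only to those sampled locations, instead of all H*W positions. All module and parameter names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Simplified single-scale, single-head deformable attention (a sketch, not the paper's module)."""
    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_head = nn.Linear(d_model, n_points * 2)   # predicted (dx, dy) per sampling point
        self.weight_head = nn.Linear(d_model, n_points)       # attention weight per sampling point
        self.value_proj = nn.Conv2d(d_model, d_model, 1)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, feat):
        # queries: (B, N, d), ref_points: (B, N, 2) in [0, 1], feat: (B, d, H, W)
        B, N, d = queries.shape
        value = self.value_proj(feat)
        offsets = self.offset_head(queries).view(B, N, self.n_points, 2)   # normalized offsets (simplification)
        weights = self.weight_head(queries).softmax(-1)                    # (B, N, n_points)
        # sampling locations mapped to the [-1, 1] grid expected by grid_sample
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1      # (B, N, n_points, 2)
        sampled = F.grid_sample(value, loc, align_corners=False)           # (B, d, N, n_points), bilinear
        out = (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)     # (B, N, d)
        return self.out_proj(out)

# e.g., 100 queries over a 256-channel, 32x32 feature map
attn = SimpleDeformableAttention()
q, ref = torch.randn(2, 100, 256), torch.rand(2, 100, 2)
out = attn(q, ref, torch.randn(2, 256, 32, 32))   # (2, 100, 256)
```

Because each query touches only n_points sampled values, the cost scales with the number of queries and sampling points rather than with the full feature-map size, which is what makes multi-scale feature maps affordable.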
Deformable DETR[1]
Comparison with Faster R-CNN and DETR on COCO val
Deformable DETR[1]
Comparison with S.O.T.A on COCO test-dev
ViT
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
ViT (Vision Transformer)
- Interprets an image as a sequence of patches (see the patch-embedding sketch below)
- Uses a standard Transformer encoder
- Matches or exceeds S.O.T.A while being substantially cheaper to pre-train
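A minimal sketch of the ViT input pipeline: split the image into 16x16 patches, linearly embed each patch, prepend a learnable [class] token, and add learned position embeddings. The default sizes (224x224 input, 768-dim embeddings) follow the ViT-Base configuration, but the class name and code structure are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and linearly embed each patch (ViT-style input)."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, d_model=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # a strided conv is equivalent to flattening each patch and applying a linear layer
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))                # [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))  # learned positions

    def forward(self, x):                                                  # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)                   # (B, 196, d_model)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed            # (B, 197, d_model)

# the resulting sequence is fed to a standard Transformer encoder;
# classification reads the output at the [class] token position
embed = PatchEmbedding()
seq = embed(torch.randn(2, 3, 224, 224))   # (2, 197, 768)
```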
ViT
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
Comparison with S.O.T.A
- ViT pre-trained on the JFT-300M dataset matches or outperforms ResNet-based baselines
- Requires substantially fewer computational resources to pre-train
ViT
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
Transfer to ImageNet
- When pre-trained on small datasets, ViT models perform worse than ResNet-based models
- However, when pre-trained on larger datasets, ViT models shine and even
outperform ResNet-based baselines
Conclusion
- Transformers are being actively applied to computer vision
- DETR[1] and ViT[2] showed promising results
- Many challenges still remain
1) attention requires a large number of parameters (models are heavy)
2) extending to other computer vision tasks (e.g., segmentation, localization, depth estimation,
image generation, video)
[1] Carion et al. End-to-End Object Detection with Transformers. ECCV 2020

[2] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
Thank you
