Transformer in Computer Vision
Dongmin Choi
Deepnoid

Yonsei University Translational Artificial Intelligence Lab
Contents
1. Transformer
2. DETR[1]
3. ViT[2]
[1] Carion et al. End-to-End Object Detection with Transformers. ECCV 2020

[2] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
Transformer
[1] Vaswani et al. Attention is All You Need. NIPS 2017

[2] Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019

[3] Radford et al. Improving Language Understanding by Generative Pre-Training. Technical Report 2018
- Transformer[1]
- Standard architecture for NLP tasks
- Built entirely from attention modules, with no RNN
- Encoder-decoder structure
- Requires large-scale datasets and high computational cost
- The pre-training and fine-tuning approach is dominant
: BERT[2] and GPT[3]
Transformer
https://www.youtube.com/watch?v=mxGCEWOxfe8
RNN-based Encoder-Decoder
RNN-based Encoder-Decoder with attention
Transformer
https://www.youtube.com/watch?v=mxGCEWOxfe8
Still slow because of the RNN
And performance is still not perfect
Can we remove the RNN?
RNN-based Encoder-Decoder with attention
Transformer
https://www.youtube.com/watch?v=mxGCEWOxfe8
Yes, Attention is All We Need
Transformer
https://www.youtube.com/watch?v=mxGCEWOxfe8
Positional Encoding
Word order is important
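Below is a minimal PyTorch sketch of the sinusoidal positional encoding from Vaswani et al., which injects word-order information by adding position-dependent sine/cosine signals to the token embeddings; the max_len and d_model values in the example are chosen purely for illustration.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding from 'Attention is All You Need'.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)                   # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # added to the token embeddings before the first encoder layer

# e.g., encode word order for a 10-token sentence with 512-dim embeddings
pe = sinusoidal_positional_encoding(max_len=10, d_model=512)   # shape (10, 512)
```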
Transformer
Vaswani et al. Attention is All You Need. NIPS 2017
- Encoder: 6 layers with self-attention and FFN
- Decoder: 6 layers with self-attention,
encoder-decoder attention, and FFN
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
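A minimal PyTorch sketch of the scaled dot-product attention above; the tensor shapes in the usage example are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    weights = F.softmax(scores, dim=-1)                  # weights over keys sum to 1
    return weights @ V                                   # (..., len_q, d_v)

# e.g., one head: 5 query tokens attending over 7 key/value tokens, d_k = d_v = 64
Q, K, V = torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (5, 64)
```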
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
DETR (DEtection TRansformer)
- Eliminates the need for many hand-crafted components
(e.g., anchor generation, rule-based training target assignment, NMS)
- The first fully end-to-end object detector
- A simple architecture (CNN + Transformer encoder-decoder); a minimal sketch follows
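A simplified sketch of the DETR-style pipeline (CNN backbone, Transformer encoder-decoder, per-query class/box heads). Layer sizes, the number of object queries, and the omission of positional encodings and the Hungarian matching loss are simplifications for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class MiniDETR(nn.Module):
    """Simplified DETR-style detector: CNN features -> Transformer -> per-query class/box heads."""
    def __init__(self, num_classes, d_model=256, num_queries=100):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)             # 2048-ch ResNet-50 output
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)           # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)           # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                           # (cx, cy, w, h), normalized

    def forward(self, images):                                          # images: (B, 3, H, W)
        feat = self.proj(self.backbone(images))                         # (B, d, H/32, W/32)
        B, d, h, w = feat.shape
        src = feat.flatten(2).transpose(1, 2)                           # (B, h*w, d) token sequence
        tgt = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)    # (B, num_queries, d)
        hs = self.transformer(src, tgt)                                 # (B, num_queries, d)
        return self.class_head(hs), self.box_head(hs).sigmoid()

# e.g., 91 COCO classes, one 640x640 image
model = MiniDETR(num_classes=91)
logits, boxes = model(torch.randn(1, 3, 640, 640))   # (1, 100, 92), (1, 100, 4)
```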
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Comparison with Faster R-CNN on COCO val
* DC = Dilated Convolution
* R101 = ResNet-101 backbone
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
DETR for panoptic segmentation
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Visualizing Encoder Attention
The encoder is able to separate individual instances
DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Visualizing Decoder Attention
The decoder typically attends to object extremities, such as legs and heads
DETR
[1] Carion et al. End-to-End Object Detection with Transformers. ECCV 2020

[2] Zhu et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv:2010.04159
DETR (DEtection TRansformer)[1]
Pros
- Eliminates the need for many hand-crafted components
(e.g., anchor generation, rule-based training target assignment, NMS)
- The first fully end-to-end object detector
- A simple architecture (CNN + Transformer encoder-decoder)
Cons[2]
- Requires a long training schedule (many epochs)
- Limited feature spatial resolution
: high-resolution feature maps lead to unacceptable complexity
Deformable DETR[1]
[1] Zhu et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv:2010.04159

[2] Dai et al. Deformable Convolutional Networks. ICCV 2017
- Applies the concept of deformable convolution[2] to mitigate the problems of DETR
- Introduces the deformable attention module, an efficient attention mechanism for
processing feature maps
- Requires only 50 training epochs (1/10 of the original DETR)
- Achieves S.O.T.A using a two-stage variant, analogous to R-CNN models
Deformable DETR[1]
[1] Lin et al. Feature Pyramid Networks for Object Detection. CVPR 2017

[2] Dai et al. Deformable Convolutional Networks. ICCV 2017
- Most modern object detectors follow the multi-scale design of FPN[1]
- Applying this concept to DETR was impractical because
of its computational complexity
- Deformable convolution[2] is a powerful and efficient
mechanism for attending to sparse spatial locations
- By applying the idea of deformable convolution, Deformable DETR can benefit from
multi-scale feature maps with relatively small model complexity (a simplified sketch follows)
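The sketch below is a heavily simplified, single-scale, single-head version of the deformable-attention idea (the paper uses multiple heads and multiple feature levels, and parameterizes offsets in pixel units): each query predicts a few sampling offsets around its reference point and attends only to those sampled locations, instead of all H*W positions. All module and parameter names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Simplified single-scale, single-head deformable attention (a sketch, not the paper's module)."""
    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_head = nn.Linear(d_model, n_points * 2)   # predicted (dx, dy) per sampling point
        self.weight_head = nn.Linear(d_model, n_points)       # attention weight per sampling point
        self.value_proj = nn.Conv2d(d_model, d_model, 1)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, feat):
        # queries: (B, N, d), ref_points: (B, N, 2) in [0, 1], feat: (B, d, H, W)
        B, N, d = queries.shape
        value = self.value_proj(feat)
        offsets = self.offset_head(queries).view(B, N, self.n_points, 2)   # normalized offsets (simplification)
        weights = self.weight_head(queries).softmax(-1)                    # (B, N, n_points)
        # sampling locations mapped to the [-1, 1] grid expected by grid_sample
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1      # (B, N, n_points, 2)
        sampled = F.grid_sample(value, loc, align_corners=False)           # (B, d, N, n_points), bilinear
        out = (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)     # (B, N, d)
        return self.out_proj(out)

# e.g., 100 queries over a 256-channel, 32x32 feature map
attn = SimpleDeformableAttention()
q, ref = torch.randn(2, 100, 256), torch.rand(2, 100, 2)
out = attn(q, ref, torch.randn(2, 256, 32, 32))   # (2, 100, 256)
```

Because each query touches only n_points sampled values, the cost scales with the number of queries and sampling points rather than with the full feature-map size, which is what makes multi-scale feature maps affordable.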
Deformable DETR[1]
Comparison with Faster R-CNN and DETR on COCO val
Deformable DETR[1]
Comparison with S.O.T.A on COCO test-dev
ViT
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
ViT (Vision Transformer)
- Interprets an image as a sequence of patches (see the patch-embedding sketch below)
- Uses a standard Transformer encoder
- Matches or exceeds S.O.T.A while being substantially cheaper to pre-train
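A minimal sketch of the ViT input pipeline: split the image into 16x16 patches, linearly embed each patch, prepend a learnable [class] token, and add learned position embeddings. The default sizes (224x224 input, 768-dim embeddings) follow the ViT-Base configuration, but the class name and code structure are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and linearly embed each patch (ViT-style input)."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, d_model=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # a strided conv is equivalent to flattening each patch and applying a linear layer
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))                # [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))  # learned positions

    def forward(self, x):                                                  # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)                   # (B, 196, d_model)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed            # (B, 197, d_model)

# the resulting sequence is fed to a standard Transformer encoder;
# classification reads the output at the [class] token position
embed = PatchEmbedding()
seq = embed(torch.randn(2, 3, 224, 224))   # (2, 197, 768)
```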
ViT
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
Comparison with S.O.T.A
- ViT pre-trained on the JFT-300M dataset matches or outperforms ResNet-based baselines
- Requires substantially fewer computational resources to pre-train
ViT
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
Transfer to ImageNet
- When pre-trained on small datasets, ViT models perform worse than ResNet-based models
- However, when pre-trained on larger datasets, ViT models shine and even
outperform ResNet-based baselines
Conclusion
- Transformers are being actively applied to computer vision
- DETR[1] and ViT[2] showed promising results
- Many challenges still remain
1) attention requires a large number of parameters (models are heavy)
2) extending to other computer vision tasks (e.g., segmentation, localization, depth estimation,
image generation, video)
[1] Carion et al. End-to-End Object Detection with Transformers. ECCV 2020

[2] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
Thank you
