
ViT (Vision Transformer) Review [CDM]

Review: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Paper link: https://openreview.net/forum?id=YicbFdNTTy



  1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Anonymous (ICLR 2021, under review). Presented by Choi Dongmin, Yonsei University Severance Hospital CCIDS.
  2. Abstract
 • Transformer
  - the standard architecture for NLP
 • Convolutional networks
  - attention is applied while keeping their overall structure
 • Transformer in computer vision
  - a pure Transformer can perform very well on image classification tasks when applied directly to sequences of image patches
  - achieves S.O.T.A. with small computational cost when pre-trained on a large dataset
  3. Introduction
 • Transformer, BERT: self-attention-based architectures
 • The dominant approach: pre-training on a large text corpus, then fine-tuning on a smaller task-specific dataset
 Vaswani et al. Attention Is All You Need. NIPS 2017
  4. Introduction
 • Self-attention in CV, inspired by NLP: DETR, Axial-DeepLab
 • However, classic ResNet-like architectures are still S.O.T.A.
 Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
 Wang et al. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020
  5. Introduction
 • Applying a Transformer directly to images
  - with the fewest possible modifications
  - provide the sequence of linear embeddings of the patches as input
  - image patches = tokens (words) in NLP
 • Small-scale training
  - achieves accuracies below ResNets of comparable size
  - Transformers lack some inductive biases inherent to CNNs (such as translation equivariance and locality)
 • Large-scale training
  - trumps (surpasses) inductive bias
  - excellent results when pre-trained at sufficient scale and transferred
  6. Related Works
 • Transformer
  - the standard model in NLP tasks
  - consists only of attention modules, without RNNs
  - encoder-decoder architecture
  - requires a large-scale dataset and high computational cost
  - pre-training and fine-tuning approaches: BERT & GPT
 Vaswani et al. Attention Is All You Need. NIPS 2017
 Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019
 Radford et al. Improving Language Understanding with Unsupervised Learning. Technical Report 2018
  7. Method
  8. Method
 • Image x ∈ R^(H×W×C) → a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C))
 • A trainable linear projection E maps x_p ∈ R^(N×(P²·C)) → x_p E ∈ R^(N×D)
  * because the Transformer uses a constant width, the model dimension D, through all of its layers
 • A learnable position embedding E_pos ∈ R^((N+1)×D) is added to retain positional information, giving the input sequence z_0
 https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L99-L111
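The patch-embedding step on this slide can be sketched in NumPy (the repo linked above implements it as a PyTorch module; the function and argument names here are illustrative, not from that code):

```python
import numpy as np

def patch_embed(img, P, E, E_pos, cls_token):
    """Split an (H, W, C) image into P x P patches, flatten each to a
    P^2*C vector, project to model width D with E, prepend the [class]
    token, and add position embeddings, yielding z_0 in R^{(N+1) x D}."""
    H, W, C = img.shape
    # (H//P, W//P, P, P, C): a grid of N = HW/P^2 patches
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    x_p = patches.reshape(-1, P * P * C)          # (N, P^2*C) flattened patches
    tokens = x_p @ E                              # (N, D) linear projection
    z0 = np.vstack([cls_token, tokens]) + E_pos   # (N+1, D) input sequence
    return z0
```

For a 4 × 4 × 3 toy image with P = 2, this produces N = 4 patch tokens plus the class token.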
  9. Method (code: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py)
  10. Method (code: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py)
  11. Method (code: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py)
 • z ∈ R^(N×D): input sequence
 • Attention weight A_ij: similarity between q_i and k_j
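The attention weights described on this slide can be sketched as single-head self-attention in NumPy (the linked repo uses a multi-head PyTorch module; weight matrices are plain arrays here for illustration):

```python
import numpy as np

def self_attention(z, Wq, Wk, Wv):
    """Single-head self-attention over z in R^{N x D}. A[i, j] is the
    softmax-normalized similarity between query q_i and key k_j; the
    output is the attention-weighted sum of the values."""
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    d_h = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_h)                   # pairwise q_i . k_j, scaled
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A = A / A.sum(-1, keepdims=True)                  # softmax over keys j
    return A @ v, A                                   # (N, d_h) output, (N, N) weights
```

Each row of A sums to 1, so every output token is a convex combination of the value vectors.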
  12. Method (code: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py)
  13. Method (code: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py)
  14. Method: Hybrid Architecture
 • Flattened intermediate feature maps of a ResNet as the input sequence, like DETR
 Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
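The hybrid variant's input construction can be sketched as follows, assuming an (h, w, c) CNN feature map (the function name and the separate projection matrix E are illustrative):

```python
import numpy as np

def featmap_to_tokens(fmap, E):
    """Hybrid-variant sketch: flatten a CNN (e.g. ResNet) feature map of
    shape (h, w, c) into a sequence of h*w tokens and project each to the
    model width D, in place of embedding raw image patches."""
    h, w, c = fmap.shape
    seq = fmap.reshape(h * w, c)   # spatial grid -> token sequence
    return seq @ E                 # (h*w, D) projected tokens
```

A 7 × 7 feature map thus yields a 49-token sequence, regardless of the original image resolution.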
  15. Method: Fine-tuning and Higher Resolution
 • Remove the pre-trained prediction head and attach a zero-initialized D × K feedforward layer (K = the number of downstream classes)
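The head replacement can be sketched in NumPy (function names are illustrative; the paper's code is not shown on the slide):

```python
import numpy as np

def replace_head(D, K):
    """Fine-tuning sketch: a zero-initialized D x K feedforward layer
    replacing the pre-trained prediction head, for K downstream classes."""
    return np.zeros((D, K)), np.zeros(K)   # weights, bias

def head_forward(features, W, b):
    """Logits for (N, D) features through the new head."""
    return features @ W + b                # (N, K)
```

Zero initialization means every downstream class starts with identical zero logits, so fine-tuning begins from a neutral prediction rather than random guesses.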
  16. Experiments
 • Datasets
  < Pre-training >
  - ILSVRC-2012 ImageNet: 1k classes / 1.3M images
  - ImageNet-21k: 21k classes / 14M images
  - JFT: 18k classes / 303M images
  < Downstream (fine-tuning) >
  - ImageNet, ImageNet-ReaL, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB
 • Model variants
  - e.g. ViT-L/16 = the "Large" variant with a 16 × 16 input patch size
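The variant naming also fixes the sequence length: for ViT-L/16 on a 224 × 224 input, N = (224/16)² = 196 patch tokens. A quick check, assuming the image side is divisible by the patch size:

```python
def num_patches(H, W, P):
    """Number of P x P patches an H x W image is split into (N = HW / P^2)."""
    return (H // P) * (W // P)
```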
  17. Experiments
 • Training & fine-tuning
  < Pre-training >
  - Adam with β1 = 0.9, β2 = 0.999
  - batch size 4,096
  - weight decay 0.1 (high weight decay is useful for transfer models)
  - linear learning rate warmup and decay
  < Fine-tuning >
  - SGD with momentum, batch size 512
 • Metrics
  - few-shot accuracy (for fast on-the-fly evaluation)
  - fine-tuning accuracy
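The hyperparameters on this slide, gathered into plain Python dicts for reference (the dict layout is illustrative, not taken from the paper's code):

```python
# Pre-training and fine-tuning settings as reported in the review.
pretrain = {
    "optimizer": "Adam",
    "betas": (0.9, 0.999),          # beta_1, beta_2
    "batch_size": 4096,
    "weight_decay": 0.1,            # high weight decay helps transfer
    "lr_schedule": "linear warmup, then linear decay",
}
finetune = {
    "optimizer": "SGD with momentum",
    "batch_size": 512,
}
```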
  18. Experiments: Comparison to State of the Art
 • BiT-L: Big Transfer, which performs supervised transfer learning with large ResNets
 • Noisy Student: a large EfficientNet trained using semi-supervised learning
 Kolesnikov et al. Big Transfer (BiT): General Visual Representation Learning. ECCV 2020
 Xie et al. Self-training with Noisy Student Improves ImageNet Classification. CVPR 2020
  19. Experiments: Comparison to State of the Art
  20. Experiments: Pre-training Data Requirements (accuracy improves with larger pre-training datasets)
  21. Experiments: Scaling Study
  22. Experiments: Inspecting Vision Transformer
 • The learned embedding components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch
 • Attention distance is analogous to receptive field size in CNNs
  23. Conclusion
 • Application of Transformers to image recognition
  - no image-specific inductive biases in the architecture
  - interpret an image as a sequence of patches and process it with a standard Transformer encoder
  - this simple yet scalable strategy works
  - matches or exceeds the S.O.T.A. while being cheap to pre-train
 • Many challenges remain
  - other computer vision tasks, such as detection and segmentation
  - further scaling of ViT
  24. Q&A
 • ViT for segmentation
 • Fine-tuning on grayscale datasets
  25. Thank you


