An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
Anonymous (ICLR 2021 under review)
Yonsei University Severance Hospital CCIDS
Choi Dongmin
Abstract
• Transformer
- the standard architecture for NLP
• Convolutional Networks
- in vision, attention is applied alongside CNNs or replaces some of their components, keeping their overall structure in place
• Transformer in Computer Vision
- a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks
- achieves S.O.T.A results with substantially less training compute when pre-trained on large datasets
Introduction
Vaswani et al. Attention Is All You Need. NIPS 2017
Transformer, BERT: self-attention-based architectures
The dominant approach: pre-training on a large text corpus and then fine-tuning on a smaller task-specific dataset
Introduction
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020

Wang et al. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020
Self-attention in CV inspired by NLP: DETR, Axial-DeepLab
However, classic ResNet-like architectures are still S.O.T.A
Introduction
• Applying a Transformer Directly to Images
- with the fewest possible modifications
- the sequence of linear embeddings of the image patches is the input
- image patches are treated like tokens (words) in NLP
• Small Scale Training
- achieves accuracies below ResNets of comparable size
- Transformers lack some inductive biases inherent to CNNs (such as translation equivariance and locality)
• Large Scale Training
- large-scale training trumps (surpasses) inductive bias
- excellent results when pre-trained at sufficient scale and transferred
Related Works
Transformer
Vaswani et al. Attention Is All You Need. NIPS 2017
Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL 2019
Radford et al. Improving language understanding with unsupervised learning. Technical Report 2018
- Standard model in NLP tasks
- Built on attention modules, without recurrence (no RNN)
- Encoder-decoder
- Requires a large-scale dataset and high computational cost
- Pre-training and fine-tuning approaches: BERT & GPT
Method
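Image $x \in \mathbb{R}^{H \times W \times C}$ → a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$
Trainable linear projection maps $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$ → $x_p E \in \mathbb{R}^{N \times D}$
* because the Transformer uses a constant width, the model dimension $D$, through all of its layers
Learnable position embedding $E_{pos} \in \mathbb{R}^{(N+1) \times D}$
* to retain positional information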
Figure label: $z_L^0$
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L99-L111
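The patch-embedding step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the linked vit-pytorch code: the sizes (224×224×3 image, P = 16, D = 768) are illustrative, `patchify` is a hypothetical helper, the weights are random stand-ins for trained parameters, and the prepended class token follows the ViT paper (it is why $E_{pos}$ has N + 1 rows).

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)

rng = np.random.default_rng(0)
H, W, C, P, D = 224, 224, 3, 16, 768             # illustrative sizes (assumed)

x = rng.standard_normal((H, W, C))               # stand-in for an input image
x_p = patchify(x, P)                             # (N, P^2 * C)
N = x_p.shape[0]                                 # N = H * W / P^2

E = rng.standard_normal((P * P * C, D)) * 0.02   # linear projection (random here, trained in practice)
x_class = np.zeros((1, D))                       # class token (zeros here, learned in practice)
E_pos = rng.standard_normal((N + 1, D)) * 0.02   # position embedding (random here, learned in practice)

# Prepend the class token to the projected patches, then add position embeddings.
z0 = np.concatenate([x_class, x_p @ E], axis=0) + E_pos
print(x_p.shape, z0.shape)                       # (196, 768) (197, 768)
```

With these sizes each flattened patch has 16 · 16 · 3 = 768 values, so the projection happens to be square; in general $P^2 \cdot C$ and $D$ differ.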
Method
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py
$z \in \mathbb{R}^{N \times D}$: input sequence
Attention weight $A_{ij}$: similarity between $q_i$ and $k_j$
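A single attention head over the input sequence can be sketched as follows. This is a minimal NumPy illustration, not the linked vit-pytorch code; the sizes (N = 5, D = 8, head dimension 4), the random weights, and the scaled dot-product form of the similarity are assumptions of the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, W_q, W_k, W_v):
    """Single-head self-attention over a sequence z of shape (N, D)."""
    q, k, v = z @ W_q, z @ W_k, z @ W_v
    d_h = q.shape[-1]
    # A[i, j]: normalized similarity between query q_i and key k_j
    A = softmax(q @ k.T / np.sqrt(d_h), axis=-1)
    return A @ v, A

rng = np.random.default_rng(0)
N, D, D_h = 5, 8, 4                            # illustrative sizes (assumed)
z = rng.standard_normal((N, D))
W_q, W_k, W_v = (rng.standard_normal((D, D_h)) for _ in range(3))

out, A = self_attention(z, W_q, W_k, W_v)
print(out.shape, A.shape)                      # (5, 4) (5, 5)
```

Each row of `A` sums to 1, so every output token is a weighted average of the value vectors of all tokens.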
Method
Hybrid Architecture
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Flattened intermediate feature maps of a ResNet are used as the input sequence, as in DETR
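A rough sketch of the hybrid input: the spatial dimensions of a CNN feature map are flattened so that each location becomes one token. The feature-map shape (14×14×1024), the model width, and the random projection are assumptions for illustration; no actual ResNet is run here.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c, D = 14, 14, 1024, 768                   # assumed feature-map and model sizes
feature_map = rng.standard_normal((h, w, c))     # stand-in for an intermediate ResNet output

tokens = feature_map.reshape(h * w, c)           # one token per spatial location
E = rng.standard_normal((c, D)) * 0.02           # linear projection (random here, trained in practice)
sequence = tokens @ E                            # (h*w, D) input sequence for the Transformer
print(sequence.shape)                            # (196, 768)
```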
Method
Fine-tuning and Higher Resolution
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Remove the pre-trained prediction head and attach a zero-initialized $D \times K$ feedforward layer ($K$ = the number of downstream classes)
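The effect of a zero-initialized head can be checked with a small NumPy sketch. The sizes (D = 768, K = 10), the bias term, and the random features standing in for pre-trained image representations are assumptions of this example.

```python
import numpy as np

D, K = 768, 10                                   # assumed model width and number of downstream classes
W_head = np.zeros((D, K))                        # zero-initialized D x K layer replacing the old head
b_head = np.zeros(K)                             # bias term is an assumption of this sketch

rng = np.random.default_rng(0)
features = rng.standard_normal((4, D))           # stand-in for pre-trained image representations
logits = features @ W_head + b_head              # all zeros at the start of fine-tuning
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs[0, :3])                              # each class starts at probability 1/K
```

Starting from all-zero logits means the new head makes no initial preference among the K classes, whatever the pre-trained features are.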
Experiments
• Datasets

< Pre-training >

- ILSVRC-2012 ImageNet dataset : 1k classes / 1.3M images

- ImageNet-21k : 21k classes / 14M images

- JFT : 18k classes / 303M images

< Downstream (Fine-tuning) >

- ImageNet, ImageNet ReaL, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB
• Model Variants, e.g. ViT-L/16 = the “Large” variant with 16×16 input patch size
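The patch size in the model name fixes the sequence length, N = HW / P². A short sketch, assuming a 224×224 input (the resolution is an assumption here) and a hypothetical helper `num_patches`:

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping P x P patches in an H x W image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch_size) * (width // patch_size)

for p in (32, 16, 14):
    print(f"patch {p}x{p}: {num_patches(224, 224, p)} tokens")
# patch 32x32: 49 tokens
# patch 16x16: 196 tokens
# patch 14x14: 256 tokens
```

Sequence length grows with the inverse square of the patch size, so variants with smaller patches are more expensive to run.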
Experiments
• Training & Fine-tuning
< Pre-training >
- Adam with β1 = 0.9, β2 = 0.999
- Batch size 4,096
- Weight decay 0.1 (high weight decay is useful for transfer models)
- Linear learning rate warmup and decay
< Fine-tuning >
- SGD with momentum, batch size 512
• Metrics
- Few-shot (for fast on-the-fly evaluation)
- Fine-tuning accuracy
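A linear warmup-and-decay schedule can be sketched in plain Python. The function name, the base learning rate, the step counts, and the decay to exactly zero are hypothetical choices for illustration; the slides do not give these values.

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)

# Hypothetical settings: peak 1e-3, 10 warmup steps, 110 steps in total.
for step in (0, 5, 10, 60, 110):
    print(step, linear_warmup_decay(step, 1e-3, 10, 110))
```

The rate rises from 0 to the peak over the warmup steps, then falls linearly and reaches 0 at the final step.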
Experiments
• Comparison to State of the Art
Kolesnikov et al. Big Transfer (BiT): General Visual Representation Learning. ECCV 2020

Xie et al. Self-training with Noisy Student improves ImageNet classification. CVPR 2020
* BiT-L : Big Transfer, which performs supervised transfer learning with large ResNets

* Noisy Student : a large EfficientNet trained using semi-supervised learning
Experiments
• Pre-training Data Requirements
[Figure annotations: Larger Dataset]
Experiments
• Scaling Study
Experiments
• Inspecting Vision Transformer
The principal components of the learned patch embedding filters resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch
Attention distance is analogous to receptive field size in CNNs
Conclusion
• Application of Transformers to Image Recognition
- no image-specific inductive biases in the architecture
- interprets an image as a sequence of patches and processes it with a standard Transformer encoder
- this simple yet scalable strategy works
- matches or exceeds the S.O.T.A while being relatively cheap to pre-train
• Many Challenges Remain
- other computer vision tasks, such as detection and segmentation
- further scaling of ViT
Q&A
• ViT for Segmentation
• Fine-tuning on Grayscale Dataset
Thank you
