An Image is Worth 16x16 Words: Transformers for
Image Recognition at Scale
2022/05/02, Changjin Lee
Introduction
● The Transformer had become the de facto standard for NLP tasks, but was not yet mature in
computer vision (at the time ViT was published)
● Many ways of integrating transformers into computer vision had been proposed
○ Partially replacing CNN layers with transformer blocks
○ Using attention in conjunction with CNNs
● ViT proposes a fully transformer-based architecture for image classification
Review: Transformer
embedding vectors
Self-Attention: Q, K, V
Create a query, key, and value vector from each embedding vector
Self-Attention: Score
For each query, take its dot product with every key vector (including its own) to get the scores
Self-Attention: Normalize
Scale the scores: divide by the square root of the key dimension, sqrt(d_k)
Self-Attention: Softmax
Apply softmax to the scaled scores to obtain the attention weights
Self-Attention: Multiply with V
Multiply each value vector by its attention weight
Self-Attention: Weighted Sum
Sum the weighted value vectors to get the final output of self-attention
All-in-one: Matrix Multiplication
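The individual self-attention steps collapse into a few matrix products. A minimal NumPy sketch, assuming a single head and randomly initialised projection matrices (the names `W_q`, `W_k`, `W_v` and all sizes are illustrative, not taken from the slides); the scores are scaled by the square root of the key dimension, as in [2]:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (n_tokens, d_model); W_q, W_k, W_v: (d_model, d_k).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # query, key, value vectors
    scores = Q @ K.T                          # (n_tokens, n_tokens) dot products
    scores = scores / np.sqrt(K.shape[-1])    # scale by sqrt(d_k)
    attn = softmax(scores, axis=-1)           # attention weights, each row sums to 1
    return attn @ V, attn                     # weighted sum of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))                   # 2 tokens, d_model = 4 (assumed sizes)
W_q, W_k, W_v = (rng.normal(size=(4, 3)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.sum(axis=-1))
```

Row i of `attn` holds the weights that token i places on every token in the sequence, itself included.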
Multi-head attention
Then how do we apply transformer to image classification?
ViT
● Input image: H x W x C, split into patches of size P x P
● # of patches: N = HW / P^2
● Flatten each patch into a vector of length P^2 * C
● Each patch is a token!!
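Splitting an image into flattened patches can be done with reshapes alone. A small sketch, assuming a single image stored channels-last as (H, W, C) and a toy 4x4 input; the function name `patchify` is hypothetical:

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into N = (H // P) * (W // P) flattened patches."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    x = img.reshape(H // P, P, W // P, P, C)   # split both spatial axes into blocks
    x = x.transpose(0, 2, 1, 3, 4)             # (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)            # (N, P*P*C): one row per patch

img = np.arange(4 * 4 * 3).reshape(4, 4, 3)    # toy 4x4 RGB image (assumed size)
patches = patchify(img, P=2)
print(patches.shape)                           # (4, 12)
```

Each row of `patches` is one token before projection; row 0 is the top-left 2x2 block of the image.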
ViT Architecture Overview
[1] Split into patches & flatten
[2] Linear Projection
[3] Class Tokens
[4] Position Embedding
[5] Encoder block 1 (MHA)
[6] Encoder block 2 (MLP)
(steps [5] and [6] are repeated L times)
[7] Classification Head
Image: (B, C, H, W)
1. Patch Embedding
2. Transformer Encoder
3. Classification Head
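To make the three stages concrete, the tensor shapes can be traced by hand. A sketch assuming the ViT-Base/16 setting from [1] (224x224 RGB input, P = 16, embedding size D = 768); the batch size and number of classes are arbitrary choices, and `vit_shapes` is a hypothetical helper:

```python
def vit_shapes(B=8, C=3, H=224, W=224, P=16, D=768, num_classes=1000):
    """Return the tensor shape after each stage of the ViT pipeline."""
    N = (H // P) * (W // P)                  # number of patches
    return {
        "image": (B, C, H, W),
        "flattened patches": (B, N, P * P * C),
        "after linear projection": (B, N, D),
        "after class token": (B, N + 1, D),
        "after position embedding": (B, N + 1, D),
        "after L encoder layers": (B, N + 1, D),
        "class token output": (B, D),
        "logits": (B, num_classes),
    }

for name, shape in vit_shapes().items():
    print(f"{name:26s} {shape}")
```

The sequence length stays at N + 1 through the whole encoder; only the classification head reduces it to a single vector per image.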
Flatten & Linear Projection
1. Split into patches and flatten
2. Linear projection to get embedded vector
a. Trainable linear projection
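The linear projection is a single matrix multiplication applied to every flattened patch. A sketch with assumed toy sizes; `E` and `b` stand for the trainable weights and are randomly initialised here rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
N, patch_dim, D = 4, 12, 8                     # assumed toy sizes
patches = rng.normal(size=(N, patch_dim))      # stand-in for flattened patches

# Trainable parameters of the projection (randomly initialised here).
E = rng.normal(scale=0.02, size=(patch_dim, D))
b = np.zeros(D)

tokens = patches @ E + b                       # one D-dimensional embedding per patch
print(tokens.shape)                            # (4, 8)
```

Many implementations realise the same operation as a convolution whose kernel size and stride both equal P, which splits and projects in one step.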
Class Token
● Prepend a learnable embedding, the class token, to the sequence of embedded patches
● The class token at the output of the encoder serves as the image representation
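Prepending the class token is a concatenation along the sequence axis. A sketch with assumed toy sizes; the token is a learnable parameter in practice and is set to zeros here only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 2, 4, 8                              # assumed toy sizes
tokens = rng.normal(size=(B, N, D))            # embedded patches

cls_token = np.zeros((1, 1, D))                # learnable parameter, shared by all images
cls = np.broadcast_to(cls_token, (B, 1, D))    # one copy per image in the batch
x = np.concatenate([cls, tokens], axis=1)      # (B, N + 1, D), class token at index 0
print(x.shape)                                 # (2, 5, 8)
```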
Position Embedding
● A plain transformer has no built-in information about the order of the patches
● ViT adds a learnable 1D position embedding to each token
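A learnable 1D position embedding amounts to one extra vector per sequence position, added to the tokens. A sketch with assumed toy sizes; `pos_embed` would be trained together with the rest of the model and is random here:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 2, 4, 8                              # assumed toy sizes
x = rng.normal(size=(B, N + 1, D))             # class token + patch embeddings

# One learnable D-dimensional vector per sequence position (1D indexing).
pos_embed = rng.normal(scale=0.02, size=(1, N + 1, D))
x_with_pos = x + pos_embed                     # broadcast over the batch
print(x_with_pos.shape)                        # (2, 5, 8)
```

Position i receives the same vector in every image, so the model can learn what "patch number i" means.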
Transformer Encoder
● Encoder Block 1
○ Norm -> MHA -> Skip Connection
● Encoder Block 2
○ Norm -> MLP -> Skip Connection
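The two blocks share one pattern: normalise, apply the sublayer, add the input back. A structural sketch in which `mha` and `mlp` are passed in as callables; the zero-valued placeholders below exist only so the example runs, and the layer norm omits its learnable scale and shift:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each token over its feature dimension (no learnable scale/shift here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(x, mha, mlp):
    """One pre-norm encoder layer; `mha` and `mlp` must preserve the token shape."""
    x = x + mha(layer_norm(x))    # Encoder block 1: Norm -> MHA -> skip connection
    x = x + mlp(layer_norm(x))    # Encoder block 2: Norm -> MLP -> skip connection
    return x

# Placeholder sublayers so the sketch runs; real ones are attention and an MLP.
x = np.ones((5, 8))
y = encoder_layer(x, mha=lambda t: np.zeros_like(t), mlp=lambda t: np.zeros_like(t))
print(np.array_equal(x, y))       # True: with zero sublayers only the skip path remains
```

Stacking L such layers, each with its own weights, gives the full encoder.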
Encoder Block 1: Multi-head Attention
● Vectorized Implementation
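In a vectorized implementation all heads are computed at once by reshaping instead of looping. A NumPy sketch under stated assumptions: a fused query/key/value weight `W_qkv`, an output projection `W_out`, no biases, and toy sizes; the slide's own code is not reproduced here:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_qkv, W_out, num_heads):
    """x: (B, N, D); W_qkv: (D, 3*D); W_out: (D, D). Biases omitted for brevity."""
    B, N, D = x.shape
    d_head = D // num_heads
    qkv = x @ W_qkv                                          # (B, N, 3*D) in one matmul
    qkv = qkv.reshape(B, N, 3, num_heads, d_head)
    q, k, v = np.moveaxis(qkv, 2, 0)                         # each (B, N, heads, d_head)
    q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))   # (B, heads, N, d_head)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)   # (B, heads, N, N)
    attn = softmax(scores, axis=-1)
    out = attn @ v                                           # (B, heads, N, d_head)
    out = out.transpose(0, 2, 1, 3).reshape(B, N, D)         # concatenate the heads
    return out @ W_out

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))                               # assumed toy sizes
W_qkv = rng.normal(scale=0.1, size=(8, 24))
W_out = rng.normal(scale=0.1, size=(8, 8))
y = multi_head_attention(x, W_qkv, W_out, num_heads=2)
print(y.shape)                                               # (2, 5, 8)
```

Each head attends over all N tokens but only sees a D / num_heads slice of the features.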
Encoder Block 2: MLP
● Two linear layers with a GELU non-linearity in between
● Dropout
● Expansion: the hidden layer is wider than the embedding dimension
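The three ingredients fit into a short function. A sketch assuming an expansion factor of 4, the tanh approximation of GELU (NumPy has no built-in error function), and inverted dropout; all weights are random stand-ins:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def dropout(x, p, rng, training):
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p          # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)              # rescale so the expected value is unchanged

def mlp(x, W1, b1, W2, b2, p=0.1, rng=None, training=False):
    h = gelu(x @ W1 + b1)                    # expand: D -> expansion * D
    h = dropout(h, p, rng, training)
    y = h @ W2 + b2                          # project back: expansion * D -> D
    return dropout(y, p, rng, training)

rng = np.random.default_rng(0)
D, expansion = 8, 4                          # expansion factor 4 is an assumption
W1, b1 = rng.normal(scale=0.1, size=(D, expansion * D)), np.zeros(expansion * D)
W2, b2 = rng.normal(scale=0.1, size=(expansion * D, D)), np.zeros(D)
x = rng.normal(size=(5, D))
print(mlp(x, W1, b1, W2, b2).shape)          # (5, 8)
```

The MLP acts on every token independently, which is why it is the only local part of the encoder.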
Classification Head
● A classification head is attached to the class token output for the final prediction
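The head only looks at the class token position of the encoder output. A sketch assuming a single linear layer as the head, with the final normalisation left out and toy sizes throughout; `W_head` and `b_head` are hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D, num_classes = 2, 4, 8, 3              # assumed toy sizes
encoded = rng.normal(size=(B, N + 1, D))       # encoder output, class token at index 0

cls_out = encoded[:, 0]                        # (B, D) image representation
W_head = rng.normal(scale=0.02, size=(D, num_classes))
b_head = np.zeros(num_classes)
logits = cls_out @ W_head + b_head             # (B, num_classes)
pred = logits.argmax(axis=-1)                  # predicted class per image
print(logits.shape, pred.shape)                # (2, 3) (2,)
```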
ViT: Putting it all together
Transformer(ViT) needs a lot of data!
● On smaller datasets, ViT performs worse than ResNet-based models!
But.. WHY?
Inductive Bias
● An inductive bias is any assumption a model makes in order to generalize to unseen data
● House price prediction
○ Features: house size, # floors, # bedrooms
○ Model 1: Plain MLPs with billions of parameters (almost no assumptions)
■ Needs TONS of data to figure out the underlying relationship from scratch
○ Model 2: Linear regression
■ We “assume” that the features are related to the house price linearly
■ If our assumption is correct -> more data-efficient!
● Relational Inductive Bias
○ An assumption about the relationships between entities in the network (e.g. entities = pixels)
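The house price example can be reproduced on a toy scale. In this sketch the data are made up (price = 3 * size + 10, one feature only), and a 1-nearest-neighbour lookup stands in for the assumption-free model instead of a large MLP, so it illustrates the idea rather than the exact models named above:

```python
import numpy as np

# Hypothetical data: price = 3 * size + 10 (arbitrary units, made up for illustration).
size_train = np.array([1.0, 2.0, 3.0])
price_train = 3.0 * size_train + 10.0
size_test, price_test = 10.0, 40.0

# Model with a linear inductive bias: least-squares fit of price = w * size + b.
A = np.stack([size_train, np.ones_like(size_train)], axis=1)
(w, b), *_ = np.linalg.lstsq(A, price_train, rcond=None)
linear_pred = w * size_test + b

# Model without that assumption: 1-nearest-neighbour lookup of the closest training house.
nearest = np.abs(size_train - size_test).argmin()
nn_pred = price_train[nearest]

print(abs(linear_pred - price_test) < 1e-6)    # True: the correct assumption extrapolates
print(abs(nn_pred - price_test))               # 21.0: the flexible model needs more data
```

Three samples are enough for the linear model because its assumption matches the data; the flexible model would need training houses near every size it is asked about.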
CNN - Relational Inductive Bias
● Locality
○ Use kernels which capture local relationships between the entities in the kernel
● 2D Neighborhood Structure
● Translation Equivariance: shifting the input shifts the output by the same amount
● Translation Invariance: shifting the input leaves the output unchanged
● Good for image-related tasks!
Translation Equivariance
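Both properties can be checked numerically on a 1D signal. A sketch assuming circular (wrap-around) shifts and padding so that the equalities hold exactly; with zero padding, as in most CNNs, they hold only away from the borders. The signal and kernel values are arbitrary:

```python
import numpy as np

def circular_conv1d(x, kernel):
    """Cross-correlation with wrap-around padding: same weights at every position."""
    return sum(w * np.roll(x, -i) for i, w in enumerate(kernel))

x = np.array([0.0, 1.0, 3.0, 0.0, 0.0, 2.0])
kernel = np.array([1.0, -1.0, 0.5])
shift = 2

# Equivariance: shifting the input shifts the feature map by the same amount.
lhs = circular_conv1d(np.roll(x, shift), kernel)
rhs = np.roll(circular_conv1d(x, kernel), shift)
print(np.allclose(lhs, rhs))                             # True

# Invariance: global max pooling over the feature map ignores the shift.
print(lhs.max() == circular_conv1d(x, kernel).max())     # True
```

Convolution alone gives equivariance; invariance only appears once a pooling step discards the position.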
Transformer - Inductive Bias
● The transformer has a much weaker image-specific inductive bias than a CNN
● In ViT, only the MLP layers are local and translationally equivariant
● Self-attention is global!
● The 2D neighborhood structure is only used when
○ The image is cut into patches
○ Position embeddings are adjusted for a different image resolution at fine-tuning time
● Because of this weak inductive bias, the transformer needs a very large dataset to learn the 2D
positions of the patches and all spatial relations between them from scratch
● With small to medium datasets, ViT performs worse than CNNs, but with large datasets ViT
outperforms CNNs
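How little self-attention knows about spatial layout can be shown directly: without position embeddings, shuffling the patches merely shuffles the outputs in the same way. A sketch with assumed toy sizes and random single-head weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
N, D = 6, 4                                    # 6 patch tokens, no position embedding
X = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))
perm = rng.permutation(N)                      # arbitrary reordering of the patches

out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)
print(np.allclose(out_perm, out[perm]))        # True: patch order carries no information
```

Any notion of which patch sits next to which therefore has to come from the position embeddings and be learned from data.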
References
[1] https://arxiv.org/abs/2010.11929
[2] https://arxiv.org/abs/1706.03762
[3] https://jalammar.github.io/illustrated-transformer/
[4] https://github.com/FrancescoSaverioZuppichini/ViT

Transformers for Image Recognition: ViT Architecture Explained

  • 1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale 2022/05/02, Changjin Lee
  • 2. Introduction ● Transformer has become the de-facto standard for NLP tasks but was not mature in computer vision tasks (by the time ViT was published) ● Many proposed ways of integrating transformer into computer vision tasks ○ Partially replacing CNN layers with transformer blocks ○ Conjunction with CNN ● ViT suggests fully transformer-based architecture for image classification
  • 4. Self-Attention: Q, K, V Create query, key, and value vectors from the embedded vectors
  • 5. Self-Attention: Score For each query, take the dot product with every key vector (including its own) to get the scores
  • 6. Self-Attention: Normalize Scale the scores by dividing by the square root of the key vector dimension, √d_k
  • 7. Self-Attention: Softmax Apply softmax to the scores to obtain the attention weights
  • 8. Self-Attention: multiply with V Multiply each value vector by its attention weight
  • 9. Self-Attention: Weighted Sum Sum the weighted value vectors to get the final output of self-attention
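The self-attention steps on slides 4 to 9 can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the deck's own code: the names `self_attention`, `w_q`, `w_k`, `w_v` and all sizes are assumptions chosen for the example, and the weights are random rather than trained. The scores are scaled by the square root of the key dimension, as in the original Transformer paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (n_tokens, d_model) matrix."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # slide 4: queries, keys, values
    scores = q @ k.T                         # slide 5: every query against every key
    scores = scores / np.sqrt(k.shape[-1])   # slide 6: scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)       # slide 7: attention weights, rows sum to 1
    return weights @ v, weights              # slides 8-9: weighted sum of the values

# Hypothetical sizes and random (untrained) projection matrices.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Writing the steps as matrix products handles all queries at once, which is the "all-in-one" matrix form of the per-token procedure.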
  • 12. Then how do we apply the transformer to image classification?
  • 13. ViT ● Split an H × W image with C channels into P × P patches ● Number of patches: N = HW / P² ● Flatten each patch into a vector of length P² · C ● Each patch is a token!
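In the notation of the ViT paper, the reshaping on this slide can be written as follows. The worked numbers assume a 224 × 224 RGB image and 16 × 16 patches, which are illustrative values rather than figures taken from the slides.

```latex
x \in \mathbb{R}^{H \times W \times C}
\;\longrightarrow\;
x_p \in \mathbb{R}^{N \times (P^2 \cdot C)},
\qquad
N = \frac{HW}{P^2}

% Example with H = W = 224, C = 3, P = 16:
N = \frac{224 \cdot 224}{16^2} = 196,
\qquad
P^2 \cdot C = 16 \cdot 16 \cdot 3 = 768
```

Under these assumptions the image becomes a sequence of 196 tokens, each a 768-dimensional vector before the linear projection.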
  • 14. ViT Architecture Overview ● Input image: (B, C, H, W) ● Steps: [1] Split into patches & flatten [2] Linear Projection [3] Class Tokens [4] Position Embedding [5] Encoder block 1 (MHA) [6] Encoder block 2 (MLP), with [5] and [6] repeated L times [7] Classification Head ● Stages: 1. Patch Embedding 2. Transformer Encoder 3. Classification Head
  • 15. Flatten & Linear Projection 1. Split the image into patches and flatten each patch 2. Apply a trainable linear projection to each flattened patch to get its embedding vector
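Steps [1] and [2] can be sketched in NumPy as below. The helper name `patchify`, the image size, the patch size and the embedding width are all assumptions for illustration, and the projection weights are random stand-ins for trained parameters.

```python
import numpy as np

def patchify(images, patch_size):
    """Split (B, C, H, W) images into flattened patches of shape (B, N, P*P*C)."""
    b, c, h, w = images.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by the patch size"
    gh, gw = h // p, w // p                    # patches along height and width
    x = images.reshape(b, c, gh, p, gw, p)     # split H into (gh, p) and W into (gw, p)
    x = x.transpose(0, 2, 4, 3, 5, 1)          # (B, gh, gw, p, p, C)
    return x.reshape(b, gh * gw, p * p * c)    # one flattened vector per patch

# Hypothetical sizes: two 32x32 RGB images cut into 16x16 patches.
rng = np.random.default_rng(0)
images = rng.normal(size=(2, 3, 32, 32))
patches = patchify(images, patch_size=16)      # shape (2, 4, 768)

# Step [2]: one trainable linear projection shared by all patches (random here).
embed_dim = 64
w_proj = rng.normal(size=(patches.shape[-1], embed_dim)) * 0.02
b_proj = np.zeros(embed_dim)
tokens = patches @ w_proj + b_proj             # shape (2, 4, 64)
print(patches.shape, tokens.shape)
```

Many implementations fuse both steps into a single strided convolution whose kernel size and stride equal the patch size; the explicit reshape above is used only because it mirrors the slide.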
  • 16. Class Token ● Prepend a learnable embedding, the class token, to the sequence of embedded patches ● The state of the class token at the output of the encoder serves as the image representation
  • 17. Position Embedding ● A plain transformer has no information about the order of the patches ● ViT adds a learnable 1D position embedding to each token
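Steps [3] and [4] amount to one concatenation and one addition. The sketch below assumes the patch embeddings are already computed; `add_class_and_position`, `cls_token`, `pos_embed` and the sizes are illustrative names and values, and the two "learnable" arrays are simply randomly initialised here.

```python
import numpy as np

def add_class_and_position(patch_tokens, cls_token, pos_embed):
    """Prepend the class token, then add one position embedding per token."""
    b = patch_tokens.shape[0]
    cls = np.broadcast_to(cls_token, (b, 1, cls_token.shape[-1]))  # same token for every image
    x = np.concatenate([cls, patch_tokens], axis=1)                # (B, N + 1, D)
    return x + pos_embed                                           # broadcast over the batch

# Hypothetical sizes: 2 images, 4 patches each, 64-dimensional embeddings.
rng = np.random.default_rng(0)
batch, n_patches, embed_dim = 2, 4, 64
patch_tokens = rng.normal(size=(batch, n_patches, embed_dim))

# Parameters that would be learned during training (random initial values here).
cls_token = rng.normal(size=(1, 1, embed_dim)) * 0.02
pos_embed = rng.normal(size=(1, n_patches + 1, embed_dim)) * 0.02

x = add_class_and_position(patch_tokens, cls_token, pos_embed)
print(x.shape)
```

The position table has N + 1 rows because the class token also occupies a position in the sequence.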
  • 18. Transformer Encoder ● Encoder Block 1 ○ Norm -> MHA -> Skip Connection ● Encoder Block 2 ○ Norm -> MLP -> Skip Connection
  • 19. Encoder Block 1: Multi-head Attention ● Vectorized Implementation
  • 20. Encoder Block 2: MLP ● GELU non-linearity ● Dropout ● Expansion: the hidden layer is wider than the embedding dimension (4× in ViT-Base, 768 -> 3072)
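The two encoder blocks (Norm -> MHA -> skip, then Norm -> MLP -> skip) can be sketched as one NumPy function. Several simplifications are assumptions of this sketch and not taken from the slides: dropout is omitted (inference mode), the layer norm has no learnable scale or shift, biases are left out of the attention projections, GELU uses its tanh approximation, and all weights are random. The names `encoder_layer`, `params` and the sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each token vector; learnable scale and shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_attention(x, w_qkv, w_out, n_heads):
    """Vectorised multi-head self-attention over (B, N, D) tokens."""
    b, n, d = x.shape
    d_head = d // n_heads
    qkv = (x @ w_qkv).reshape(b, n, 3, n_heads, d_head)
    q, k, v = qkv.transpose(2, 0, 3, 1, 4)                    # each (B, heads, N, d_head)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)    # (B, heads, N, N)
    out = softmax(scores) @ v                                 # (B, heads, N, d_head)
    out = out.transpose(0, 2, 1, 3).reshape(b, n, d)          # concatenate the heads
    return out @ w_out

def mlp(x, w1, b1, w2, b2):
    # Expand to the hidden width, apply GELU, project back to D.
    return gelu(x @ w1 + b1) @ w2 + b2

def encoder_layer(x, params, n_heads):
    # Encoder block 1: Norm -> MHA -> skip connection.
    x = x + multi_head_attention(layer_norm(x), params["w_qkv"], params["w_out"], n_heads)
    # Encoder block 2: Norm -> MLP -> skip connection.
    return x + mlp(layer_norm(x), params["w1"], params["b1"], params["w2"], params["b2"])

# Hypothetical sizes: 2 images, 5 tokens, D = 64, 8 heads, 4x MLP expansion.
rng = np.random.default_rng(0)
d, hidden, n_heads = 64, 256, 8
params = {
    "w_qkv": rng.normal(size=(d, 3 * d)) * 0.02,
    "w_out": rng.normal(size=(d, d)) * 0.02,
    "w1": rng.normal(size=(d, hidden)) * 0.02, "b1": np.zeros(hidden),
    "w2": rng.normal(size=(hidden, d)) * 0.02, "b2": np.zeros(d),
}
x = rng.normal(size=(2, 5, d))
y = encoder_layer(x, params, n_heads)
print(y.shape)
```

Because the output has the same shape as the input, the layer can be stacked L times by calling `encoder_layer` in a loop with a separate `params` dictionary per layer.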
  • 21. Classification Head ● A classification head is attached to the encoder output for the final prediction
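A minimal sketch of step [7], assuming the head reads out the class token (position 0), normalises it and applies a single linear layer. Some implementations average all tokens instead of using the class token, and the head may contain a hidden layer, so treat the structure, the name `classification_head` and the sizes here as illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Learnable scale and shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def classification_head(encoded, w_head, b_head):
    """Map encoder output (B, N + 1, D) to class logits (B, n_classes)."""
    cls_out = layer_norm(encoded[:, 0])     # class token state = image representation
    return cls_out @ w_head + b_head

# Hypothetical sizes: 2 images, 5 tokens, D = 64, 10 classes; random weights.
rng = np.random.default_rng(0)
encoded = rng.normal(size=(2, 5, 64))
n_classes = 10
w_head = rng.normal(size=(64, n_classes)) * 0.02
b_head = np.zeros(n_classes)

logits = classification_head(encoded, w_head, b_head)
predicted = logits.argmax(axis=-1)          # predicted class index per image
print(logits.shape, predicted.shape)
```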
  • 22. ViT: Putting it all together [1] Split into patches & flatten [2] Linear Projection [3] Class Tokens [4] Position Embedding [5] Encoder block 1 (MHA) [6] Encoder block 2 (MLP) [7] Classification Head
  • 23. Transformer (ViT) needs a lot of data! ● On smaller datasets, ViT performs worse than ResNet-based models; it catches up only as the dataset gets larger ● But why?
  • 24. Inductive Bias ● An inductive bias is any assumption a model makes about unseen data ● Example: house price prediction ○ Features: house size, # floors, # bedrooms ○ Model 1: Plain MLP with billions of parameters (almost no assumptions) ■ Needs tons of data to figure out the underlying relationship from scratch ○ Model 2: Linear regression ■ We "assume" that the price depends linearly on the features ■ If the assumption is correct, the model learns from far less data ● Relational Inductive Bias ○ An assumption about the relationships between entities in the network (e.g. entities = pixels)
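The house-price example can be made concrete with a toy experiment. Everything here is a hypothetical stand-in: the data are synthetic with a made-up linear ground truth, a single "size" feature replaces the three features on the slide, and a degree-7 polynomial plays the role of the flexible model instead of a large MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_price(size):
    # Hypothetical ground truth: price really is linear in (normalised) size.
    return 3.0 * size + 1.0

# Only 8 noisy training examples; a dense noise-free grid for evaluation.
x_train = np.linspace(-1.0, 1.0, 8)
y_train = true_price(x_train) + rng.normal(scale=0.3, size=x_train.shape)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = true_price(x_test)

def fit_and_score(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

linear_train, linear_test = fit_and_score(1)      # strong (and correct) assumption
flexible_train, flexible_test = fit_and_score(7)  # weak assumption, fits the noise

print(f"linear:   train={linear_train:.4f}  test={linear_test:.4f}")
print(f"flexible: train={flexible_train:.4f}  test={flexible_test:.4f}")
```

The flexible model always matches the training points at least as closely as the linear one, because it contains the linear model as a special case. With this little data it typically does so by fitting the noise, which tends to show up as a worse test error; that gap is the cost of a weak inductive bias when data are scarce.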
  • 25. CNN - Relational Inductive Bias ● Locality ○ Kernels capture local relationships between the entities (pixels) they cover ● 2D Neighborhood Structure ● Translation Equivariance: shifting the input shifts the output in the same way ● Translation Invariance: shifting the input leaves the output unchanged ● Good for image-related tasks!
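Both translation properties can be checked numerically. The sketch below assumes wrap-around (circular) padding so that equivariance holds exactly; CNNs with zero padding are only approximately equivariant near the image border. The function `circular_conv2d`, the image and the kernel are illustrative, and global max pooling stands in for the invariant part of a network.

```python
import numpy as np

def circular_conv2d(image, kernel):
    """2D cross-correlation with wrap-around padding (output has the input's shape)."""
    out = np.zeros_like(image, dtype=float)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            # np.roll with a negative shift brings pixel (y + i, x + j) to position (y, x).
            out += kernel[i, j] * np.roll(image, shift=(-i, -j), axis=(0, 1))
    return out

# Hypothetical random image and kernel.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

shift = (2, 3)
shifted = np.roll(image, shift=shift, axis=(0, 1))

out = circular_conv2d(image, kernel)
out_shifted = circular_conv2d(shifted, kernel)

# Equivariance: convolving the shifted image gives the shifted feature map.
equivariant = np.allclose(out_shifted, np.roll(out, shift=shift, axis=(0, 1)))

# Invariance: a global max pool over the feature map ignores the shift.
invariant = np.isclose(out_shifted.max(), out.max())

print(equivariant, invariant)
```

The convolution alone is equivariant, not invariant: the feature map changes when the input moves. Invariance appears only after a pooling step that discards position.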
  • 26. Transformer - Inductive Bias ● The transformer has a much weaker image-specific inductive bias than a CNN ● In ViT, only the MLP layers are local and translationally equivariant ● Self-attention is global! ● The 2D neighborhood structure is used only when ○ The image is cut into patches ○ The position embeddings are adjusted for a different image resolution at fine-tuning time ● Because of this weak inductive bias, the transformer needs a very large dataset to learn the 2D positions of the patches and all spatial relations between them from scratch ● With small to medium datasets ViT performs worse than CNNs, but with large datasets ViT outperforms CNNs
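One way to see how little spatial structure self-attention assumes: without position embeddings, shuffling the patch tokens just shuffles the outputs in the same way, so the layer cannot tell where a patch came from. The sketch below checks this with single-head attention and random weights; all names, sizes and the particular permutation are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

# Hypothetical setup: 6 patch tokens of width 8, random projection matrices.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
perm = np.array([3, 0, 5, 1, 4, 2])          # an arbitrary reordering of the patches

# Without position embeddings: permuting the input permutes the output identically.
out = self_attention(tokens, w_q, w_k, w_v)
out_perm = self_attention(tokens[perm], w_q, w_k, w_v)
order_blind = np.allclose(out_perm, out[perm])

# With position embeddings tied to sequence slots, the order starts to matter.
pos_embed = rng.normal(size=(6, 8))
out_pos = self_attention(tokens + pos_embed, w_q, w_k, w_v)
out_pos_perm = self_attention(tokens[perm] + pos_embed, w_q, w_k, w_v)
order_aware = not np.allclose(out_pos_perm, out_pos[perm])

print(order_blind, order_aware)
```

The position embeddings are the only place where order enters, and in ViT they start out random, so the 2D layout of the patches has to be learned from data.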
  • 27. References [1] https://arxiv.org/abs/2010.11929 [2] https://arxiv.org/abs/1706.03762 [3] https://jalammar.github.io/illustrated-transformer/ [4] https://github.com/FrancescoSaverioZuppichini/ViT