An Image is Worth 16x16 Words: Transformers for
Image Recognition at Scale
2022/05/02, Changjin Lee
Introduction
● The Transformer had become the de facto standard for NLP tasks, but it was not yet mature in
computer vision when ViT was published
● Many ways of integrating Transformers into computer vision tasks had been proposed
○ Partially replacing CNN layers with Transformer blocks
○ Using attention in conjunction with CNNs
● ViT proposes a fully Transformer-based architecture for image classification
Review: Transformer
Self-Attention: Q, K, V
● Create query, key, and value vectors from each embedding vector
Self-Attention: Score
● For each query, take the dot product with every key vector (including its own) to get the scores
Self-Attention: Normalize
● Scale the scores by dividing by the square root of the key dimension, sqrt(d_k)
Self-Attention: Softmax
● Apply softmax over the scaled scores to obtain the attention weights
Self-Attention: Multiply with V
● Multiply each value vector by its attention weight
Self-Attention: Weighted Sum
● Sum the weighted value vectors to get the final output of self-attention
All-in-one: Matrix Multiplication
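The self-attention steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from the slides: the function names (`softmax`, `self_attention`), the random weight matrices, and the sizes N=4 and D=8 are assumptions chosen for the example. In matrix form the steps collapse to softmax(Q K^T / sqrt(d_k)) V, as described in reference [2].

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape (N, D)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v    # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (N, N) scaled dot products
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights            # weighted sum of value vectors

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
N, D, d_k = 4, 8, 8
X = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(D, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.sum(axis=-1))
```

Each row of `weights` holds the attention one token pays to every token in the sequence, which is why the rows sum to 1.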
Multi-head attention
Then how do we apply transformer to image classification?
ViT
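● Split an image of height H, width W, and C channels into patches of size P x P
● Number of patches: N = HW / P^2
● Flatten each patch into a vector of length P^2 * C
● Each patch is a token!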
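The patch splitting can be sketched with a reshape and a transpose. This is a minimal illustration under assumed sizes (a 3 x 224 x 224 image and P = 16); the helper name `patchify` is hypothetical and not taken from the slides.

```python
import numpy as np

def patchify(image, P):
    """Split a (C, H, W) image into N = (H // P) * (W // P) flattened patches."""
    C, H, W = image.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    h, w = H // P, W // P
    # (C, h, P, w, P) -> (h, w, P, P, C): group the pixels of each patch together.
    patches = image.reshape(C, h, P, w, P).transpose(1, 3, 2, 4, 0)
    # One row per patch, each of length P * P * C.
    return patches.reshape(h * w, P * P * C)

# Assumed example input: a 3-channel 224 x 224 image with 16 x 16 patches.
image = np.arange(3 * 224 * 224, dtype=np.float32).reshape(3, 224, 224)
patches = patchify(image, P=16)
print(patches.shape)  # (196, 768)
```

With these sizes, N = 224 * 224 / 16^2 = 196 tokens, each of length 16 * 16 * 3 = 768.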
ViT Architecture Overview
[1] Split into patches & flatten
[2] Linear Projection
[3] Class Tokens
[4] Position Embedding
[5] Encoder block 1 (MHA)
[6] Encoder block 2 (MLP)
x L (steps [5] and [6] are repeated L times)
[7] Classification Head
Image: (B, C, H, W)
1. Patch Embedding
2. Transformer Encoder
3. Classification Head
Flatten & Linear Projection
1. Split into patches and flatten
2. Linear projection to get embedded vector
a. Trainable linear projection
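The linear projection is a single matrix multiplication applied to every flattened patch. The sketch below is illustrative only: the embedding width D = 64, the random stand-in for the patches, and the names `E` and `b` are assumptions, and in a real model `E` and `b` would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
N, patch_dim, D = 196, 768, 64   # D = 64 is an assumed embedding width

patches = rng.normal(size=(N, patch_dim))         # stand-in for flattened patches
E = rng.normal(scale=0.02, size=(patch_dim, D))   # projection matrix (trainable in a real model)
b = np.zeros(D)                                   # bias (trainable in a real model)

tokens = patches @ E + b                          # (N, D) patch embeddings
print(tokens.shape)
```

The same matrix `E` is shared by all patches, so the number of parameters does not depend on the number of patches.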
Class Token
● Prepend a learnable embedding (the class token) to the sequence of patch embeddings
● The state of the class token at the output of the encoder serves as the image representation
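Prepending the class token amounts to a concatenation along the sequence axis. The sketch below assumes a batch of B = 2 images, N = 196 patches, and width D = 64; the zero initialisation of `cls_token` is also an assumption, since in a real model it is a learned parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 2, 196, 64                      # assumed batch size, patch count, width

tokens = rng.normal(size=(B, N, D))       # stand-in for patch embeddings
cls_token = np.zeros((1, 1, D))           # learnable parameter in a real model

# Copy the same class token to every image in the batch and put it first.
cls = np.broadcast_to(cls_token, (B, 1, D))
x = np.concatenate([cls, tokens], axis=1)  # (B, N + 1, D)

# After the encoder, position 0 is the one read out as the image representation.
image_repr = x[:, 0]                       # (B, D)
print(x.shape, image_repr.shape)
```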
Position Embedding
● Self-attention treats its input as an unordered set, so a plain Transformer has no information about the order or position of the patches
● ViT adds a learnable 1D position embedding to every token
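Adding a learnable 1D position embedding can be sketched as one broadcasted addition. The shapes and the random initialisation below are assumptions for illustration; in a real model `pos_embed` is a trained parameter with one vector per sequence position, including the class token.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 2, 196, 64                                     # assumed sizes

x = rng.normal(size=(B, N + 1, D))                       # class token + patch embeddings
pos_embed = rng.normal(scale=0.02, size=(1, N + 1, D))   # one vector per position

# The same position vectors are added to every image in the batch.
x_pos = x + pos_embed
print(x_pos.shape)
```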
Transformer Encoder
● Encoder Block 1
○ Norm -> MHA -> Skip Connection
● Encoder Block 2
○ Norm -> MLP -> Skip Connection
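The two sub-blocks follow the same pattern: normalise, transform, then add the input back. The sketch below shows only that wiring. The names `layer_norm` and `encoder_layer` are hypothetical, the layer norm omits its learnable scale and shift for brevity, and the two sub-layers are simple stand-ins rather than real attention or a real MLP.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each token vector; learnable scale and shift are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(x, attention_fn, mlp_fn):
    x = x + attention_fn(layer_norm(x))   # block 1: Norm -> MHA -> skip connection
    x = x + mlp_fn(layer_norm(x))         # block 2: Norm -> MLP -> skip connection
    return x

# Stand-in sub-layers (assumptions, not real attention or MLP).
rng = np.random.default_rng(0)
D = 16
W_a = rng.normal(scale=0.1, size=(D, D))
W_m = rng.normal(scale=0.1, size=(D, D))
attention_fn = lambda h: h @ W_a
mlp_fn = lambda h: np.tanh(h @ W_m)

x = rng.normal(size=(5, D))
for _ in range(3):                        # stack the layer L = 3 times
    x = encoder_layer(x, attention_fn, mlp_fn)
print(x.shape)
```

Because of the skip connections, a layer whose sub-blocks output zeros leaves its input unchanged, which makes deep stacks easier to train.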
Encoder Block 1: Multi-head Attention
● Vectorized Implementation
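One common way to vectorize multi-head attention is to compute the queries, keys, and values of all heads with a single matrix multiplication and then reshape. The sketch below follows that idea under assumed sizes; the function name, the fused weight `W_qkv`, and the output projection `W_out` are illustrative choices, not the code shown on the slide.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_qkv, W_out, num_heads):
    """x: (B, N, D); W_qkv: (D, 3 * D); W_out: (D, D)."""
    B, N, D = x.shape
    d_head = D // num_heads
    qkv = x @ W_qkv                                    # (B, N, 3 * D)
    qkv = qkv.reshape(B, N, 3, num_heads, d_head)
    q, k, v = np.transpose(qkv, (2, 0, 3, 1, 4))       # each (B, heads, N, d_head)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_head)   # (B, heads, N, N)
    weights = softmax(scores, axis=-1)
    out = weights @ v                                  # (B, heads, N, d_head)
    out = np.transpose(out, (0, 2, 1, 3)).reshape(B, N, D)  # concatenate the heads
    return out @ W_out                                 # final output projection

# Assumed sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
B, N, D, num_heads = 2, 5, 16, 4
x = rng.normal(size=(B, N, D))
W_qkv = rng.normal(scale=0.1, size=(D, 3 * D))
W_out = rng.normal(scale=0.1, size=(D, D))

out = multi_head_attention(x, W_qkv, W_out, num_heads)
print(out.shape)
```

All heads are processed in one batched matrix product, so no Python loop over heads is needed.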
Encoder Block 2: MLP
● GELU non-linearity
● Dropout
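● Expansion: the hidden layer is wider than the embedding dimension (4x in the ViT paper's model configurations)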
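The MLP block can be sketched as two linear layers with a GELU in between and dropout after each. The code below is a minimal illustration: the tanh approximation of GELU, the 4x expansion, the dropout rate of 0.1, and all names and sizes are assumptions for the example.

```python
import numpy as np

def gelu(x):
    # Widely used tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def dropout(x, p, rng, training):
    # Inverted dropout: rescale at training time so inference needs no change.
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def mlp(x, W1, b1, W2, b2, p=0.1, rng=None, training=False):
    h = gelu(x @ W1 + b1)               # expand: D -> 4 * D
    h = dropout(h, p, rng, training)
    y = h @ W2 + b2                     # project back: 4 * D -> D
    return dropout(y, p, rng, training)

# Assumed sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
D = 16
W1, b1 = rng.normal(scale=0.1, size=(D, 4 * D)), np.zeros(4 * D)
W2, b2 = rng.normal(scale=0.1, size=(4 * D, D)), np.zeros(D)

x = rng.normal(size=(3, D))
y = mlp(x, W1, b1, W2, b2)              # inference mode: dropout is disabled
print(y.shape)
```

The MLP is applied to each token independently, which is why it acts locally on the sequence.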
Classification Head
● A classification head is attached to the class-token output of the encoder for the final prediction
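A minimal sketch of the head, assuming it is a single linear layer applied to the class token at position 0 of the encoder output. The sizes, the random stand-in for the encoder output, and the names `W_head` and `b_head` are assumptions; any normalisation applied before the head is left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
B, N, D, num_classes = 2, 196, 64, 10              # assumed sizes

encoded = rng.normal(size=(B, N + 1, D))           # stand-in for the encoder output
cls_out = encoded[:, 0]                            # class token of each image, (B, D)

W_head = rng.normal(scale=0.02, size=(D, num_classes))  # trainable in a real model
b_head = np.zeros(num_classes)

logits = cls_out @ W_head + b_head                 # (B, num_classes)
probs = softmax(logits, axis=-1)
pred = probs.argmax(axis=-1)                       # predicted class per image
print(probs.shape, pred.shape)
```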
ViT: Putting it all together
Transformer (ViT) needs a lot of data!
● On smaller datasets, ViT performs worse than ResNet-based models!
(Figure: two plots, each annotated "larger dataset")
But... why?
Inductive Bias
● An inductive bias is any assumption a model makes in order to generalize to unseen data
● House price prediction
○ Features: house size, # floors, # bedrooms
○ Model 1: Plain MLPs with billions of parameters (no assumption)
■ Needs TONS of data to figure out the underlying relationship from scratch
○ Model 2: Linear regression
■ We “assume” that the features are related to the house price linearly
■ If our assumption is correct -> more efficient!
● Relational Inductive Bias
○ Assumptions about the relationships between entities in the network (e.g., entities = pixels)
CNN - Relational Inductive Bias
● Locality
○ Use kernels which capture local relationships between the entities in the kernel
● 2D Neighborhood Structure
● Translation Equivariance: shifting the input shifts the output by the same amount
● Translation Invariance: shifting the input leaves the output unchanged
● Good for image-related tasks!
Translation Equivariance
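Both properties can be checked numerically on a toy 1D signal. The sketch below uses a convolution with wrap-around padding so that the equality is exact; the helper `circular_conv1d`, the signal, and the kernel are made up for the illustration. Global max pooling stands in for the invariant part of a CNN.

```python
import numpy as np

def circular_conv1d(x, kernel):
    """Cross-correlate x with kernel, wrapping around at the borders."""
    n, k = len(x), len(kernel)
    return np.array([sum(kernel[j] * x[(i + j) % n] for j in range(k))
                     for i in range(n)])

x = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])   # toy input signal
kernel = np.array([1.0, 2.0, 1.0])                        # toy kernel
shift = 2
x_shifted = np.roll(x, shift)                             # translate the input

y = circular_conv1d(x, kernel)
y_shifted = circular_conv1d(x_shifted, kernel)

# Equivariance: the feature map moves by the same amount as the input.
print(np.allclose(y_shifted, np.roll(y, shift)))   # True
# Invariance: global max pooling gives the same value for both inputs.
print(y.max() == y_shifted.max())                  # True
```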
Transformer - Inductive Bias
● The Transformer has a much weaker image-specific inductive bias than a CNN
● In ViT, only the MLP layers are local and translationally equivariant
● Self-attention is global!
● The 2D neighborhood structure is only used when
○ The image is cut into patches
○ The position embeddings are adjusted for images of a different resolution at fine-tuning time
● Because of this weak inductive bias, the Transformer needs a large dataset to learn the 2D
positions of the patches and all spatial relations between the patches from scratch
● With small to medium datasets, ViT performs worse than CNNs, but with large datasets ViT
outperforms CNNs
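The lack of built-in spatial structure can be seen directly: without position embeddings, permuting the input tokens of a self-attention layer simply permutes its outputs in the same way, so the layer cannot tell which patch was next to which. The sketch below checks this numerically; the function names, sizes, and random weights are assumptions for the illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
N, D = 6, 8                                    # assumed token count and width
X = rng.normal(size=(N, D))                    # tokens without position embeddings
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

perm = rng.permutation(N)                      # shuffle the patch order
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# The outputs are shuffled in exactly the same way as the inputs.
print(np.allclose(out_perm, out[perm]))        # True
```

Any knowledge of where a patch sits in the image therefore has to come from the position embeddings and be learned from data.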
References
[1] https://arxiv.org/abs/2010.11929
[2] https://arxiv.org/abs/1706.03762
[3] https://jalammar.github.io/illustrated-transformer/
[4] https://github.com/FrancescoSaverioZuppichini/ViT
