The document summarizes the Vision Transformer (ViT) model, which applies a transformer architecture to image classification. ViT splits an image into fixed-size patches, maps each patch to an embedding with a learned linear projection, adds position embeddings, and feeds the resulting sequence (with a prepended classification token) into a standard transformer encoder. Unlike CNNs, ViT has weak inductive biases for 2D structure such as locality and translation equivariance, so it must learn spatial relationships from data and therefore benefits from large-scale pre-training. With sufficient data, however, ViT can match or outperform comparable CNNs by leveraging global self-attention across the whole image. A minimal sketch of this pipeline follows.
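
The sketch below illustrates the patch-embed, add-positions, encode, classify-from-token flow described above. It is a minimal illustration, not the reference implementation: the class name `MiniViT`, the 224x224 input, 16x16 patches, and all hyperparameters (embedding width, depth, head count) are assumptions chosen for brevity, and it uses PyTorch's built-in `nn.TransformerEncoder` rather than the paper's exact encoder details (e.g. pre-norm placement).

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style classifier; sizes are illustrative, not the paper's."""

    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 embed_dim=256, depth=4, num_heads=4, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Patch embedding: a conv with kernel = stride = patch size splits the
        # image into non-overlapping patches and projects each one to embed_dim.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable classification token and position embeddings
        # (one position per patch plus one for the class token).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # Standard transformer encoder over the patch sequence.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: (B, C, H, W) -> (B, embed_dim, H/p, W/p) -> (B, num_patches, embed_dim)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)

        # Prepend the class token and add position embeddings.
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed

        x = self.encoder(x)

        # Classify from the class-token representation.
        return self.head(self.norm(x[:, 0]))

# Usage: a batch of two 224x224 RGB images produces two logit vectors.
model = MiniViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```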