Vishal Mittal 2017A7PS0080P
Akshit Khanna 2017A7PS0023P
Raghav Bansal 2017A3PS0196P
PAPER TITLE: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
AUTHORS' AFFILIATION: Google Research
PAPER DESCRIPTION
AIM
• Explore Transformer-based architectures for Computer Vision Tasks.
• Transformers have been the de facto standard for NLP tasks, while CNN/ResNet-like
architectures have been the state of the art for Computer Vision.
• To date, researchers have tried using attention for vision, but only in conjunction
with CNNs.
• This paper demonstrates the strength and versatility of Vision Transformers, showing
that they can be used for recognition and can even beat state-of-the-art CNNs.
METHODOLOGY (TRAINING)
An image drawn from a large dataset of images flows through the following steps (a shape walk-through is sketched below):
1. Take one image of shape (H x W x C).
2. Divide the image into patches of size (P x P).
3. Flatten into a 2-D sequence of shape N x P²·C.
4. Append a learnable class token: (N+1) x P²·C.
5. Add positional embeddings: (N+1) x P²·C.
6. Apply a linear projection to fit the Transformer width: (N+1) x D.
7. Pass the sequence through the Transformer Encoder.
8. Feed the class-token output to the MLP Head to predict the CLASS.
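As a quick illustration of these shapes, here is a minimal arithmetic sketch in Python for the paper's ViT-B/16 setting (H = W = 224, P = 16, C = 3, Transformer width D = 768); these values come from the paper, not from our CIFAR experiments, and for ViT-B/16 the flattened patch length P²·C happens to equal D.

    H = W = 224; C = 3; P = 16; D = 768
    N = (H // P) * (W // P)        # number of patches: 14 * 14 = 196
    flat = P * P * C               # flattened patch length: 16 * 16 * 3 = 768
    print(N, flat)                 # a sequence of 196 patch vectors of length 768
    print(N + 1, flat)             # 197 rows after appending the learnable class token
    print(N + 1, D)                # 197 x 768 after the linear projection to width D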
Transformer Encoder (the block below is repeated L times):
Embedded Patches → Norm → Multi-Head Attention → Norm → MLP
Testing
• The authors tested different variants of the Vision Transformer, with different patch
sizes, numbers of layers, and embedding dimensions, on datasets of different sizes:
ImageNet, JFT-300M, CIFAR-10/100, etc.
• The results of the Vision Transformer were compared with those of other architectures,
BiT (ResNet 152x4) and EfficientNet, under the same conditions.
• The models were also evaluated on the VTAB classification suite, consisting of 19 tasks
divided into Natural, Specialized, and Structured task groups.
• The authors also performed a preliminary exploration of masked patch prediction for
self-supervision.
FINAL OUTCOME
The authors experimented with the Transformer-based architecture under several conditions.
• As the amount of training data grows, the accuracy of ViT improves relative to ResNet (BiT). They
note that a plain Transformer encoder lacks the inductive biases of a CNN (translation
equivariance, locality); on larger datasets, learning the relevant patterns directly from data is
sufficient.
• Among the different variants of ViT, the larger variants perform better as the amount of
training data increases.
• Scaling the depth of the ViT architecture, and decreasing the patch size, both improve accuracy.
• Scaling the width of the ViT architecture has minimal effect.
• Vision Transformers dominate ResNets on the performance/compute trade-off.
• The hybrid architecture slightly outperforms ViT at small computational budgets, but the
difference vanishes for larger ones.
• Performance does not yet seem to be saturating with increased model size.
• In the VTAB evaluation, ViT dominates ResNet (BiT) in the Structured task group.
BACKGROUND CONCEPTS
Why do we need an attention mechanism?
In seq2seq models, popular for translation and image captioning, the fixed-length
context vector turned out to be a bottleneck: it made it challenging for these models
to deal with long sentences.
So, “Attention” was proposed which highly
improved the quality of machine translation
systems.
Attention allows the model to focus on the
relevant parts of the input sequence as needed.
Attention Mechanism
• All the vectors h1, h2, ..., used here are the concatenations of the forward and
backward hidden states of the encoder.
• The attention model computes a set of attention weights α(t,1), ..., α(t,t), because the
inputs do not all contribute equally to generating the corresponding output.
• The context vector c_i for the output word y_i is generated as a weighted sum of the
annotations (written out below this list).
• The attention weights are calculated by normalizing the output score of a feed-forward
neural network, described by a function a that captures the alignment between the input
at position j and the output at position i.
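Written out, the standard formulation these bullets describe (Bahdanau-style attention; s_{i-1} is the previous decoder state and T_x the input length) is:

\[
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j,
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},
\qquad
e_{ij} = a(s_{i-1}, h_j)
\]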
Transformer Encoder
• Like the LSTM, the Transformer is an architecture for transforming one sequence into
another with the help of two parts (an Encoder and a Decoder), but it differs from
existing seq2seq models in that it does not use any recurrent networks (GRU, LSTM, etc.).
• 3 main parts: Position Embeddings + Multi-Head Attention + Feed-forward Layers.
• Q (query) matrix: vector representation of one word in the sequence.
• K (keys) matrix: vector representations of all the words in the sequence.
• V (values) matrix: vector representations of all the words in the sequence. For the
encoder, V consists of the same word sequence as Q.
• Attention weights = softmax(QK^T / √d_k) (a small code sketch follows this list).
• These weights are defined by how each
word of the sequence (represented by Q)
is influenced by all the other words in the
sequence (represented by K).
• Those weights are then applied to all the
words in the sequence that are
introduced in V.
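A minimal sketch of the attention computation described above, assuming PyTorch; the function name and the shapes (Q, K of size seq_len x d_k, V of size seq_len x d_v) are illustrative choices, not taken from the paper's code.

    import math
    import torch

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # how each query is influenced by every key
        weights = torch.softmax(scores, dim=-1)              # attention weights; each row sums to 1
        return weights @ V                                    # weights applied to the values in V

    out = scaled_dot_product_attention(torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64))
    print(out.shape)                                          # torch.Size([5, 64])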
Multi-Head Attention
Input Sequence → Embed each word and add a positional embedding → Split into a number of heads and create learnable matrices → Multiply with the learnable matrices to get Q, K, V → Calculate the output of each head: softmax(QK^T / √d_k)·V → Concatenate the resulting matrices to produce the output (a code sketch follows below).
Multihead attention allows the model to
jointly attend to information from different
representation subspaces and hence
expands the model’s ability to focus on
different positions
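A hedged sketch of multi-head self-attention matching the flow above, again assuming PyTorch; the class name and the dimensions (embed_dim = 64, num_heads = 4) are illustrative assumptions.

    import torch
    import torch.nn as nn

    class MultiHeadAttention(nn.Module):
        def __init__(self, embed_dim=64, num_heads=4):
            super().__init__()
            assert embed_dim % num_heads == 0
            self.num_heads = num_heads
            self.head_dim = embed_dim // num_heads
            # Learnable matrices that produce Q, K, V for all heads at once
            self.q_proj = nn.Linear(embed_dim, embed_dim)
            self.k_proj = nn.Linear(embed_dim, embed_dim)
            self.v_proj = nn.Linear(embed_dim, embed_dim)
            self.out_proj = nn.Linear(embed_dim, embed_dim)

        def forward(self, x):                              # x: (batch, seq_len, embed_dim)
            B, T, E = x.shape
            def split(t):                                  # -> (batch, heads, seq_len, head_dim)
                return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
            Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
            scores = Q @ K.transpose(-2, -1) / self.head_dim ** 0.5
            out = torch.softmax(scores, dim=-1) @ V        # per-head attention output
            out = out.transpose(1, 2).reshape(B, T, E)     # concatenate the heads
            return self.out_proj(out)

    y = MultiHeadAttention()(torch.randn(2, 10, 64))
    print(y.shape)                                          # torch.Size([2, 10, 64])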
DATASETS
• Due to the limited compute available on Google Colab, we chose to train and test on
these two datasets:
• CIFAR 10
• CIFAR 100
CIFAR 10
• CIFAR - Canadian Institute For Advanced
Research.
• 60000 32x32 colour images in 10 classes, with
6000 images per class.
• 50000 training images and 10000 test images.
• The classes are: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
• CIFAR-10 is a labeled subset of the 80 million tiny
images dataset collected by Alex Krizhevsky,
Vinod Nair, and Geoffrey Hinton.
• Often used as a benchmark.
CIFAR 100
• CIFAR - Canadian Institute For Advanced Research.
• 60000 32x32 colour images in 100 classes, with 600
images per class.
• 500 training images and 100 test images, per class.
• The 100 classes in the CIFAR-100 are grouped into
20 superclasses. Each image comes with a "fine"
label (the class to which it belongs) and a "coarse"
label (the superclass to which it belongs).
• Example – for a superclass “flowers”, classes are –
“orchids, poppies, roses, sunflowers, tulips”.
• Often used as a benchmark.
IMPLEMENTATION DETAILS
Model Architecture
Vision Transformer Pseudo-Code
def ViT(input):
    patches = Create_Patches(input)
    patch_embed = Patch_Embedding(patches)
    sequence = Concat(class_token, patch_embed) + Position_embedding
    hidden_states = Transformer(sequence)
    class_output = Classification_Head(hidden_states[0])
    return class_output
This is the pseudo-code for the sequence of operations applied to an image to classify it
using the Vision Transformer model. A hedged, runnable sketch of these operations follows below.
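A minimal, self-contained sketch of the same sequence of operations, assuming a recent PyTorch (TransformerEncoderLayer with batch_first and norm_first). The class name MiniViT and the hyperparameters (dim = 256, depth = 8, heads = 8) are illustrative assumptions, not our exact training configuration.

    import torch
    import torch.nn as nn

    class MiniViT(nn.Module):
        def __init__(self, image_size=32, patch_size=4, num_classes=10,
                     dim=256, depth=8, heads=8, mlp_dim=512, channels=3):
            super().__init__()
            assert image_size % patch_size == 0
            num_patches = (image_size // patch_size) ** 2
            patch_dim = channels * patch_size * patch_size
            self.patch_size = patch_size
            self.patch_embed = nn.Linear(patch_dim, dim)            # linear projection to width D
            self.class_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
                                               batch_first=True, norm_first=True)  # pre-norm block
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(dim, num_classes)                 # classification head

        def forward(self, x):                                       # x: (B, C, H, W)
            B, C, H, W = x.shape
            p = self.patch_size
            x = x.unfold(2, p, p).unfold(3, p, p)                   # split H and W into P x P patches
            x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)   # (B, N, P*P*C)
            x = self.patch_embed(x)                                 # (B, N, D)
            cls = self.class_token.expand(B, -1, -1)
            x = torch.cat([cls, x], dim=1) + self.pos_embed         # class token + position embeddings
            x = self.encoder(x)                                     # Transformer Encoder
            return self.head(x[:, 0])                               # classify from the class token

    logits = MiniViT()(torch.randn(2, 3, 32, 32))
    print(logits.shape)                                             # torch.Size([2, 10])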
Hybrid Vision Transformer Pseudo-Code
def Hybrid_ViT(input):
    resnet_features = ResNet_Feature_Extractor(input)
    patches = Create_Patches(resnet_features)
    patch_embed = Patch_Embedding(patches)
    sequence = Concat(class_token, patch_embed) + Position_embedding
    hidden_states = Transformer(sequence)
    class_output = Classification_Head(hidden_states[0])
    return class_output
The difference between the hybrid and the plain Vision Transformer is that the input features are
first extracted by a pretrained ResNet34, truncated after its first residual stage, instead of the
raw image patches being fed directly to the Transformer. A sketch of such a feature extractor follows below.
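A hedged sketch of the hybrid front-end, assuming a recent torchvision (module names conv1, bn1, relu, maxpool, layer1 follow torchvision's resnet34; the weights string is the newer torchvision API). The resulting feature map would then be patched and fed to the Transformer as in the pseudo-code above.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet34

    backbone = resnet34(weights="IMAGENET1K_V1")        # pretrained ResNet34
    feature_extractor = nn.Sequential(                   # keep the stem and the first residual stage only
        backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool, backbone.layer1)

    with torch.no_grad():
        feats = feature_extractor(torch.randn(1, 3, 224, 224))
    print(feats.shape)                                   # torch.Size([1, 64, 56, 56])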
Major Components Implemented
• Vision Transformer implemented from
scratch which includes Multihead Attention,
FeedForward, Transformer and Classification
Head modules
• Hybrid Variant of Vision Transformer with
Pretrained ResNet features as input to the
Transformer
• Modular ResNet implemented from scratch
(ResNet34, ResNet50)
• Pretrained Vision Transformer using external
code to verify results from the paper
• Attention Map Visualization on Input images
• Visualization of Filter Embeddings
• Visualization of Position Embeddings
RESULTS
Attention Map Visualisation
Representative examples of attention from the output token to the input space
Patch Embedding
First 28 principal components of the initial RGB linear embedding filters of the pre-trained
ViT-B/16 model.
Position Embedding
The cosine-similarity matrix of the learnable position embeddings shows a clear locality pattern:
every 14th value is highly correlated, since the patch size (16) divides the 224x224 image into a
14x14 grid of patches. The very first position embedding is orthogonal to all the others since it
is reserved for the class token.
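As an illustration of how such a cosine-similarity matrix can be computed, here is a hedged sketch assuming PyTorch and a position-embedding tensor of shape (1, N+1, D), as in the MiniViT sketch earlier; the function name is our own.

    import torch
    import torch.nn.functional as F

    def position_similarity(pos_embed):
        # pos_embed: (1, N+1, D) learnable position embeddings, including the class-token slot
        p = F.normalize(pos_embed[0], dim=-1)            # unit-normalise each embedding
        return p @ p.T                                    # (N+1, N+1) cosine-similarity matrix

    sims = position_similarity(torch.randn(1, 197, 768))  # random weights here; a trained model shows the locality pattern
    print(sims.shape)                                      # torch.Size([197, 197])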
Results for Different Model Variations
Model Variation | CIFAR10 Train Accuracy (%) | CIFAR10 Test Accuracy (%) | CIFAR100 Train Accuracy (%) | CIFAR100 Test Accuracy (%)
Vision Transformer (12 layers, patch size 8), image size 32x32 | 64.3 | 57.2 | 62.1 | 37.6
Vision Transformer (12 layers, patch size 4), image size 32x32 | 82.1 | 71.3 | 83.8 | 40.7
Vision Transformer (8 layers, patch size 4), image size 32x32 | 80.2 | 71.9 | 83.5 | 43.8
Hybrid Vision Transformer (12 layers, patch size 7), image size 224x224 | 90.9 | 80.0 | 96.3 | 54.6
ResNet34 (from scratch), image size 224x224 | 98.4 | 92.2 | 98.6 | 69.4
Pretrained Vision Transformer (12 layers, patch size 16), image size 224x224 | 99.3 | 98.1 | 97.9 | 87.2
Inference from our Results
● The patch size in the Vision Transformer decides the length of the sequence. A lower patch size
leads to more information exchange during the self-attention mechanism. This is verified by the
better results using patch size 4 rather than 8 on a 32x32 image (see the arithmetic below this list).
● Increasing the number of layers of the Vision Transformer should ideally lead to better results,
but the results of the 8-layer model are marginally better than those of the 12-layer model, which
can be attributed to the small datasets used to train the models. Models with higher complexity
require more data to capture the image features.
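Concrete arithmetic for the first point: on a 32x32 image, patch size 4 yields a 64-token sequence for self-attention, while patch size 8 yields only 16 tokens.

    for p in (4, 8):
        print(p, (32 // p) ** 2)    # patch size 4 -> 64 tokens, patch size 8 -> 16 tokens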
Inference from our Results
● As noted in the paper, the Hybrid Vision Transformer performs better on small datasets than ViT,
as the initial ResNet features capture the lower-level features thanks to the locality of
convolutions, which a plain ViT cannot learn from the limited training data available.
● ResNets trained from scratch outperform both ViT and Hybrid-ViT trained from scratch due to
their inherent inductive biases of locality and translation invariance. These biases cannot be
learned by the ViT on small datasets.
● The pretrained ViT performs much better than the other methods because it was trained on huge
datasets and has therefore learned better representations than even the ResNet, since
self-attention can access global information right from the first layer, unlike a CNN.
Train vs Test Accuracy Graphs (CIFAR10)
[Plots shown for: ViT (12 layers, patch size 8), Hybrid ViT (12 layers, patch size 7), ViT (12 layers, patch size 4), and ResNet34.]
Train vs Test Accuracy Graphs (CIFAR100)
[Plots shown for: ViT (12 layers, patch size 8), Hybrid ViT (12 layers, patch size 7), ViT (12 layers, patch size 4), and ResNet34.]
CHALLENGES FACED
• We quote the research paper: “When trained on mid-sized datasets such as ImageNet, such models
yield modest accuracies of a few percentage points below ResNets of comparable size. However, the
picture changes if the models are trained on larger datasets.” We had the resources to train only
on CIFAR-10/100!
• Due to non-availability of powerful compute on Google Colab, the model could
not be trained on large datasets which is the first and the foremost requirement
of this architecture to produce very high accuracies. Due to this limitation, we
could not produce accuracies from scratch as mentioned in the paper.
• Hyperparameter tuning was one of the most complicated tasks for getting good results; without
careful tuning the model did not converge, kept oscillating within a range of accuracies, and
eventually diverged on the more complex CIFAR-100 dataset.
• Some of the important datasets on which the authors trained their model were not publicly
available, so we had to restrict ourselves to small, public datasets that could easily be trained
on in Google Colab. These private datasets curated by Google Research have not been released yet.
• The model was not able to learn inherent structures of images, such as locality and translation
invariance, when gradients changed too quickly (smaller batch sizes or higher learning rates).
FUTURE SCOPE
• Due to the lack of more powerful computing resources, the model could not be trained on large
datasets, which is the foremost requirement of this architecture for producing very high
accuracies. Because of this limitation, our from-scratch implementation could not reproduce the
accuracies reported in the paper.
• Exploring self-supervision: Due to time constraints, we could not
explore masked patch prediction for self-supervision, mimicking the
masked language modeling task used in BERT.
• Evaluating the model on VTAB classification suite.
• Different Attention mechanisms could be explored that take the 2D
structure of images into account.
LEARNING OUTCOME
• This project gave us the opportunity to get a basic understanding of how to implement a
research paper from scratch.
• It also served as a necessary and comprehensive evaluation component for the learning outcomes
of this course.
• Until this project, we believed that CNNs were the state of the art for Computer Vision tasks,
especially image recognition. After reading this very recent literature, we learned that
Transformers can be used for the same tasks with much better results. The Transformer is a more
general architecture with weaker inductive biases, and given enough data it can even outperform CNNs.
• We got an opportunity to apply the knowledge of basics of different architectures taught
in class, and experience how tasks like hyperparameter tuning can be so complicated as
compared to theory.
• Lastly, this project came with some additional benefits: teamwork, developing the habit of
writing documented code, collaboration, and good presentation skills.