SlideShare a Scribd company logo
How Transformers Are
Changing the Nature of
Deep Learning Models
Tom Michiels
Principal System Architect
Synopsys
• The surprising rise of transformers in vision
• The structure of attention and transformer
• Transformers applied to vision
• Why transformers are here to stay for vision
Outline
© 2023 Synopsys 2
CNNs Dominating Vision Tasks Since 2012
2012 ... 2014 2015 2016 2017 2018 2019 2020 2021 2022
AlexNet VGG ResNet MobileNet V1 MobileNet V2
EfficientNet
MobileNet v3
Image Classification
© 2023 Synopsys 3
2012 ... 2014
AlexNet VGG ResNet MobileNet V1 MobileNet V2
EfficientNet
MobileNet v3
RCNN
FRCNN
SSD
YOLOV2 Mask RCNN YoloV3
YoloV4/V5
EfficientDet
Object Detection
© 2023 Synopsys 4
2015 2016 2017 2018 2019 2020 2021 2022
CNNs Dominating Vision Tasks Since 2012
AlexNet VGG ResNet MobileNet V1 MobileNet V2
EfficientNet
MobileNet v3
RCNN
FRCNN
SSD
YOLOV2 Mask RCNN YoloV3
YoloV4/V5
EfficientDet
FCN
DeepLab
SegNet DSNet
Semantic Segmentation
© 2023 Synopsys 5
2012 ... 2014 2015 2016 2017 2018 2019 2020 2021 2022
CNNs Dominating Vision Tasks Since 2012
2012 ... 2014
AlexNet VGG ResNet MobileNet V1 MobileNet V2
EfficientNet
MobileNet v3
RCNN
FRCNN
SSD
YOLOV2 Mask RCNN YoloV3
YoloV4/V5
EfficientDet
FCN
DeepLab
SegNet DSNet
DeeperLab
PFPN
EfficientPS
YoloP
Panoptic Vision
© 2023 Synopsys 6
2015 2016 2017 2018 2019 2020 2021 2022
CNNs Dominating Vision Tasks Since 2012
A Decade of CNN Development…
Residual
Connection
© 2023 Synopsys 7
Inverted Residual
Blocks
Beaten in Accuracy by Transformers
© 2023 Synopsys 8
Is this the real life ? Is this just fantasy ?
Transformer, a model designed for natural
language processing
… without any modifications applied to image patches,
beats the highly specialized CNNs in accuracy
Accuracy Records on ImageNet
9
© 2023 Synopsys
45%
50%
55%
60%
65%
70%
75%
80%
85%
90%
95%
11/2010 4/2012 8/2013 12/2014 5/2016 9/2017 2/2019 6/2020 10/2021 3/2023
Ancient
(classical CV)
Medieval
(CNNs)
Modern
(Transformers)
Sift+FV’s
AlexNet
VGG16
ResNet152
NasNet
Meta Pseudo Labels
(EfficientNet-L2)
ResNeXt-101
ViT-G/14 CoAtNet-7
CoCa (finetuned)
Model soups
State-of-the-art Top-1 Accuracy ImageNet, entering a new era?
50%
63.3%
74.4%
90.20%
90.88%
90.45%
90.98%
91.00%
78.57%
Transformers in Other Vision Tasks
10
© 2023 Synopsys
Semantic Segmentation on ADE20K Object Detection on COCO test-dev
0
10
20
30
40
50
60
70
11/2015 11/2017 11/2019 11/2021
0
10
20
30
40
50
60
70
04/2015 04/2017 04/2019 04/2021
Faster R-CNN
36.2
State-of-the-Art of other Vision tasks are dominated by transformers
58.7
SWIN-L
YOLOv4-P7
55.8
PSPNet
44.95
DRAN(ResNet-101)
46.1
SwinV2-G
59.9
But These State-of-the-Art Models Are Huge!
0
500
1000
1500
2000
2500
78.00%
80.00%
82.00%
84.00%
86.00%
88.00%
90.00%
92.00%
Is the state-of-the-art really relevant for embedded applications?
accuracy
© 2023 Synopsys 11
Number
of
parameters
(M)
Compact Transformers versus CNNs
12
© 2023 Synopsys
source MobileViTv3 https://arxiv.org/abs/2209.15159
Number of FLOPs
Mobile ViT: Small Mobile
(Paper by Apple, March 2022)
https://arxiv.org/pdf/2110.02178.pdf
2.3X
FLOPs
+1.5%
Accuracy
2.4X
Time
7.9X
Time
Inference Time (ms)
CPU/NNE = 8.1X
CPU/NNE = 2.5X
• Observations in paper
– On embedded devices (iPhone) MobileViT is slower than CNN based methods
– Because the AI accelerator on iPhone is not as optimized for transformers as it is for CNNs
– The authors expect that future AI accelerators will better support transformers
0.7X
Model
Size
© 2023 Synopsys 13
The Structure of Attention
and Transformer
© 2023 Synopsys 14
• Attention is all you need!(*)
• Bidirectional Encoder Representations from
Transformers
• A transformer is a deep learning model that uses
attention mechanism
• Transformers were primarily used for natural
language processing
• Translation
• Question answering
• Conversational AI
• Successful training of huge transformers
• MTM, GPT-3, T5, ALBERT, RoBERTa, T5, Switch
• Transformers are successfully applied in other
application domains with promising results for
embedded use
Bert and Transformers
(*) https://arxiv.org/abs/1706.03762
© 2023 Synopsys 15
Convolutions, Feed Forward, and
Multi-Head Attention
16
© 2023 Synopsys
1x1
Convolution
Add
3x3
Convolution
Add
CNN
Transformer
• The feed forward layer of the transformer
is identical to a 1x1 convolution
• In this part of the model, no information
is flowing between tokens/pixels
• Multi-head attention and 3x3 convolution
layers are the layers responsible for
mixing information between tokens/pixels
Convolutions as Hard-Coded Attention
Both Convolution and Attention Networks mix in features of other tokens/pixels
Convolution Attention
Convolutions mix in features from tokens based on fixed spatial location
Attention mix in features from tokens based on learned attention
© 2023 Synopsys 17
The Structure of a Transformer: Attention
Multi-Head Attention
Attention: Mix in features of other tokens
Is this the real life ?
Is
this
the
real
life
?
© 2023 Synopsys 18
The Structure of a Transformer: Attention
Multi-Head Attention
Attention: Mix in features of other tokens
© 2023 Synopsys 19
The Structure of a Transformer: Attention
NxN
matrix
N: number of tokens
Multi-Head Attention
© 2023 Synopsys 20
+
The Structure of a Transformer: Embedding
Embedding of input tokens and the positional encoding
Is this the real life ? Is this just fantasy ?
embedding vectors
position encoding
elementwise addition
© 2023 Synopsys 21
Applying Transformers to Vision Tasks
© 2023 Synopsys 22
Vision Transformers (ViT/L16 or ViT-G/14)
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale(*)
Image is split into tiles
(*) https://arxiv.org/abs/2010.11929
Linear Projection
Multi-Head Attention
Feed Forward
Add & Norm
Add & Norm
+
Pixels in a tile are
flattened into tokens (vectors)
that feed in the transformer
N
x
Vision transformers are best-
known method for image
classification
They are beating convolutional
neural networks in accuracy
and training time, but not in
inference time
Position encoding
© 2023 Synopsys 23
Vision Transformer  Increasing Resolution
Linear Projection
Multi-Head Attention
Feed Forward
Add & Norm
Add & Norm
+
N
x
Position encoding
Attention matrix scales quadratically
with the number of patches
N x N matrix
Where N = the number
of tokens/patches
© 2023 Synopsys 24
Swin Transformers
Hierarchical Vision Transformer Using Shifted Windows (*)
Adaptation makes transformers
scale for larger images:
1. Shifted window attention
2. Patch-merging
State of the art for
• Object detection (COCO)
• Semantic segmentation
(ADE20K)
(*) https://arxiv.org/abs/2103.14030
© 2023 Synopsys 25
Patch-Merging
Action Classification with Transformers
Video Swin Transformer
https://arxiv.org/abs/2106.13230
Video Swin Transformers extend the (shifted)
window to three dimensions (2D spatial + time)
Today’s state of the art on Kinetics-400 and
Kinetics-600
© 2023 Synopsys 26
Object Detection with Transformers
27
© 2023 Synopsys
End-to-End Object Detection with Transformers (Facebook 2020)
https://arxiv.org/abs/2005.12872
DETR uses a CNN (ResNet-50) as a backbone
Off-the-shelf transformer encoder and decoder
Trained Object Queries retrieve possible candidates for objects
Training Vision Transformers
28
© 2023 Synopsys
• More data required to train a transformer to overcome the lack
of inductive bias of convolution
• Vision Transformers take significantly less training time than
comparable CNN’s
• Self-supervised Pre-Training for Vision Transformers
“cat”
Supervised learning Self-Supervised learning
Why Attention and Transformers
are Here to Stay for Vision
Inductive Bias of CNNs
Recognizing Cat Fur Recognizing a whole Cat
The inductive bias of a convolution, is more helpful
for recognizing low level features like cat fur than for
recognizing more complex objects like cats
Convolution use the same weights
for processing every pixel
© 2023 Synopsys 30
Inductive Bias of CNNs and Sensor Fusion
31
© 2023 Synopsys
Camera front Left Camera Front Camera Front Right Lidar TOP
• Many real-life applications use multiple sensors
• Transformers excel in sensor fusion
• The inductive bias of CNNs designed for grid-like
structures limits their effectiveness in sensor fusion
Fused in a single object
Panoptic Segmentation Using CNNs
Panoptic Segmentation combines predictions from both instance and semantic segmentation in a unified output
Image source: https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg
Model Used: Detectron2 - COCO-PanopticSegmentation/panoptic_fpn_R_101_3x
© 2023 Synopsys
32
Panoptic
FPN_ResNet101_3x
Multimodal Transformers
LLaVA Language and
© 2023 Synopsys 33
User: What is unusual about this image?
LLaVA: The unusual aspect of the image is that
a man is ironing clothes on the back of a yellow
minivan while it is on the road. This is an
unconventional and unsafe place to perform
such an activity, as ironing clothes typically
requires a stable surface and appropriate
equipment. Ironing clothes in a moving vehicle
could lead to potential hazards for both the
person doing the ironing and other road users.
PN_ResNet101_3x
https://arxiv.org/pdf/2304.08485.pdf
How Much Intelligence Do We Need?
34
© 2023 Synopsys
Panoptic Segmentation Multi-modal language/vision
How much (artificial) intelligence is required for real-life applications?
User: What is unusual about this image?
LLaVA: The unusual aspect of the image is
that a man is ironing clothes on the back of
a yellow minivan while it is on the road. This
is an unconventional and unsafe place to
perform such an activity, as ironing clothes
typically requires a stable surface and
appropriate equipment. ….
• Attention-based networks outperform CNN-only networks on accuracy
• Highest accuracy required for high-end applications
• Models that combine vision transformers with convolutions are more efficient at
inference
• Examples: MobileViT(*), CoAtNet(**)
• Real-life vision: demands beyond CNN inductive bias
• Scene understanding needs common-sense knowledge that may not be
learned by vision alone
• Sensor fusion: complex geometrical mappings are ill-suited for CNN bias
Why Transformers Are Here to Stay in Vision
(*) https://arxiv.org/abs/2110.02178
(**) https://arxiv.org/abs/2106.04803v2
35
© 2023 Synopsys
• Transformers are deep learning models primarily used in the field of NLP
• Transformers lead to state-of-the-art results in other application domains of deep
learning like vision and speech
• They can be applied to other domains with surprisingly little modifications
• Models that combine attention and convolutions outperform convolutional
neural networks on vision tasks, even for small models
• Transformers and attention for vision applications are here to stay
• Real world applications require knowledge that is not easily captured with
convolutions
Summary
© 2023 Synopsys 36
Resources
Visit Synopsys Booth 309
• Partner demo: Visionary.ai True Night Vision SW ISP
• Meet with Synopsys executives and experts
• Discuss emerging neural network architectures like
transformers and vision/object detection for safety-critical
automotive SoCs
• Learn about the latest in practical technology to bring
visual intelligence into embedded systems, mobile apps,
cars, and PCs
Resources
ARC NPX6 NPU IP
www.synopsys.com/npx
© 2023 Synopsys 37

More Related Content

Similar to “How Transformers Are Changing the Nature of Deep Learning Models,” a Presentation from Synopsys

Computational steering Interactive Design-through-Analysis for Simulation Sci...
Computational steering Interactive Design-through-Analysis for Simulation Sci...Computational steering Interactive Design-through-Analysis for Simulation Sci...
Computational steering Interactive Design-through-Analysis for Simulation Sci...
SURFevents
 
Review On Different Feature Extraction Algorithms
Review On Different Feature Extraction AlgorithmsReview On Different Feature Extraction Algorithms
Review On Different Feature Extraction Algorithms
IRJET Journal
 
Garbage Classification Using Deep Learning Techniques
Garbage Classification Using Deep Learning TechniquesGarbage Classification Using Deep Learning Techniques
Garbage Classification Using Deep Learning Techniques
IRJET Journal
 
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentéeInria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
Stéphanie Roger
 
MPEG-Immersive 3DoF+: 360 Video Streaming for Virtual Reality
MPEG-Immersive 3DoF+: 360 Video Streaming for Virtual RealityMPEG-Immersive 3DoF+: 360 Video Streaming for Virtual Reality
MPEG-Immersive 3DoF+: 360 Video Streaming for Virtual Reality
mcslgachon
 
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
Edge AI and Vision Alliance
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
Sangmin Woo
 
Archon VR deck
Archon VR deckArchon VR deck
Archon VR deck
Giovanni Landi
 
SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...
SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...
SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...
Kitsukawa Yuki
 
Lane and Object Detection for Autonomous Vehicle using Advanced Computer Vision
Lane and Object Detection for Autonomous Vehicle using Advanced Computer VisionLane and Object Detection for Autonomous Vehicle using Advanced Computer Vision
Lane and Object Detection for Autonomous Vehicle using Advanced Computer Vision
YogeshIJTSRD
 
Video Stitching using Improved RANSAC and SIFT
Video Stitching using Improved RANSAC and SIFTVideo Stitching using Improved RANSAC and SIFT
Video Stitching using Improved RANSAC and SIFT
IRJET Journal
 
“Efficient Video Perception Through AI,” a Presentation from Qualcomm
“Efficient Video Perception Through AI,” a Presentation from Qualcomm“Efficient Video Perception Through AI,” a Presentation from Qualcomm
“Efficient Video Perception Through AI,” a Presentation from Qualcomm
Edge AI and Vision Alliance
 
EMERGENCY VEHICLE SOUND DETECTION SYSTEMS IN TRAFFIC CONGESTION
EMERGENCY VEHICLE SOUND DETECTION SYSTEMS IN TRAFFIC CONGESTIONEMERGENCY VEHICLE SOUND DETECTION SYSTEMS IN TRAFFIC CONGESTION
EMERGENCY VEHICLE SOUND DETECTION SYSTEMS IN TRAFFIC CONGESTION
IRJET Journal
 
kanimozhi2019.pdf
kanimozhi2019.pdfkanimozhi2019.pdf
kanimozhi2019.pdf
AshrafDabbas1
 
A Video Processing based System for Counting Vehicles
A Video Processing based System for Counting VehiclesA Video Processing based System for Counting Vehicles
A Video Processing based System for Counting Vehicles
IRJET Journal
 
Car Steering Angle Prediction Using Deep Learning
Car Steering Angle Prediction Using Deep LearningCar Steering Angle Prediction Using Deep Learning
Car Steering Angle Prediction Using Deep Learning
IRJET Journal
 
Dataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsDataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problems
PetteriTeikariPhD
 
Human Action Recognition in Videos
Human Action Recognition in VideosHuman Action Recognition in Videos
Human Action Recognition in Videos
IRJET Journal
 

Similar to “How Transformers Are Changing the Nature of Deep Learning Models,” a Presentation from Synopsys (20)

Computational steering Interactive Design-through-Analysis for Simulation Sci...
Computational steering Interactive Design-through-Analysis for Simulation Sci...Computational steering Interactive Design-through-Analysis for Simulation Sci...
Computational steering Interactive Design-through-Analysis for Simulation Sci...
 
Review On Different Feature Extraction Algorithms
Review On Different Feature Extraction AlgorithmsReview On Different Feature Extraction Algorithms
Review On Different Feature Extraction Algorithms
 
Garbage Classification Using Deep Learning Techniques
Garbage Classification Using Deep Learning TechniquesGarbage Classification Using Deep Learning Techniques
Garbage Classification Using Deep Learning Techniques
 
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentéeInria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
 
MPEG-Immersive 3DoF+: 360 Video Streaming for Virtual Reality
MPEG-Immersive 3DoF+: 360 Video Streaming for Virtual RealityMPEG-Immersive 3DoF+: 360 Video Streaming for Virtual Reality
MPEG-Immersive 3DoF+: 360 Video Streaming for Virtual Reality
 
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
 
Portfolio
PortfolioPortfolio
Portfolio
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
 
Archon VR deck
Archon VR deckArchon VR deck
Archon VR deck
 
SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...
SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...
SkyStitch: a Cooperative Multi-UAV-based Real-time Video Surveillance System ...
 
Lane and Object Detection for Autonomous Vehicle using Advanced Computer Vision
Lane and Object Detection for Autonomous Vehicle using Advanced Computer VisionLane and Object Detection for Autonomous Vehicle using Advanced Computer Vision
Lane and Object Detection for Autonomous Vehicle using Advanced Computer Vision
 
Video Stitching using Improved RANSAC and SIFT
Video Stitching using Improved RANSAC and SIFTVideo Stitching using Improved RANSAC and SIFT
Video Stitching using Improved RANSAC and SIFT
 
“Efficient Video Perception Through AI,” a Presentation from Qualcomm
“Efficient Video Perception Through AI,” a Presentation from Qualcomm“Efficient Video Perception Through AI,” a Presentation from Qualcomm
“Efficient Video Perception Through AI,” a Presentation from Qualcomm
 
EMERGENCY VEHICLE SOUND DETECTION SYSTEMS IN TRAFFIC CONGESTION
EMERGENCY VEHICLE SOUND DETECTION SYSTEMS IN TRAFFIC CONGESTIONEMERGENCY VEHICLE SOUND DETECTION SYSTEMS IN TRAFFIC CONGESTION
EMERGENCY VEHICLE SOUND DETECTION SYSTEMS IN TRAFFIC CONGESTION
 
kanimozhi2019.pdf
kanimozhi2019.pdfkanimozhi2019.pdf
kanimozhi2019.pdf
 
A Video Processing based System for Counting Vehicles
A Video Processing based System for Counting VehiclesA Video Processing based System for Counting Vehicles
A Video Processing based System for Counting Vehicles
 
Car Steering Angle Prediction Using Deep Learning
Car Steering Angle Prediction Using Deep LearningCar Steering Angle Prediction Using Deep Learning
Car Steering Angle Prediction Using Deep Learning
 
Task 2
Task 2Task 2
Task 2
 
Dataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsDataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problems
 
Human Action Recognition in Videos
Human Action Recognition in VideosHuman Action Recognition in Videos
Human Action Recognition in Videos
 

More from Edge AI and Vision Alliance

“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
"OpenCV for High-performance, Low-power Vision Applications on Snapdragon," a...
"OpenCV for High-performance, Low-power Vision Applications on Snapdragon," a..."OpenCV for High-performance, Low-power Vision Applications on Snapdragon," a...
"OpenCV for High-performance, Low-power Vision Applications on Snapdragon," a...
Edge AI and Vision Alliance
 
“Deploying Large Models on the Edge: Success Stories and Challenges,” a Prese...
“Deploying Large Models on the Edge: Success Stories and Challenges,” a Prese...“Deploying Large Models on the Edge: Success Stories and Challenges,” a Prese...
“Deploying Large Models on the Edge: Success Stories and Challenges,” a Prese...
Edge AI and Vision Alliance
 
“Scaling Vision-based Edge AI Solutions: From Prototype to Global Deployment,...
“Scaling Vision-based Edge AI Solutions: From Prototype to Global Deployment,...“Scaling Vision-based Edge AI Solutions: From Prototype to Global Deployment,...
“Scaling Vision-based Edge AI Solutions: From Prototype to Global Deployment,...
Edge AI and Vision Alliance
 
“What’s Next in On-device Generative AI,” a Presentation from Qualcomm
“What’s Next in On-device Generative AI,” a Presentation from Qualcomm“What’s Next in On-device Generative AI,” a Presentation from Qualcomm
“What’s Next in On-device Generative AI,” a Presentation from Qualcomm
Edge AI and Vision Alliance
 
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
Edge AI and Vision Alliance
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
Edge AI and Vision Alliance
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
Edge AI and Vision Alliance
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
Edge AI and Vision Alliance
 
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
Edge AI and Vision Alliance
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
Edge AI and Vision Alliance
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...
Edge AI and Vision Alliance
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
Edge AI and Vision Alliance
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
Edge AI and Vision Alliance
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
Edge AI and Vision Alliance
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
Edge AI and Vision Alliance
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
Edge AI and Vision Alliance
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
Edge AI and Vision Alliance
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
Edge AI and Vision Alliance
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
Edge AI and Vision Alliance
 

More from Edge AI and Vision Alliance (20)

“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
"OpenCV for High-performance, Low-power Vision Applications on Snapdragon," a...
"OpenCV for High-performance, Low-power Vision Applications on Snapdragon," a..."OpenCV for High-performance, Low-power Vision Applications on Snapdragon," a...
"OpenCV for High-performance, Low-power Vision Applications on Snapdragon," a...
 
“Deploying Large Models on the Edge: Success Stories and Challenges,” a Prese...
“Deploying Large Models on the Edge: Success Stories and Challenges,” a Prese...“Deploying Large Models on the Edge: Success Stories and Challenges,” a Prese...
“Deploying Large Models on the Edge: Success Stories and Challenges,” a Prese...
 
“Scaling Vision-based Edge AI Solutions: From Prototype to Global Deployment,...
“Scaling Vision-based Edge AI Solutions: From Prototype to Global Deployment,...“Scaling Vision-based Edge AI Solutions: From Prototype to Global Deployment,...
“Scaling Vision-based Edge AI Solutions: From Prototype to Global Deployment,...
 
“What’s Next in On-device Generative AI,” a Presentation from Qualcomm
“What’s Next in On-device Generative AI,” a Presentation from Qualcomm“What’s Next in On-device Generative AI,” a Presentation from Qualcomm
“What’s Next in On-device Generative AI,” a Presentation from Qualcomm
 
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
 
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
 

Recently uploaded

UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 

Recently uploaded (20)

UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 

“How Transformers Are Changing the Nature of Deep Learning Models,” a Presentation from Synopsys

  • 1. How Transformers Are Changing the Nature of Deep Learning Models Tom Michiels Principal System Architect Synopsys
  • 2. • The surprising rise of transformers in vision • The structure of attention and transformer • Transformers applied to vision • Why transformers are here to stay for vision Outline © 2023 Synopsys 2
  • 3. CNNs Dominating Vision Tasks Since 2012 2012 ... 2014 2015 2016 2017 2018 2019 2020 2021 2022 AlexNet VGG ResNet MobileNet V1 MobileNet V2 EfficientNet MobileNet v3 Image Classification © 2023 Synopsys 3
  • 4. 2012 ... 2014 AlexNet VGG ResNet MobileNet V1 MobileNet V2 EfficientNet MobileNet v3 RCNN FRCNN SSD YOLOV2 Mask RCNN YoloV3 YoloV4/V5 EfficientDet Object Detection © 2023 Synopsys 4 2015 2016 2017 2018 2019 2020 2021 2022 CNNs Dominating Vision Tasks Since 2012
  • 5. AlexNet VGG ResNet MobileNet V1 MobileNet V2 EfficientNet MobileNet v3 RCNN FRCNN SSD YOLOV2 Mask RCNN YoloV3 YoloV4/V5 EfficientDet FCN DeepLab SegNet DSNet Semantic Segmentation © 2023 Synopsys 5 2012 ... 2014 2015 2016 2017 2018 2019 2020 2021 2022 CNNs Dominating Vision Tasks Since 2012
  • 6. 2012 ... 2014 AlexNet VGG ResNet MobileNet V1 MobileNet V2 EfficientNet MobileNet v3 RCNN FRCNN SSD YOLOV2 Mask RCNN YoloV3 YoloV4/V5 EfficientDet FCN DeepLab SegNet DSNet DeeperLab PFPN EfficientPS YoloP Panoptic Vision © 2023 Synopsys 6 2015 2016 2017 2018 2019 2020 2021 2022 CNNs Dominating Vision Tasks Since 2012
  • 7. A Decade of CNN Development… Residual Connection © 2023 Synopsys 7 Inverted Residual Blocks
  • 8. Beaten in Accuracy by Transformers © 2023 Synopsys 8 Is this the real life ? Is this just fantasy ? Transformer, a model designed for natural language processing … without any modifications applied to image patches, beats the highly specialized CNNs in accuracy
  • 9. Accuracy Records on ImageNet 9 © 2023 Synopsys 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 11/2010 4/2012 8/2013 12/2014 5/2016 9/2017 2/2019 6/2020 10/2021 3/2023 Ancient (classical CV) Medieval (CNNs) Modern (Transformers) Sift+FV’s AlexNet VGG16 ResNet152 NasNet Meta Pseudo Labels (EfficientNet-L2) ResNeXt-101 ViT-G/14 CoAtNet-7 CoCa (finetuned) Model soups State-of-the-art Top-1 Accuracy ImageNet, entering a new era? 50% 63.3% 74.4% 90.20% 90.88% 90.45% 90.98% 91.00% 78.57%
  • 10. Transformers in Other Vision Tasks 10 © 2023 Synopsys Semantic Segmentation on ADE20K Object Detection on COCO test-dev 0 10 20 30 40 50 60 70 11/2015 11/2017 11/2019 11/2021 0 10 20 30 40 50 60 70 04/2015 04/2017 04/2019 04/2021 Faster R-CNN 36.2 State-of-the-Art of other Vision tasks are dominated by transformers 58.7 SWIN-L YOLOv4-P7 55.8 PSPNet 44.95 DRAN(ResNet-101) 46.1 SwinV2-G 59.9
  • 11. But These State-of-the-Art Models Are Huge! 0 500 1000 1500 2000 2500 78.00% 80.00% 82.00% 84.00% 86.00% 88.00% 90.00% 92.00% Is the state-of-the-art really relevant for embedded applications? accuracy © 2023 Synopsys 11 Number of parameters (M)
  • 12. Compact Transformers versus CNNs 12 © 2023 Synopsys source MobileViTv3 https://arxiv.org/abs/2209.15159 Number of FLOPs
  • 13. Mobile ViT: Small Mobile (Paper by Apple, March 2022) https://arxiv.org/pdf/2110.02178.pdf 2.3X FLOPs +1.5% Accuracy 2.4X Time 7.9X Time Inference Time (ms) CPU/NNE = 8.1X CPU/NNE = 2.5X • Observations in paper – On embedded devices (iPhone) MobileViT is slower than CNN based methods – Because the AI accelerator on iPhone is not as optimized for transformers as it is for CNNs – The authors expect that future AI accelerators will better support transformers 0.7X Model Size © 2023 Synopsys 13
  • 14. The Structure of Attention and Transformer © 2023 Synopsys 14
  • 15. • Attention is all you need!(*) • Bidirectional Encoder Representations from Transformers • A transformer is a deep learning model that uses attention mechanism • Transformers were primarily used for natural language processing • Translation • Question answering • Conversational AI • Successful training of huge transformers • MTM, GPT-3, T5, ALBERT, RoBERTa, T5, Switch • Transformers are successfully applied in other application domains with promising results for embedded use Bert and Transformers (*) https://arxiv.org/abs/1706.03762 © 2023 Synopsys 15
  • 16. Convolutions, Feed Forward, and Multi-Head Attention 16 © 2023 Synopsys 1x1 Convolution Add 3x3 Convolution Add CNN Transformer • The feed forward layer of the transformer is identical to a 1x1 convolution • In this part of the model, no information is flowing between tokens/pixels • Multi-head attention and 3x3 convolution layers are the layers responsible for mixing information between tokens/pixels
  • 17. Convolutions as Hard-Coded Attention Both Convolution and Attention Networks mix in features of other tokens/pixels Convolution Attention Convolutions mix in features from tokens based on fixed spatial location Attention mix in features from tokens based on learned attention © 2023 Synopsys 17
  • 18. The Structure of a Transformer: Attention Multi-Head Attention Attention: Mix in features of other tokens Is this the real life ? Is this the real life ? © 2023 Synopsys 18
  • 19. The Structure of a Transformer: Attention Multi-Head Attention Attention: Mix in features of other tokens © 2023 Synopsys 19
  • 20. The Structure of a Transformer: Attention NxN matrix N: number of tokens Multi-Head Attention © 2023 Synopsys 20
  • 21. + The Structure of a Transformer: Embedding Embedding of input tokens and the positional encoding Is this the real life ? Is this just fantasy ? embedding vectors position encoding elementwise addition © 2023 Synopsys 21
  • 22. Applying Transformers to Vision Tasks © 2023 Synopsys 22
  • 23. Vision Transformers (ViT/L16 or ViT-G/14) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale(*) Image is split into tiles (*) https://arxiv.org/abs/2010.11929 Linear Projection Multi-Head Attention Feed Forward Add & Norm Add & Norm + Pixels in a tile are flattened into tokens (vectors) that feed in the transformer N x Vision transformers are best- known method for image classification They are beating convolutional neural networks in accuracy and training time, but not in inference time Position encoding © 2023 Synopsys 23
  • 24. Vision Transformer  Increasing Resolution Linear Projection Multi-Head Attention Feed Forward Add & Norm Add & Norm + N x Position encoding Attention matrix scales quadratically with the number of patches N x N matrix Where N = the number of tokens/patches © 2023 Synopsys 24
  • 25. Swin Transformers Hierarchical Vision Transformer Using Shifted Windows (*) Adaptation makes transformers scale for larger images: 1. Shifted window attention 2. Patch-merging State of the art for • Object detection (COCO) • Semantic segmentation (ADE20K) (*) https://arxiv.org/abs/2103.14030 © 2023 Synopsys 25 Patch-Merging
  • 26. Action Classification with Transformers Video Swin Transformer https://arxiv.org/abs/2106.13230 Video Swin Transformers extend the (shifted) window to three dimensions (2D spatial + time) Today’s state of the art on Kinetics-400 and Kinetics-600 © 2023 Synopsys 26
  • 27. Object Detection with Transformers 27 © 2023 Synopsys End-to-End Object Detection with Transformers (Facebook 2020) https://arxiv.org/abs/2005.12872 DETR uses a CNN (ResNet-50) as a backbone Off-the-shelf transformer encoder and decoder Trained Object Queries retrieve possible candidates for objects
  • 28. Training Vision Transformers 28 © 2023 Synopsys • More data required to train a transformer to overcome the lack of inductive bias of convolution • Vision Transformers take significantly less training time than comparable CNN’s • Self-supervised Pre-Training for Vision Transformers “cat” Supervised learning Self-Supervised learning
  • 29. Why Attention and Transformers are Here to Stay for Vision
  • 30. Inductive Bias of CNNs Recognizing Cat Fur Recognizing a whole Cat The inductive bias of a convolution, is more helpful for recognizing low level features like cat fur than for recognizing more complex objects like cats Convolution use the same weights for processing every pixel © 2023 Synopsys 30
  • 31. Inductive Bias of CNNs and Sensor Fusion 31 © 2023 Synopsys Camera front Left Camera Front Camera Front Right Lidar TOP • Many real-life applications use multiple sensors • Transformers excel in sensor fusion • The inductive bias of CNNs designed for grid-like structures limits their effectiveness in sensor fusion Fused in a single object
  • 32. Panoptic Segmentation Using CNNs Panoptic Segmentation combines predictions from both instance and semantic segmentation in a unified output Image source: https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg Model Used: Detectron2 - COCO-PanopticSegmentation/panoptic_fpn_R_101_3x © 2023 Synopsys 32 Panoptic FPN_ResNet101_3x
  • 33. Multimodal Transformers LLaVA Language and © 2023 Synopsys 33 User: What is unusual about this image? LLaVA: The unusual aspect of the image is that a man is ironing clothes on the back of a yellow minivan while it is on the road. This is an unconventional and unsafe place to perform such an activity, as ironing clothes typically requires a stable surface and appropriate equipment. Ironing clothes in a moving vehicle could lead to potential hazards for both the person doing the ironing and other road users. PN_ResNet101_3x https://arxiv.org/pdf/2304.08485.pdf
  • 34. How Much Intelligence Do We Need? 34 © 2023 Synopsys Panoptic Segmentation Multi-modal language/vision How much (artificial) intelligence is required for real-life applications? User: What is unusual about this image? LLaVA: The unusual aspect of the image is that a man is ironing clothes on the back of a yellow minivan while it is on the road. This is an unconventional and unsafe place to perform such an activity, as ironing clothes typically requires a stable surface and appropriate equipment. ….
  • 35. • Attention-based networks outperform CNN-only networks on accuracy • Highest accuracy required for high-end applications • Models that combine vision transformers with convolutions are more efficient at inference • Examples: MobileViT(*), CoAtNet(**) • Real-life vision: demands beyond CNN inductive bias • Scene understanding needs common-sense knowledge that may not be learned by vision alone • Sensor fusion: complex geometrical mappings are ill-suited for CNN bias Why Transformers Are Here to Stay in Vision (*) https://arxiv.org/abs/2110.02178 (**) https://arxiv.org/abs/2106.04803v2 35 © 2023 Synopsys
  • 36. • Transformers are deep learning models primarily used in the field of NLP • Transformers lead to state-of-the-art results in other application domains of deep learning like vision and speech • They can be applied to other domains with surprisingly little modifications • Models that combine attention and convolutions outperform convolutional neural networks on vision tasks, even for small models • Transformers and attention for vision applications are here to stay • Real world applications require knowledge that is not easily captured with convolutions Summary © 2023 Synopsys 36
  • 37. Resources Visit Synopsys Booth 309 • Partner demo: Visionary.ai True Night Vision SW ISP • Meet with Synopsys executives and experts • Discuss emerging neural network architectures like transformers and vision/object detection for safety-critical automotive SoCs • Learn about the latest in practical technology to bring visual intelligence into embedded systems, mobile apps, cars, and PCs Resources ARC NPX6 NPU IP www.synopsys.com/npx © 2023 Synopsys 37