
“How Transformers are Changing the Direction of Deep Learning Architectures,” a Presentation from Synopsys

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/08/how-transformers-are-changing-the-direction-of-deep-learning-architectures-a-presentation-from-synopsys/

Tom Michiels, System Architect for DesignWare ARC Processors at Synopsys, presents the “How Transformers are Changing the Direction of Deep Learning Architectures” tutorial at the May 2022 Embedded Vision Summit.

The neural network architectures used in embedded real-time applications are evolving quickly. Transformers are a leading deep learning approach for natural language processing and other applications that process time-dependent, sequential data. Now, transformer-based deep learning network architectures are also being applied to vision applications with state-of-the-art results compared to CNN-based solutions.

In this presentation, Michiels introduces transformers and contrasts them with the CNNs commonly used for vision tasks today. He examines the key features of transformer model architectures and shows performance comparisons between transformers and CNNs. He concludes the presentation with insights on why Synopsys thinks transformers are an important approach for future visual perception tasks.

“How Transformers are Changing the Direction of Deep Learning Architectures,” a Presentation from Synopsys

  1. How Transformers are Changing the Direction of Deep Learning Architectures. Tom Michiels, System Architect, Synopsys.
  2. Outline: The Surprising Rise of Transformers in Vision • The Structure of Attention and Transformers • Transformers Applied to Vision and Other Application Domains • Why Transformers are Here to Stay for Vision.
  3. CNNs Have Dominated Many Vision Tasks Since 2012: image classification timeline (2012-2022) with AlexNet, VGG, ResNet, MobileNet V1, MobileNet V2, EfficientNet, MobileNet V3.
  4. CNNs Have Dominated Many Vision Tasks Since 2012: the object detection timeline adds RCNN, FRCNN, SSD, YOLOv2, Mask RCNN, YOLOv3, YOLOv4/v5, EfficientDet.
  5. CNNs Have Dominated Many Vision Tasks Since 2012: the semantic segmentation timeline adds FCN, DeepLab, SegNet, DSNet.
  6. CNNs Have Dominated Many Vision Tasks Since 2012: the panoptic vision timeline adds DeeperLab, PFPN, EfficientPS, YoloP.
  7. A Decade of CNN Development: residual connections and inverted residual blocks.
  8. Beaten in Accuracy by Transformers: the Transformer, a model designed for natural language processing, applied without any modification to image patches, beats highly specialized CNNs in accuracy. (The slide's running example sentence: "Is this the real life? Is this just fantasy?")
  9. The Structure of Attention and Transformer.
  10. BERT and Transformers: "Attention Is All You Need" (https://arxiv.org/abs/1706.03762); BERT = Bidirectional Encoder Representations from Transformers. A Transformer is a deep learning model that uses the attention mechanism. Transformers were primarily used for natural language processing: translation, question answering, conversational AI. Huge transformers have been trained successfully (MTM, GPT-3, T5, ALBERT, RoBERTa, Switch), and Transformers are successfully applied in other application domains with promising results for embedded use.
  11. Convolutions, Feed Forward, and Multi-Head Attention: the feed-forward layer of a Transformer is identical to a 1x1 convolution; in this part of the model, no information flows between tokens/pixels. The multi-head attention and 3x3 convolution layers are the layers responsible for mixing information between tokens/pixels (see the sketch below).
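The equivalence claimed on slide 11 can be checked directly. This is a minimal PyTorch sketch; the layer sizes and the 14x14 token grid are illustrative assumptions, not values from the presentation:

```python
# A Transformer feed-forward layer applied per token is numerically the same
# operation as a pair of 1x1 convolutions applied per pixel.
import torch
import torch.nn as nn

d_model, d_ff = 64, 256
x = torch.randn(1, 196, d_model)                 # (batch, tokens, features), 14x14 = 196 tokens

ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
conv = nn.Sequential(nn.Conv2d(d_model, d_ff, kernel_size=1), nn.GELU(),
                     nn.Conv2d(d_ff, d_model, kernel_size=1))

# Reuse the feed-forward weights as 1x1 convolution kernels.
conv[0].weight.data = ffn[0].weight.data[..., None, None].clone()
conv[0].bias.data = ffn[0].bias.data.clone()
conv[2].weight.data = ffn[2].weight.data[..., None, None].clone()
conv[2].bias.data = ffn[2].bias.data.clone()

x_img = x.transpose(1, 2).reshape(1, d_model, 14, 14)            # tokens laid out as pixels
out_ffn = ffn(x)
out_conv = conv(x_img).reshape(1, d_model, 196).transpose(1, 2)
print(torch.allclose(out_ffn, out_conv, atol=1e-5))              # True: same operation per token
```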
  12. Convolutions as Hard-Coded Attention: both convolution and attention networks mix in features of other tokens/pixels. Convolutions mix in features from tokens based on fixed spatial location; attention mixes in features from tokens based on learned attention weights.
  13. The Structure of a Transformer: Attention. Multi-head attention mixes in features of other tokens (illustrated on the example sentence "Is this the real life?").
  14. The Structure of a Transformer: Attention. Multi-head attention: mix in features of other tokens.
  15. The Structure of a Transformer: Attention. The attention weights form an NxN matrix, where N is the number of tokens.
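Slides 13-15 describe attention as a learned mixing of token features through an NxN matrix. The following is a minimal single-head, scaled dot-product sketch of that idea; dimensions and variable names are illustrative assumptions:

```python
# Single-head scaled dot-product attention: every output token is a
# learned, weighted mix of all input tokens.
import torch
import torch.nn as nn

d_model, N = 64, 6                        # N tokens, e.g. "Is this the real life ?"
x = torch.randn(1, N, d_model)            # (batch, tokens, features)

w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

q, k, v = w_q(x), w_k(x), w_v(x)
scores = q @ k.transpose(-2, -1) / d_model ** 0.5   # (1, N, N): the NxN attention matrix
weights = scores.softmax(dim=-1)                    # each row sums to 1
out = weights @ v                                   # mix in features of other tokens
print(weights.shape, out.shape)                     # (1, 6, 6) (1, 6, 64)
```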
  16. The Structure of a Transformer: Embedding. Input tokens are mapped to embedding vectors, and the positional encoding is added elementwise (example input: "Is this the real life? Is this just fantasy?").
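A minimal sketch of the embedding step on slide 16. The vocabulary size, token ids, and the use of a learned positional table are illustrative assumptions (the original Transformer paper uses sinusoidal position encodings):

```python
# Token embedding plus positional encoding, combined by elementwise addition.
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 32, 64
token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(max_len, d_model)         # learned positions (assumption)

token_ids = torch.tensor([[5, 12, 7, 42, 9, 3]])            # hypothetical ids for the 6 tokens
positions = torch.arange(token_ids.size(1)).unsqueeze(0)    # 0, 1, 2, ...

x = token_embedding(token_ids) + position_embedding(positions)
print(x.shape)   # torch.Size([1, 6, 64]): one position-aware vector per token
```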
  17. Other Application Domains: Vision, Action Recognition, Speech Recognition.
  18. Vision Transformers (ViT-L/16, ViT-G/14). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (https://arxiv.org/abs/2010.11929). The image is split into tiles; the pixels in each tile are flattened into tokens (vectors) that, after a linear projection and position encoding, feed into a standard Transformer encoder (multi-head attention, feed forward, add & norm, repeated N times). At the time of publication, Vision Transformers were the best-known method for image classification, beating convolutional neural networks in accuracy and training time, but not in inference time. (A patch-embedding sketch follows below.)
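The tiling and flattening step can be sketched in a few lines. Patch size, image size, and embedding width below are illustrative assumptions in the spirit of "an image is worth 16x16 words":

```python
# Split an image into 16x16 tiles, flatten each tile, linearly project it into
# a token embedding, and add a position encoding per patch.
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)     # (batch, channels, height, width)
patch, d_model = 16, 768
num_patches = (224 // patch) ** 2     # 14 * 14 = 196 tokens

# unfold extracts non-overlapping 16x16 tiles and flattens each one to a vector.
tiles = nn.functional.unfold(img, kernel_size=patch, stride=patch)   # (1, 3*16*16, 196)
tiles = tiles.transpose(1, 2)                                        # (1, 196, 768)

projection = nn.Linear(3 * patch * patch, d_model)    # the "linear projection" box
pos = nn.Parameter(torch.zeros(1, num_patches, d_model))

tokens = projection(tiles) + pos      # ready to feed into the Transformer encoder
print(tokens.shape)                   # torch.Size([1, 196, 768])
```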
  19. Vision Transformer with increasing resolution: the attention matrix scales quadratically with the number of patches, because it is an N x N matrix, where N is the number of tokens/patches.
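A quick back-of-the-envelope calculation illustrates the quadratic growth described on slide 19; the resolutions and the 16-pixel patch size are illustrative:

```python
# Number of tokens and attention-matrix entries for a few input resolutions with
# 16x16 patches: doubling the resolution quadruples N and grows N*N by 16x.
for side in (224, 448, 896):
    n = (side // 16) ** 2                      # number of tokens/patches
    print(f"{side}x{side}: N = {n}, N*N = {n * n:,}")
# 224x224: N = 196, N*N = 38,416
# 448x448: N = 784, N*N = 614,656
# 896x896: N = 3136, N*N = 9,834,496
```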
  20. Swin Transformers: "Hierarchical Vision Transformer using Shifted Windows" (https://arxiv.org/abs/2103.14030). Two adaptations make Transformers scale to larger images: (1) shifted window attention and (2) patch merging. State of the art for object detection (COCO) and semantic segmentation (ADE20K). (A patch-merging sketch follows below.)
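The patch-merging step can be sketched as follows; the shapes are illustrative assumptions and the window-attention part is omitted for brevity:

```python
# Swin-style patch merging: each 2x2 group of neighboring patches is concatenated
# along the channel axis (4C features) and linearly reduced to 2C features,
# halving the spatial resolution at each stage.
import torch
import torch.nn as nn

B, H, W, C = 1, 56, 56, 96
x = torch.randn(B, H, W, C)                       # patch grid with C features per patch

x = x.reshape(B, H // 2, 2, W // 2, 2, C)         # expose the 2x2 neighborhoods
x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H // 2, W // 2, 4 * C)

reduction = nn.Linear(4 * C, 2 * C)               # merge 4C -> 2C
x = reduction(x)
print(x.shape)                                    # torch.Size([1, 28, 28, 192])
```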
  21. Action Classification with Transformers: Video Swin Transformer (https://arxiv.org/abs/2106.13230). Video Swin Transformers extend the (shifted) window to three dimensions (2D spatial + time); today's state of the art on Kinetics-400 and Kinetics-600.
  22. Action Classification with Transformers: "Is Space-Time Attention All You Need for Video Understanding?" (https://arxiv.org/abs/2102.05095). Transformers can be applied directly to video: as with ViT, the video frames are split into tiles that feed directly into the Transformer. Applying attention separately over time and over space ("divided attention") gave state-of-the-art results on the Kinetics-400 and Kinetics-600 benchmarks at the time of publication (see the sketch below).
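The "divided attention" idea can be sketched by running attention twice over a reshaped token grid, once along time and once along space. All shapes and module choices below are illustrative assumptions, not the paper's exact architecture:

```python
# Divided space-time attention: temporal attention (each patch attends across
# frames at the same location), then spatial attention (each frame's patches
# attend to each other), instead of one joint attention over all T*P tokens.
import torch
import torch.nn as nn

B, T, P, D = 1, 8, 196, 64          # batch, frames, patches per frame, features
x = torch.randn(B, T, P, D)

temporal_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
spatial_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

# Time attention: fold patches into the batch, attend over the T frames.
xt = x.permute(0, 2, 1, 3).reshape(B * P, T, D)
xt, _ = temporal_attn(xt, xt, xt)
x = xt.reshape(B, P, T, D).permute(0, 2, 1, 3)

# Space attention: fold frames into the batch, attend over the P patches.
xs = x.reshape(B * T, P, D)
xs, _ = spatial_attn(xs, xs, xs)
x = xs.reshape(B, T, P, D)
print(x.shape)                       # torch.Size([1, 8, 196, 64])
```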
  23. Speech Recognition. "Conformer: Convolution-augmented Transformer for Speech Recognition" (https://arxiv.org/abs/2005.08100). Conformers are Transformers with an additional convolution module; the convolution module contains a pointwise and a depthwise (1D, kernel size 31) convolution. Compared to RNNs, LSTMs, depthwise-convolution networks, and plain Transformers, Conformers give an excellent accuracy-to-size ratio, and the best-known methods for speech recognition (LibriSpeech) are based on Conformers. (Chart on the slide: word error rate (%) versus number of parameters (M) for hybrid, Transformer, and Conformer models; a sketch of the convolution module follows below.)
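A rough sketch of such a convolution module; the channel count and normalization choices are assumptions following the paper's description, simplified:

```python
# Conformer-style convolution module: pointwise convolution with GLU, then a
# depthwise 1D convolution (kernel size 31), then another pointwise convolution.
import torch
import torch.nn as nn

d_model, kernel = 144, 31
conv_module = nn.Sequential(
    nn.Conv1d(d_model, 2 * d_model, kernel_size=1),        # pointwise, expands for GLU
    nn.GLU(dim=1),                                          # gated linear unit
    nn.Conv1d(d_model, d_model, kernel_size=kernel,
              padding=kernel // 2, groups=d_model),         # depthwise, 1D, size 31
    nn.BatchNorm1d(d_model),
    nn.SiLU(),                                              # Swish activation
    nn.Conv1d(d_model, d_model, kernel_size=1),             # pointwise projection
)

frames = torch.randn(1, d_model, 200)        # (batch, features, time frames)
out = frames + conv_module(frames)           # residual connection around the module
print(out.shape)                             # torch.Size([1, 144, 200])
```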
  24. Why Attention and Transformers are Here to Stay for Vision.
  25. Visual Perception Beyond Segmentation and Object Detection: future applications such as security cameras, personal assistants, and storage retrieval require a deeper understanding of the world, which points toward merging NLP and vision using the same knowledge-representation backend. Today: panoptic segmentation. From 2022 onward: answering "What is happening in this scene?"
  26. Tesla AI Day: Using Transformers to Make Predictions in Vector Space. A convolutional neural network extracts features for every camera; a transformer is then used to fuse the multiple cameras and make predictions directly in bird's-eye-view vector space.
  27. Why Transformers are Here to Stay in Vision: attention-based networks outperform CNN-only networks on accuracy, and the highest accuracy is required for high-end applications. Models that combine Vision Transformers with convolutions are more efficient at inference; examples are MobileViT (https://arxiv.org/abs/2110.02178) and CoAtNet (https://arxiv.org/abs/2106.04803v2). Full visual perception requires knowledge that may not easily be acquired by vision alone: multi-modal learning is required for a deeper understanding of visual information, and applications integrating multiple sensors benefit from attention-based networks.
  28. Summary: Transformers are deep learning models primarily used in the field of NLP, but they lead to state-of-the-art results in other application domains of deep learning such as vision and speech, and they can be applied to these domains with surprisingly few modifications. Models that combine attention and convolutions outperform convolutional neural networks on vision tasks, even for small models. Transformers and attention for vision applications are here to stay: real-world applications require knowledge that is not easily captured with convolutions.
  29. Resources. Join the Synopsys Deep Dive "Optimize AI Performance & Power for Tomorrow's Neural Network Applications" (Thursday, 12-3 PM). Synopsys demos in booth 719: Executing Transformer Neural Networks in ARC NPX6 NPU IP; Driver Management System on ARC EV Processor IP with Visidon; Neural Network-Enhanced Radar Processing on ARC VPX5 DSP with SensorCortek. Further resources: arXiv.org (https://arxiv.org/abs/1706.03762); ARC NPX6 NPU IP (www.synopsys.com/arc).
