How Arm’s Machine Learning Solution Enables Vision Transformers at the Edge
Stephen Su
Sr. Segment Marketing Manager
Arm Inc.
Transformer Background
• What is a transformer? See Ref. [1]: Vaswani et al., “Attention Is All You Need,” NIPS 2017
• A highly scalable network architecture based on self-attention
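To make the self-attention building block concrete, here is a minimal NumPy sketch of scaled dot-product self-attention as defined in Ref. [1]; the token count, embedding size, and random weights are illustrative placeholders, not values from the presentation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(q.shape[-1])        # pairwise dependencies between positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                  # row-wise softmax
    return w @ v                                   # each output is a weighted mix of values

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                    # 4 tokens with 8-d embeddings
out = self_attention(x, *(rng.standard_normal((8, 8)) for _ in range(3)))
print(out.shape)                                   # (4, 8)
```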
Why Transformers?
• Potentially a unified architecture for text, audio, and image
• Transformer-based models perform outstandingly in natural language processing (NLP) and computer vision (CV)
• They support a wide range of use cases: not only image classification but also super resolution, segmentation, object detection, and much more
Transformers in Vision Applications
• While CNNs build in inductive biases such as locality and translation equivariance, transformers use self-attention to capture dependencies across the input sequence
• Hence, transformer-based models are more extensible: they work well in video understanding, image completion, multi-camera, and multi-modal domains
Challenges in Deploying Transformer Models at the Edge
• Hardware is fragmented, ranging from CPU-only systems to CPU + GPU, CPU + accelerator, and other combinations
  • What is the most suitable hardware solution for transformers?
• Efficiency is another challenge
  • How do you run transformer models with high power efficiency and low latency?
• Model size and memory usage
  • We need a toolset (with tutorials) that compresses models to a size that can be deployed at the edge
Arm Machine Learning Solution Supporting Vision Transformers
Introducing the Next-Generation Arm NPU: What Makes It Attractive?
• Higher power efficiency: targeting 20% over the current generation
• Increased performance: configurations from 128 MACs/cycle to 2048 MACs/cycle
• Extended operator support: hardware-accelerated transformer network support
• Double MAC throughput for 2/4 sparse layers
New Hardware Operators Accelerate Transformer Networks
• In addition to the operators already supported by the original Ethos product family, the latest Arm Ethos-U85 adds native hardware support for transformer networks and the DeepLabV3 semantic segmentation network, including these operators:
  • TRANSPOSE
  • GATHER
  • MATMUL
  • RESIZE BILINEAR
  • ARGMAX
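To suggest why these particular operators matter, here is a hedged NumPy sketch of where each one appears in a typical ViT / DeepLabV3-style pipeline; the shapes and the nearest-neighbor stand-in for bilinear resizing are illustrative assumptions, not details from the slide.

```python
import numpy as np

x = np.random.rand(196, 64)                    # 196 patch tokens, 64-d embeddings
pos = np.random.rand(196, 64)[np.arange(196)]  # GATHER: position-embedding lookup
scores = (x + pos) @ (x + pos).T               # TRANSPOSE + MATMUL: attention scores Q.K^T

logits = np.random.rand(14, 14, 21)            # coarse per-pixel class scores
up = logits.repeat(16, 0).repeat(16, 1)        # stand-in for RESIZE BILINEAR upsampling
print(up.argmax(-1).shape)                     # ARGMAX: (224, 224) segmentation map
```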
Arm Example Subsystem
• Pre-integrated and verified machine learning solution
[Block diagram: a Cortex-M85 CPU, Ethos-U85 NPU, DMA-350, and Mali-C55 (Arm IPs) connected over an interconnect to memory and peripherals (non-Arm IPs).]
How to Use Ethos-U85 in a System
• End Point AI (Cortex-M based system): Cortex-M and Ethos-U85 share an interconnect with system SRAM and system flash
• ML Island (Cortex-A based system): a Cortex-M + Ethos-U85 island sits alongside a multi-core Cortex-A cluster, with DRAM, system SRAM, and system flash
• Discrete NPU (Cortex-A only): Ethos-U85 attaches directly to a multi-core Cortex-A cluster, with DRAM, system SRAM, and system flash
[Block diagrams distinguish Arm IPs from non-Arm IPs in each configuration.]
Software Flow on Arm Machine Learning Solution
• Cortex-M CPU with Ethos-U85
• Host (offline): a TF framework model passes through TF quantization tooling and the TFLite converter to produce a TFL flatbuffer file, which the NN optimizer then compiles for the NPU; see the sketch after this list
• Target/device: the TFLite Micro (TFLu) runtime dispatches to reference kernels and CMSIS-NN optimized kernels on the Cortex-M CPU, and through the Ethos-U85 driver to the Ethos-U85 NPU
[Block diagram: all components shown are Arm IPs.]
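As a hedged sketch of this offline host flow (not the exact scripts from the presentation), the steps map to the TFLite converter’s post-training int8 quantization followed by compilation with the Vela tool; `saved_model_dir` and `calibration_images` are placeholders for your own model and data, and the exact accelerator-config string is an assumption that depends on your Vela release and MAC configuration.

```python
import tensorflow as tf

def representative_dataset():
    # A few hundred calibration samples are typical; placeholder data source.
    for image in calibration_images:
        yield [image[None].astype("float32")]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8           # fully integer model for the NPU
converter.inference_output_type = tf.int8
with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())

# The NN-optimizer step then runs offline, e.g. (flag value is an assumption):
#   vela model_int8.tflite --accelerator-config ethos-u85-256
```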
Software Flow on Arm Machine Learning Solution
• Cortex-M + Cortex-A system
• Host (offline): the same flow as before; a TF framework model, TF quantization tooling, and the TFLite converter produce a TFL flatbuffer file, which the NN optimizer compiles
• Target/device: a wrapper app on the Cortex-A (running Linux, with cache and MMU) calls an inference API into a subsystem driver; the Cortex-M (with cache and MPU) runs the TFLiteµ runtime and the Ethos-U85 driver, which drives the Ethos-U85 NPU over an AXI bus; the .tflite flatbuffer lives in an NPU carveout in DRAM, reserved at boot time and separated from the Linux/OS-managed area by an address filter, with SRAM as on-chip working memory
[Block diagram: Arm IPs highlighted.]
Software Flow on Arm Machine Learning Solution
• Cortex-A based system
• Host (offline): the same flow; a TF framework model, TF quantization tooling, and the TFLite converter produce a TFL flatbuffer file, which the NN optimizer compiles
• Target/device: the application on the Cortex-A (running Linux, with cache and MMU) uses a TFLite delegate and an NPU driver to drive the Ethos-U85 NPU over an AXI bus; the NPU carveout in DRAM is reserved at boot time and separated from the Linux/OS-managed area by an address filter, with SRAM as on-chip working memory
[Block diagram: Arm IPs highlighted.]
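A hedged sketch of driving the NPU from a Cortex-A application through a TFLite external delegate; the delegate library name `libethosu_delegate.so` and the model filename are assumptions for illustration, not confirmed by the presentation.

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load an external delegate that routes supported subgraphs to the NPU driver.
delegate = tflite.load_delegate("libethosu_delegate.so")   # assumed library name
interpreter = tflite.Interpreter(
    model_path="model_int8_vela.tflite",                   # Vela-compiled flatbuffer
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()                                        # delegate dispatches to Ethos-U85
out = interpreter.get_output_details()[0]
print(interpreter.get_tensor(out["index"]).shape)
```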
Arm Toolset Enables the Efficient Implementation of Transformers on Ethos
• Flow: data and the use case feed a Model Searcher, which produces a trained model (or one is taken from the Model Zoo); the Model Compressor applies weight clustering and quantization to yield an optimized model; the Arm Vela compiler compiles it, and the result is integrated with the application and deployed to the device (see the sketch after this list)
• Arm transformer tutorials: Jupyter notebooks (.ipynb) showing how to quantize and compress transformer encoder and encoder-decoder models
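As a hedged sketch of the Model Compressor’s weight-clustering step, the TensorFlow Model Optimization toolkit provides a clustering API; whether the Arm tutorials use exactly this API, and the cluster count of 16, are assumptions. `model` and `train_data` are placeholders for your trained Keras model and data.

```python
import tensorflow_model_optimization as tfmot

cluster_weights = tfmot.clustering.keras.cluster_weights
CentroidInit = tfmot.clustering.keras.CentroidInitialization

clustered = cluster_weights(
    model,                                    # placeholder: your trained Keras model
    number_of_clusters=16,                    # each weight tensor keeps 16 unique values
    cluster_centroids_init=CentroidInit.KMEANS_PLUS_PLUS,
)
clustered.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
clustered.fit(train_data, epochs=1)           # brief fine-tune to recover accuracy
final = tfmot.clustering.keras.strip_clustering(clustered)  # export-ready model
# `final` then goes through int8 quantization and the Arm Vela compiler as above.
```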
Vision Transformer Example Implementation
DeiT-Tiny Runs on Ethos-U85
• The demo compares how much faster the latest Ethos-U85 runs a transformer network than the previous Ethos-U generation: on Ethos-U85 the transformer operators run natively, with no fallback
• Setup: two Arm MPS3 boards, one with a previous Ethos-U and one with the latest Ethos-U85, each classifying the same images (a hummingbird) while execution speed is measured and the output displayed
Up to 8X Acceleration in Inference Time
• Side-by-side results: the previous Ethos vs. the latest Ethos-U85
• For more details, please visit the Arm booth, #409
Summary
• Machine learning (ML) is everywhere, and its landscape is evolving from CNNs to transformer-based models
• Arm just launched the latest NPU in the Arm Ethos product family to extend support for accelerating transformers at the edge
• Finally, “Edge AI runs on Arm.”
Resources
Please visit Arm booth #409 at the 2024 Embedded Vision Summit for more demos:
• “The Newly Launched Arm Ethos-U85 NPU”
• “Renesas RZ/V2H: Quad-core Cortex-A55 Vision AI MPU”
• “Arm-Himax, the High-efficiency Embedded Computer Vision”
Arm Ethos-U product page: https://www.arm.com/products/silicon-ip-cpu?families=ethos%20npus
Arm transformer tutorials: https://github.com/ARM-software/ML-zoo/tree/master/tutorials/transformer_tutorials
Arm keyword-transformer: https://github.com/ARM-software/keyword-transformer
Reference
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.
Thank You