For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/07/hate-it-or-love-it-your-neural-network-software-stack-defines-application-performance-and-reach-a-presentation-from-qualcomm/
Felix Baum, Director of Product Management at Qualcomm, presents the “Hate It Or Love It, Your Neural Network Software Stack Defines Application Performance and Reach” tutorial at the May 2021 Embedded Vision Summit.
Qualcomm has invested extensively in its state-of-the-art neural network software stack, including the Qualcomm Neural Processing SDK and the AI Model Efficiency Toolkit. These toolkits have helped many developers take their applications to the next level with extreme efficiency and performance.
In this talk, Baum explains some of the most common misconceptions related to AI software stacks and frameworks, and how key attributes of these tools directly and indirectly impact application performance, the ability of developers with diverse skill levels to use the tools, and the ability to scale applications across processor performance tiers. If you are interested in improving your application's performance or incorporating machine learning into a performance-sensitive application, this talk is for you.
“Hate It Or Love It, Your Neural Network Software Stack Defines Application Performance and Reach,” a Presentation from Qualcomm
1. Hate it or love it, your SW stack defines application performance and reach
Felix Baum, Director, Product Management, Qualcomm Technologies Inc.
2. Personas and scenarios
Lisa: an expert developer and seasoned ML warrior who needs to squeeze out all the performance the hardware offers.
Ben: a novice developer, not very sophisticated at ML, for whom scalability is more important than performance.
3. Building an Application: Misconceptions
• Misconception: ML in the cloud and on an edge device is similar. Reality: nothing could be further from it.
• Misconception: all runtime frameworks offer the same performance and flexibility. Reality: runtime frameworks differ in cadence, range and performance.
• Misconception: quantization is hard and offers little benefit. Reality: tools are available that offer users more than 6x improvements in performance and power.
5.-6. Adding ML to your Application
[Diagram: an ML algorithm is trained in FP32 with a training framework (e.g., Cognitive Toolkit) and handed to the app through a runtime framework, still in FP32; the runtime choices are left as question marks.]
7.-8. What if it is not accurate enough?
[Diagram: the same FP32 pipeline, now with a hardware-acceleration layer beneath the runtime frameworks, still running FP32. Callout: "Why not INT8?"]
9. Models are trained at high precision, but inference can run at lower precision:
• 32-bit floating point (e.g., 3452.3194): the precision models are trained at
• 16-bit integer (e.g., 3452): up to 4x increase in performance per watt from savings in memory and compute
• 8-bit integer (e.g., 255): up to 16x increase in performance per watt from savings in memory and compute
• 4-bit integer (e.g., 15): up to 64x increase in performance per watt from savings in memory and compute
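The precision ladder above can be made concrete with a minimal sketch of post-training affine quantization. `quantize_int8` and `dequantize` are illustrative helpers written for this example, not part of any Qualcomm SDK:

```python
# Minimal sketch of post-training affine (asymmetric) quantization of an
# FP32 tensor to 8-bit, illustrating the memory savings described above.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map FP32 values to uint8 with a per-tensor scale and zero point."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 or 1.0  # guard against constant tensors
    zero_point = round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate FP32 values from the quantized tensor."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)

print(weights.nbytes // q.nbytes)                # 4x smaller in memory
print(np.abs(weights - restored).max() < scale)  # error bounded by one step
```

This only shows the storage side of the 16x claim; the rest comes from integer arithmetic being cheaper per operation than floating point on fixed-point accelerators.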
10. Adding more data types – need tools to make it seamless
[Diagram: quantization, pruning and optimization tools convert the FP32 model into INT8/INT16/FP16/FP32 variants before it reaches the runtime frameworks and hardware acceleration.]
11. What if it is not flexible enough?
[Diagram: the same stack, with the runtime framework layer still a row of question marks. Callout: "Where is the choice?"]
12. What if it is not flexible enough?
• Cadence: some runtime frameworks have a yearly release cadence, while others release monthly
• Range: some frameworks offer a wider range of supported operators
• Performance: even if a runtime claims support for an operator, that does not always mean it is accelerated
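The range and performance bullets above can be sketched as a simple capability check: before committing to a runtime, verify that every operator in your model is not just supported but actually accelerated. The runtime names and operator coverage below are hypothetical placeholders, not real coverage data:

```python
# Hypothetical operator-coverage check across candidate runtime frameworks.
from dataclasses import dataclass

@dataclass
class Runtime:
    name: str
    supported_ops: set    # operators the runtime can run at all
    accelerated_ops: set  # the subset that actually runs on the accelerator

MODEL_OPS = {"conv2d", "depthwise_conv2d", "hard_swish", "softmax"}

# Illustrative coverage only; real coverage varies by runtime and release.
runtimes = [
    Runtime("runtime_a",
            {"conv2d", "depthwise_conv2d", "softmax"},
            {"conv2d", "depthwise_conv2d"}),
    Runtime("runtime_b",
            MODEL_OPS | {"lstm"},
            MODEL_OPS - {"hard_swish"}),
]

for rt in runtimes:
    missing = MODEL_OPS - rt.supported_ops  # model won't convert at all
    cpu_fallback = (MODEL_OPS & rt.supported_ops) - rt.accelerated_ops
    print(rt.name, "missing:", sorted(missing),
          "falls back to CPU:", sorted(cpu_fallback))
```

A model that converts cleanly on one runtime may silently fall back to CPU for key layers on another, which is exactly the "supported but not accelerated" trap the slide warns about.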
13. Selecting a runtime framework that fits
[Diagram: the runtime framework layer now shows concrete choices: Qualcomm Neural Processing SDK, PyTorch Mobile, NNAPI, TF-Lite and ONNX RT, sitting between the quantized model variants (INT8/INT16/FP16/FP32) and hardware acceleration.]
Qualcomm Neural Processing SDK is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
14. What if it is not flexible enough?
[Diagram: the same stack with runtime choices in place; the callout now asks, "Why so slow?"]
15. What if it is not fast enough?
[Diagram: accelerator blocks such as HTP and the audio subsystem appear under hardware acceleration, and the "Why so slow?" question points at them.]
16. What if it is not fast enough?
• Not all accelerators support all data types
• Operator parity is not guaranteed
• Not all accelerators deliver the same performance
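One way to picture these accelerator differences is a capability table with preference-ordered placement. HTP appears on the slides; the GPU/CPU entries, the supported data types, and the `place_model` helper are all illustrative assumptions:

```python
# Hypothetical capability table: which data types each compute block supports.
# Coverage shown here is illustrative, not real hardware data.
ACCELERATORS = {
    "HTP": {"int8", "int16"},         # fixed-point tensor accelerator
    "GPU": {"fp16", "fp32"},
    "CPU": {"fp32", "fp16", "int8"},  # always available, usually slowest
}

def place_model(dtype: str, preference=("HTP", "GPU", "CPU")) -> str:
    """Pick the first compute block in preference order that supports dtype."""
    for accel in preference:
        if dtype in ACCELERATORS[accel]:
            return accel
    raise ValueError(f"no accelerator supports {dtype}")

print(place_model("int8"))  # lands on the fixed-point accelerator
print(place_model("fp32"))  # falls through to a float-capable block
```

The same model can therefore land on very different silicon depending on its data type, which is why quantizing to INT8 is often the difference between running on an accelerator and falling back to slower blocks.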
17. Selecting a runtime framework that fits
[Diagram: alongside the high-level runtimes, the Hexagon Processor SDK exposes a low-level neural network library and compiler paths such as Halide and TVM that target accelerator blocks like HTP and the audio subsystem directly.]
18. Selecting a runtime framework that fits
[Diagram: the completed stack, from training frameworks through quantization, pruning and optimization tools to runtime frameworks, the low-level library and the accelerators.]
19. Key Takeaways
• To achieve your application's full potential, you need a software stack tailored specifically to what you are looking to accomplish
• Different models require specific tools that only customizable stacks will offer
• Not all applications are built the same way; your software stack will determine how well your application performs
Qualcomm and Hexagon are trademarks or registered trademarks of Qualcomm Incorporated.