2. Outline
● Challenges of edge AI workloads
● Introduction to AIMET
● Post-training quantization (PTQ) techniques
● Quantization-aware training (QAT)
● Quantization Simulation
3. Challenges of edge AI workloads
● Power and thermal efficiency are essential for on-device AI
1. Limited Resources:
Edge devices typically have limited computational power, memory, and energy resources compared with cloud servers and desktop PCs.
2. Latency and Real-time Processing:
Edge AI often requires real-time or near-real-time processing to enable its applications.
4. Methods to improve model performance on edge devices
● Model Quantization:
Reduce bit precision while keeping the desired accuracy
● Model Compression:
Compress the model size while keeping the desired accuracy
● Neural Architecture Search:
Design smaller neural networks suited to real hardware
5. Model Quantization
[Figure: precision spectrum INT8 → FP16 → FP32; inference speed increases toward INT8, accuracy increases toward FP32]
● INT8 is faster than FP32, but it sacrifices accuracy during inference (the sketch below makes this rounding error concrete)
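As a minimal, self-contained illustration (plain PyTorch, not AIMET code), the sketch below quantizes an FP32 tensor to INT8 with an affine scale and zero point, then measures the rounding error that is the source of the accuracy loss:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Affine (asymmetric) INT8 quantization: x ~ scale * (q - zero_point)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.float() - zero_point)

x = torch.randn(1000)                 # stand-in for FP32 weights/activations
q, s, zp = quantize_int8(x)
x_hat = dequantize(q, s, zp)
print("max quantization error:", (x - x_hat).abs().max().item())
```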
6. Introduction to AIMET
AIMET (AI Model Efficiency Toolkit) provides model quantization and compression techniques for AI models.
Reference: https://github.com/quic/aimet
7. AIMET Features
● Supports model quantization and compression techniques
● Supports both TensorFlow and PyTorch
● Benchmarks and tests for many models
● User-friendly APIs
● Provides visualization tools for debugging and analyzing models
8. AIMET Model Quantization Use Cases
● Post-Training Quantization (PTQ):
Performs quantization after the model has been trained
● Quantization-Aware Training (QAT):
Applies fine-tuning to recover the accuracy degradation caused by quantization
● Quantization Simulation:
Predicts on-target accuracy before deploying the model to hardware
9. Post-Training Quantization (PTQ)
PTQ:
● Performs quantization after the model has been trained, without retraining
Features of PTQ:
● PTQ methods can be data-free
● PTQ methods can also perform range analysis using calibration data
  ○ To determine the step size for activations
  ○ The step size for weights can be determined without any data
AutoQuant:
● AIMET provides the AutoQuant feature, which analyzes the model, determines the best sequence of quantization techniques, and applies them (see the sketch below).
● AutoQuant saves time and automates the quantization of neural networks.
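The snippet below is a rough sketch of how AutoQuant is invoked in aimet_torch. The constructor and apply() argument names follow the AIMET v1 documentation and are assumptions to verify against your installed release; evaluate(), val_loader, unlabeled_loader, fp32_model, and dummy_input are user-supplied placeholders:

```python
# Rough sketch of the aimet_torch AutoQuant flow (v1-style API; argument
# names may differ across AIMET releases, so check the official docs).
from aimet_torch.auto_quant import AutoQuant

def eval_callback(model, num_samples=None):
    # User-supplied: return a float accuracy for `model` on validation data.
    return evaluate(model, val_loader)

auto_quant = AutoQuant(
    allowed_accuracy_drop=0.01,                   # tolerate at most 1% drop vs. FP32
    unlabeled_dataset_iterable=unlabeled_loader,  # calibration data, labels not needed
    eval_callback=eval_callback,
)
# AutoQuant tries PTQ techniques (CLE/BC, AdaRound, ...) in sequence until
# the accuracy target is met, and returns the best quantized model found.
quantized_model, accuracy, encoding_path = auto_quant.apply(
    fp32_model, dummy_input_on_cpu=dummy_input)
```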
10. ยฉ GO Inc.
โ Designed to find the best combination of quantization methods to maximize model performance
โ AutoQuant applies these optimization for better performance:
โ Cross-Layer Equalization๏ผCLE๏ผ:
Equalizes weight ranges in consecutive layers
โ Markus Nagel, Mart van Baalen ใData-Free Quantization Through Weight Equalization and Bias Correctionใ
https://arxiv.org/pdf/1906.04721.pdf
โ Bias Correction๏ผBC๏ผ:
Focuses on correcting the bias parameters of individual layers in the quantized model
โ Markus Nagel, Mart van Baalen ใData-Free Quantization Through Weight Equalization and Bias Correctionใ
https://arxiv.org/pdf/1906.04721.pdf
โ Adaptive Rounding (AdaRound๏ผ:
Determines optimal rounding for weight tensors to improve quantized performance.
โ Markus Nagel, Rana Ali Amjad ใUp or Down? Adaptive Rounding for Post-Training Quantizationใ
https://arxiv.org/pdf/2004.10568.pdf
AutoQuant
11. Workflow of AutoQuant
https://quic.github.io/aimet-pages/releases/latest/user_guide/auto_quant.html#ug-auto-quant
12. Cross-Layer Equalization & Bias Correction
Cross-Layer Equalization (CLE)
● Equalizes the weight ranges of consecutive layers by exploiting the scale-equivariance property of activation functions (see the sketch below).
● Especially beneficial for models with depthwise-separable convolution layers.
Bias Correction
● Corrects shifts in layer outputs introduced by quantization: when the noise due to weight quantization is biased, it shifts the expected output.
● Adapts a layer's bias parameter with a correction term to compensate for the biased noise.
https://quic.github.io/aimet-pages/releases/latest/user_guide/post_training_quant_techniques.html#ug-post-training-quantization
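A toy PyTorch sketch of the CLE rescaling from Nagel et al. (2019): two linear layers joined by ReLU, which satisfies ReLU(s·x) = s·ReLU(x) for s > 0, so each channel can be rescaled by s = sqrt(r1/r2) to equalize the two weight ranges without changing the network function:

```python
import torch

def cross_layer_equalize(w1, b1, w2):
    """w1: [out, in] of layer 1; b1: [out]; w2: [out2, out] of layer 2."""
    r1 = w1.abs().max(dim=1).values   # per-output-channel range of layer 1
    r2 = w2.abs().max(dim=0).values   # per-input-channel range of layer 2
    s = torch.sqrt(r1 / r2)           # equalizing scale per channel
    w1_eq = w1 / s[:, None]           # shrink layer-1 rows ...
    b1_eq = b1 / s
    w2_eq = w2 * s[None, :]           # ... and grow layer-2 columns to compensate
    return w1_eq, b1_eq, w2_eq

w1, b1, w2 = torch.randn(8, 4), torch.randn(8), torch.randn(3, 8)
w1e, b1e, w2e = cross_layer_equalize(w1, b1, w2)
x = torch.randn(4)
y_orig = w2 @ torch.relu(w1 @ x + b1)
y_eq = w2e @ torch.relu(w1e @ x + b1e)
print(torch.allclose(y_orig, y_eq, atol=1e-5))  # True: the function is unchanged
```

In AIMET itself, CLE is exposed as a one-call, whole-model API (aimet_torch.cross_layer_equalization.equalize_model at the time of writing; treat the exact entry point as an assumption and verify against your release).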
13. AdaRound
Markus Nagel, Rana Ali Amjad et al., “Up or Down? Adaptive Rounding for Post-Training Quantization”, https://arxiv.org/pdf/2004.10568.pdf
AIMET uses the “nearest rounding” technique by default to achieve quantization:
● With nearest rounding, each weight value is quantized to the nearest integer grid point.
● The AdaRound feature instead allows a weight value to be quantized to the integer grid point farther from it, when doing so improves the quantized model (see the toy example below).
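The toy example below is not the actual AdaRound algorithm (which optimizes a continuous relaxation layer by layer; see the paper), but it illustrates the core idea: choosing floor or ceil per weight to minimize the layer output error can beat per-weight nearest rounding:

```python
import itertools
import torch

torch.manual_seed(0)
w = torch.randn(6) * 3          # toy "layer" weights, already in step-size units
x = torch.randn(100, 6)         # calibration inputs
y = x @ w                       # FP32 layer output

w_nearest = torch.round(w)      # default: round-to-nearest per weight
best_err, best_w = float("inf"), None
# Brute force: try every floor/ceil choice per weight (2^6 combinations).
for choice in itertools.product([torch.floor, torch.ceil], repeat=len(w)):
    w_q = torch.stack([f(wi) for f, wi in zip(choice, w)])
    err = ((x @ w_q - y) ** 2).mean().item()
    if err < best_err:
        best_err, best_w = err, w_q

nearest_err = ((x @ w_nearest - y) ** 2).mean().item()
print(f"nearest rounding MSE: {nearest_err:.4f}, adaptive rounding MSE: {best_err:.4f}")
```

Because round-to-nearest is one of the enumerated combinations, the adaptive choice is never worse, and on layer output error it is usually strictly better.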
14. AdaRound Techniques
● AdaRound results compared with the baseline:
Chirag Patel, Tijmen Blankevoort, “Intelligence at scale through AI model efficiency”, https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/presentation_-_intelligence_at_scale_through_ai_model_efficiency.pdf
15. Quantization-Aware Training (QAT)
● Simulate quantization noise in the forward pass (a minimal sketch of such a simulation op follows this slide)
● Fine-tune using training data
[Figure: weight-quant and activation-quant simulation ops are added automatically at appropriate places in the model graph, around Conv/FC, bias add, and ReLU; backprop flows through them]
● Learn quantization parameters (QAT with Range Learning)
● Fine-tune model weights
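Below is a minimal PyTorch sketch, not AIMET's implementation, of the fake-quantization op such a graph inserts: the forward pass applies quantize-dequantize so the network sees quantization noise, while a straight-through estimator (STE) passes gradients during backprop:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in forward; straight-through gradient in backward."""
    @staticmethod
    def forward(ctx, x, scale, qmin=-128, qmax=127):
        q = torch.clamp(torch.round(x / scale), qmin, qmax)
        return q * scale                      # dequantized value carries quantization noise

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pretend round() has derivative 1. A real implementation would
        # also zero the gradient outside the [qmin, qmax] clamp range.
        return grad_output, None, None, None

x = torch.randn(4, requires_grad=True)
y = FakeQuant.apply(x, torch.tensor(0.1))
y.sum().backward()
print(x.grad)  # all ones: the gradient flowed straight through the rounding
```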
16. Quantization-Aware Training (QAT)
Two modes of QAT are supported by AIMET (selection shown in the sketch below):
1. Regular QAT:
● Updated:
  ○ Trainable parameters such as weights and biases
● Held constant:
  ○ Scale and offset quantization parameters
2. QAT with Range Learning:
● Updated:
  ○ Trainable parameters such as weights and biases
  ○ Scale/offset parameters for weight quantizers
  ○ Scale/offset parameters for activation quantizers
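In aimet_torch, the mode is selected with the quant_scheme argument of QuantizationSimModel. The enum member names below follow the AIMET documentation at the time of writing; treat them as assumptions and verify against your release:

```python
# Hedged sketch: enum and argument names follow the AIMET docs and may
# differ across releases.
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
dummy_input = torch.randn(1, 3, 32, 32)

# Regular QAT: encodings are computed once during calibration, then frozen.
sim_regular = QuantizationSimModel(
    model, dummy_input=dummy_input,
    quant_scheme=QuantScheme.post_training_tf_enhanced)

# QAT with Range Learning: scale/offset are trained alongside the weights.
sim_range_learning = QuantizationSimModel(
    model, dummy_input=dummy_input,
    quant_scheme=QuantScheme.training_range_learning_with_tf_init)
```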
17. Quantization Simulation
AIMET's Quantization Simulation provides functionality to simulate the behavior of the quantized model on target hardware.
https://quic.github.io/aimet-pages/AimetDocs/user_guide/quantization_sim.html
18. Quantization Simulation
https://quic.github.io/aimet-pages/AimetDocs/user_guide/quantization_sim.html
● AIMET can simulate the quantization noise
● Since the dequantized value may not be exactly the same as the original floating-point value, the difference between the two values is the quantization noise.
19. Quantization Simulation
https://quic.github.io/aimet-pages/AimetDocs/user_guide/quantization_sim.html
● AIMET analyzes the model and determines the optimal quantization encodings per layer (see the workflow sketch below)
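A sketch of the typical aimet_torch simulation workflow, assuming the v1-style QuantizationSimModel API (verify the exact signatures against the linked docs): build the sim, calibrate encodings with a few unlabeled batches, then evaluate sim.model like an ordinary PyTorch model:

```python
# Sketch of the aimet_torch QuantizationSimModel workflow (v1-style API;
# verify the exact names against the AIMET docs for your release).
import torch
from aimet_torch.quantsim import QuantizationSimModel

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
dummy_input = torch.randn(1, 3, 32, 32)

def pass_calibration_data(sim_model, _):
    # Run a few unlabeled batches so AIMET can observe activation ranges
    # and derive quantization encodings (scale/offset) per layer.
    with torch.no_grad():
        for _ in range(8):
            sim_model(torch.randn(4, 3, 32, 32))  # replace with real calibration data

sim = QuantizationSimModel(model, dummy_input=dummy_input,
                           default_param_bw=8,    # INT8 weights
                           default_output_bw=8)   # INT8 activations
sim.compute_encodings(forward_pass_callback=pass_calibration_data,
                      forward_pass_callback_args=None)
# sim.model now mimics the quantized model; evaluate it like any PyTorch model.
```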
20. Quantization Simulation
Accuracy of CV models on the AIMET simulator without QAT, compared with the PyTorch and SNPE accuracy:
● AIMET-quantized models can achieve good accuracy, comparable to the floating-point models.
● Gap between AIMET quant and SNPE quant:
  ○ Execution on different runtimes (GPU and DSP) can lead to different results.
  ○ The default quantization algorithm in AIMET may not be fully aligned with the algorithm used on the hardware.
Model (accuracy) | PyTorch official (GPU) | PyTorch (CPU) | AIMET quant (GPU) | SNPE quant (DSP)
ResNet18         | 69.758%                | 69.76%        | 69.608%           | 69.294%
ResNet50         | 76.13%                 | 76.146%       | 75.86%            | 75.422%
MobileNetV2      | 71.878%                | 71.87%        | 71.164%           | 69.226%
InceptionV3      | 77.294%                | 77.472%       | 76.564%           | 76.842%
SNPE: Snapdragon Neural Processing Engine
DSP: Digital Signal Processor
https://pytorch.org/vision/main/models.html
21. Summary
Pros:
● AIMET provides QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques to improve the accuracy of quantized models.
● AIMET is designed with user-friendliness in mind: it offers a user-friendly interface and clear documentation.
● AIMET offers debugging tools and visualization capabilities.
Cons:
● Quantization simulation may ignore hardware-specific effects that affect model performance.