INTEGER QUANTIZATION FOR DEEP LEARNING INFERENCE: PRINCIPLES AND EMPIRICAL EVALUATION
NVIDIA, Georgia Tech.
arXiv 2020
Presenter: Jemin Lee
https://leejaymin.github.io/index.html
7 Oct. 2020
Neural Acceleration Study Season #3
Contents
• Introduction and Background
• Quantization Fundamentals
• Post-Training Quantization
• Techniques to Recover Accuracy
• Recommended Workflow
• Conclusion
Introduction
• Low-precision formats offer several performance benefits
• First, many processors provide higher-throughput math pipelines for low-bit
formats, which can speed up math-intensive operations, such as convolutions and
matrix multiplications.
• Second, smaller word sizes reduce memory bandwidth pressure, improving
performance for bandwidth-limited computations.
• Third, smaller word sizes lead to lower memory size requirements, which can
improve cache utilization as well as other aspects of memory-system operation.
Introduction
• Review the mathematical fundamentals underlying various
integer quantization choices (Section 3) as well as techniques
for recovering accuracy lost due to quantization (Section 5).
Section 6 combines this information into a recommended
workflow.
Related Work
• Uniform [7,59] quantization enables the use of integer or fixed-
point math pipelines, allowing computation to be performed in the
quantized domain.
• Non-uniform [13, 60, 34, 2] quantization requires dequantization, e.g.,
a codebook lookup, before computing in higher precision,
limiting its benefits to model compression and bandwidth reduction.
• This paper focuses on leveraging quantization to accelerate
computation, so we will restrict our focus to uniform quantization
schemes.
• In this paper, we present an evaluation of int8 quantization on all of
the major network architectures with both PTQ and QAT.
Quantization Fundamentals
• Uniform quantization steps
• 1) Choose the range of the real numbers to be quantized, clamping the
values outside this range.
• 2) Map the real values to integers representable by the bit-width of the
quantized representation (round each mapped real value to the closest
integer value).
• Quantize: convert a real number to a quantized integer
representation (e.g. from fp32 to int8).
• Dequantize: convert a number from quantized integer
representation to a real number (e.g. from int32 to fp16).
Quantization Fundamentals: Affine Quantization (Asymmetric)
• f(x) = s · x + z
• s is the scale factor and z is the zero-point
• For int8, with the chosen real range [β, α]:
• s = 255 / (α − β), z = −round(β · s) − 128
Quantization Fundamentals: Scale Quantization (Symmetric)
• Real values in [−α, α] map to the integer range [−127, 127]; the scale factor is s = 127/α and the zero-point is 0.
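To make the two schemes concrete, here is a minimal numpy sketch of affine and scale quantization as defined above; function and variable names are illustrative, not from the paper.

import numpy as np

def affine_quantize(x, beta, alpha):
    """Asymmetric: map reals in [beta, alpha] to int8 in [-128, 127]."""
    s = 255.0 / (alpha - beta)                 # scale factor
    z = -np.round(beta * s) - 128              # zero-point
    xq = np.clip(np.round(s * x + z), -128, 127).astype(np.int8)
    return xq, s, z

def affine_dequantize(xq, s, z):
    """Approximate inverse: recover real values from int8."""
    return (xq.astype(np.float32) - z) / s

def scale_quantize(x, alpha):
    """Symmetric: map reals in [-alpha, alpha] to int8 in [-127, 127]."""
    s = 127.0 / alpha                          # zero-point is 0
    xq = np.clip(np.round(s * x), -127, 127).astype(np.int8)
    return xq, s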
Quantization Fundamentals: Tensor Quantization Granularity
• At the coarsest, per-tensor granularity, the same quantization
parameters are shared by all elements in the tensor.
• The finest granularity would have individual quantization
parameters per element.
• Practical intermediate granularities: per-row or per-column for 2D matrices,
• per-channel for 3D (image-like) tensors.
Quantization Fundamentals: Tensor Quantization Granularity
• Integer matrix multiplication
• per-row or per-tensor for activations
• per-column or per-tensor for weights
• With per-row activation scales s_x,i and per-column weight scales s_w,j, the output is Y_ij ≈ (1 / (s_x,i · s_w,j)) Σ_k X_q,ik · W_q,kj
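A minimal numpy sketch of that integer matmul with scale quantization; in real deployments the int8 GEMM runs on dedicated integer pipelines, and the names here are illustrative.

import numpy as np

def quantized_matmul(xq, sx, wq, sw):
    # xq: [m, k] int8 activations with per-row scales sx: [m]
    # wq: [k, n] int8 weights with per-column scales sw: [n]
    acc = xq.astype(np.int32) @ wq.astype(np.int32)   # integer dot products
    return acc.astype(np.float32) / np.outer(sx, sw)  # dequantize the result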
Quantization Fundamentals: Tensor Quantization Granularity
• For activations, only per-tensor quantization is practical for
performance reasons.
• The number of rows equals the number of instances in the batch,
which can vary at inference time.
• This prevents per-row scale factors from being computed offline
(a fixed per-row factor would not be meaningful across different instances in a
mini-batch), while determining them online imposes a compute overhead and in
some cases results in poor accuracy (see the Dynamic Quantization discussion in [58]).
Quantization Fundamentals: Tensor Quantization Granularity
• The corresponding granularity to per-column in convolutions is
per-kernel, or equivalently per-output-channel since each
kernel produces a separate output channel [27, 29].
• This is commonly referred to as "per-channel" weight
quantization in the literature, and we follow that convention [21,
25, 26, 38, 46]. We examine the impact of granularity on accuracy
in Section 4.1.
Quantization Fundamentals: Computational Cost of Affine Quantization
• Breakdown of the three terms:
• The first term is the integer dot product, just as in scale quantization (Equation 10).
• The second term involves only integer weights and zero-points, so it can be computed offline,
adding only an element-wise addition at inference time.
• If the layer has a bias, this term can be folded in without increasing inference cost.
• The third term involves the quantized input matrix Xq, and thus cannot be computed
offline.
• This overhead can reduce or even eliminate the throughput advantage that integer math
pipelines have over reduced-precision floating-point.
• The extra computation is incurred only if affine quantization is used for the
weight matrix.
• Therefore: use scale quantization for weights.
• Affine quantization may still be used for activations.
Quantization Fundamentals: Calibration
• Max: Use the maximum absolute value seen during calibration [52].
• Entropy: Use KL divergence to minimize information loss between the original floating-point
values and values that could be represented by the quantized format. This is the default method
used by TensorRT [36].
• Percentile: Set the range to a percentile of the distribution of absolute values seen during
calibration [33]. For example, 99% calibration would clip 1% of the largest magnitude values.
• Clipping the distribution trades off a large clipping error on a few outlier values for smaller
rounding errors on a majority of the values.
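As an illustration, a minimal numpy sketch of the max and percentile calibrators above; the entropy method additionally searches histogram clip thresholds by KL divergence and is omitted here. Names are illustrative, not from TensorRT.

import numpy as np

def max_calibrate(values):
    """Range = maximum absolute value; nothing is clipped."""
    return np.abs(values).max()

def percentile_calibrate(values, pct=99.99):
    """Range = pct-th percentile of |values|; the largest outliers are clipped."""
    return np.percentile(np.abs(values), pct)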
Post-Training Quantization
• Quantization parameters are calibrated offline by processing the trained model weights and
activations generated by running inference on a sample dataset; no further training is involved.
Post-Training Quantization
• Values refer to the relative accuracy change, computed as (acc_int8 − acc_fp32) / acc_fp32
Post-Training Quantization: Weight Quantization
• Int8 weights: max calibration
• However, as BN parameters are learned per channel, their folding can result in significantly
different weight value distributions across channels.
• Table 4 reports per-channel (per-column for linear layers) granularity and indicates that max
calibration is sufficient to maintain accuracy when quantizing weights to int8. The rest of the
experiments in this paper use per-channel max calibration for weights.
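A minimal sketch of per-channel max calibration for weights, assuming a PyTorch-style [out_ch, in_ch, kH, kW] layout; this is an illustration, not the paper's code.

import numpy as np

def per_channel_weight_scales(w):
    """One symmetric int8 scale per output channel (kernel)."""
    w2d = w.reshape(w.shape[0], -1)            # flatten each kernel
    alpha = np.abs(w2d).max(axis=1)            # per-channel max |w|
    return 127.0 / alpha                       # one scale factor per channel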
Post-Training Quantization: Activation Quantization
• For most networks, at least one activation calibration method achieves
acceptable accuracy,
• except for MobileNets, EfficientNets, Transformer, and BERT, where the accuracy drop is larger than 1%.
• No single calibration method is best for all networks.
Techniques to Recover Accuracy
• Partial quantization: leaves the most sensitive layers unquantized
• One can also train networks with quantization in the loop.
• There are also approaches that jointly learn the model weights and the quantization parameters.
Techniques to Recover Accuracy: Partial Quantization
• Partial quantization: leaves the most sensitive layers unquantized
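A minimal sketch of the underlying sensitivity analysis: quantize one layer at a time, measure the accuracy drop, and leave the most sensitive layers in floating point. The evaluate and quantize_only helpers are assumed stand-ins, not the paper's code.

def sensitivity_ranking(model, layer_names, evaluate, quantize_only):
    """Rank layers by accuracy drop when quantized in isolation."""
    baseline = evaluate(model)
    drops = {}
    for name in layer_names:
        quantized = quantize_only(model, name)   # quantize just this layer
        drops[name] = baseline - evaluate(quantized)
    # largest drop first: candidates to keep in floating point
    return sorted(drops, key=drops.get, reverse=True)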
Techniques to Recover Accuracy: Quantization-Aware Training
• Intuition for QAT
• The quantized model may have converged to a "narrow" minimum, where a small change in the weights leads to a large change in the loss.
By training with quantization, i.e., computing gradients with respect to the quantized weights (Figure 6b), we may avoid these
narrow minima: they produce larger gradients, potentially allowing the model to explore the loss landscape for "wide" [22]
or "flat" [16, 30] minima, where the quantization points have lower loss and thus higher accuracy.
Techniques to Recover Accuracy: Quantization-Aware Training
• Fake Quantization
• FakeQuant(x) = DQ(Q(x)) = INT8(s*x) / s
• Problem: The quantization function is non-differentiable
• Solution: Straight-Through Estimator (STE)
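A minimal PyTorch sketch of fake quantization with STE: the forward pass applies quantize-then-dequantize, and the backward pass treats the rounding as the identity. This is a simplified illustration, not the paper's training code.

import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, s):
        # DQ(Q(x)): quantize to the int8 grid, then immediately dequantize
        return torch.clamp(torch.round(s * x), -127, 127) / s

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient straight through the rounding step
        return grad_output, None

During fine-tuning, FakeQuant.apply(x, s) replaces x in the forward pass so the network trains against quantized values while gradients still flow.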
Techniques to Recover Accuracy: Quantization-Aware Training
• Result
Techniques to Recover Accuracy: Learning Quantization Parameters
• Apply the same fine-tuning schedule as before,
• but allow the range of each quantized activation tensor to be learned along with the weights, rather than keeping
it fixed throughout fine-tuning.
• Use PACT [1].
• Fixed best: best accuracy with PTQ, as described in Table 5.
• Use max calibration for activation quantization.
• Learning the ranges results in higher accuracy than keeping them fixed for most networks.
• Starting from the best calibration, learning the ranges generates very similar results,
• i.e., it offers no extra benefit for int8 QAT if activation ranges are already carefully calibrated.
• PACT could yield better results with longer fine-tuning or a separate fine-tuning process.
[1] PACT: Parameterized Clipping Activation for Quantized Neural Networks, ICLR 2018
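A minimal PyTorch sketch of the PACT clipping function [1], in which the activation clip value alpha is a learnable parameter; this is simplified, and the full PACT training recipe has more pieces.

import torch
import torch.nn as nn

class PACTClip(nn.Module):
    def __init__(self, alpha_init=6.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x):
        # clip(x, 0, alpha) written as 0.5 * (|x| - |x - alpha| + alpha)
        # so that gradients flow to the learnable clip value alpha
        return 0.5 * (x.abs() - (x - self.alpha).abs() + self.alpha)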
Techniques to Recover Accuracy: Learning Quantization Parameters
• Complex fine-tuning procedure
Recommended Workflow (Recap)
• Weights:
• Per-column (FC), per-channel (conv.)
• Symmetric (high performance)
• Max calibration (static)
• Activations:
• Per tensor, symmetric
• PTQ
• Select the activation calibration method among max, entropy, 99.99%, and 99.999% percentile.
• If none yields acceptable accuracy, move to partial quantization or QAT.
• Partial Quantization:
• Perform sensitivity analysis to identify the most sensitive layers
and leave them in floating-point.
• If accuracy is still insufficient, move on to QAT.
• QAT:
• Start from the best calibrated quantized model.
• Fine-tune with QAT for around 10% of the original training schedule,
with an annealing learning-rate schedule starting at 1% of the initial training learning rate.
• For details, refer to Appendix A.2.
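Read as pseudocode, the workflow might look like the sketch below; every helper (calibrate, evaluate, partial_quantize, qat_finetune) is an assumed stand-in, not an API from the paper.

def quantize_workflow(model, fp32_acc, calibrate, evaluate,
                      partial_quantize, qat_finetune, max_drop=0.01):
    # PTQ: try each calibration method and keep the most accurate model
    candidates = [calibrate(model, method)
                  for method in ("max", "entropy", "99.99%", "99.999%")]
    best = max(candidates, key=evaluate)
    if (fp32_acc - evaluate(best)) / fp32_acc <= max_drop:
        return best                        # PTQ alone is sufficient
    partial = partial_quantize(best)       # leave sensitive layers in fp32
    if (fp32_acc - evaluate(partial)) / fp32_acc <= max_drop:
        return partial
    return qat_finetune(best)              # ~10% of the original schedule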
Conclusion
• Review the mathematical background for integer quantization of neural networks, as
well as some performance-related reasons for choosing quantization parameters.
• Empirically evaluate various choices for int8 quantization of a variety of models, leading
to a quantization workflow proposal.
• MobileNets and BERT are challenging to quantize.
• The workflow combines post-training quantization, partial quantization, and quantization-aware fine-tuning
techniques.
• Some more complex techniques, such as ADMM and distillation, were not required for
int8 quantization of these models.
• However, these techniques should be evaluated when quantizing to even lower-bit
integer representations, which we leave to future work.
Thank you