INTEGER QUANTIZATION FOR DEEP
LEARNING INFERENCE: PRINCIPLES
AND EMPIRICAL EVALUATION
NVIDIA, Georgia Tech.
arXiv 2020
Presenter: Jemin Lee
https://leejaymin.github.io/index.html
7 Oct. 2020
Neural Acceleration Study Season #3
Contents
• Introduction and Background
• Quantization Fundamentals
• Post-Training Quantization
• Techniques to Recover Accuracy
• Recommended Workflow
• Conclusion
Neural Acceleration Study #3
Introduction
• Low-precision formats offer several performance benefits
• First, many processors provide higher-throughput math pipelines for low-bit
formats, which can speed up math-intensive operations such as convolutions and
matrix multiplications.
• Second, smaller word sizes reduce memory bandwidth pressure, improving
performance for bandwidth-limited computations.
• Third, smaller word sizes lead to lower memory size requirements, which can
improve cache utilization as well as other aspects of memory-system operation.
Neural Acceleration Study #3
Introduction
• Review the mathematical fundamentals underlying various
integer quantization choices (Section 3) as well as techniques
for recovering accuracy lost due to quantization (Section 5).
Section 6 combines this information into a recommended
workflow.
Neural Acceleration Study #3
Related Work
• Uniform [7,59] quantization enables the use of integer or fixed-
point math pipelines, allowing computation to be performed in the
quantized domain.
• Non-uniform [13, 60, 34, 2] quantization requires dequantization, e.g.
a codebook lookup, before doing computation in higher precision,
limiting its benefits to model compression and bandwidth reduction.
• This paper focuses on leveraging quantization to accelerate
computation, so we will restrict our focus to uniform quantization
schemes.
• In this paper, we present an evaluation of int8 quantization on all of
the major network architectures with both PTQ and QAT.
Neural Acceleration Study #3
Quantization Fundamentals
• Uniform quantization steps
• 1) Choose the range of the real numbers to be quantized, clamping the
values outside this range.
• 2) Map the real values to integers representable by the bit-width of the
quantized representation (round each mapped real value to the closest
integer value).
• Quantize: convert a real number to a quantized integer
representation (e.g. from fp32 to int8).
• Dequantize: convert a number from quantized integer
representation to a real number (e.g. from int32 to fp16).
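A minimal NumPy sketch of these two steps (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def quantize(x, s, z=0, qmin=-128, qmax=127):
    # Real -> int8: scale, shift by the zero-point, round, clamp.
    return np.clip(np.round(s * x + z), qmin, qmax).astype(np.int8)

def dequantize(q, s, z=0):
    # int -> real: undo the shift and scale (e.g. on the int32 accumulator).
    return (q.astype(np.float32) - z) / s

x = np.float32([0.1, -1.2, 0.7])
s = 127.0 / np.abs(x).max()
print(dequantize(quantize(x, s), s))  # close to x, up to rounding error
```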
Neural Acceleration Study #3
Quantization Fundamentals: Affine Quantization
(Asymmetric)
Neural Acceleration Study #3
• $f(x) = s \cdot x + z$
• $s$ is the scale factor and $z$ is the zero-point
• int8:
• $s = \frac{255}{\beta - \alpha}$, $z = -\mathrm{round}(\alpha \cdot s) - 128$
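A sketch of computing these parameters for a real range $[\alpha, \beta]$ (a minimal illustration of the formulas above; names are my own):

```python
import numpy as np

def affine_int8_params(alpha, beta):
    # s = 255 / (beta - alpha); z chosen so that alpha maps to -128.
    s = 255.0 / (beta - alpha)
    z = -np.round(alpha * s) - 128
    return s, z

def affine_quantize(x, s, z):
    return np.clip(np.round(s * x + z), -128, 127).astype(np.int8)

s, z = affine_int8_params(-0.5, 1.5)
print(affine_quantize(np.float32([-0.5, 0.0, 1.5]), s, z))  # endpoints hit -128 and 127
```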
Quantization Fundamentals: Scale Quantization
(Symmetric)
Neural Acceleration Study #3
• integer range [−127, 127]
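A matching sketch for scale (symmetric) quantization, assuming the range is set by the maximum absolute value (amax); no zero-point is needed:

```python
import numpy as np

def scale_int8_params(amax):
    # Symmetric: map [-amax, amax] onto [-127, 127].
    return 127.0 / amax

def scale_quantize(x, s):
    return np.clip(np.round(s * x), -127, 127).astype(np.int8)
```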
Quantization Fundamentals: Tensor Quantization
Granularity
Neural Acceleration Study #3
• At the coarsest, per-tensor granularity, the same quantization
parameters are shared by all elements in the tensor.
• The finest granularity would have individual quantization
parameters per element.
• In between these two extremes: per-row or per-column for 2D matrices,
• per-channel for 3D (image-like) tensors.
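For example (a sketch with illustrative shapes), per-tensor and per-output-channel scales for a conv weight differ only in how amax is reduced:

```python
import numpy as np

w = np.random.randn(64, 3, 3, 3).astype(np.float32)  # (out_ch, in_ch, kh, kw)

s_tensor  = 127.0 / np.abs(w).max()                       # one scale for the whole tensor
s_channel = 127.0 / np.abs(w).reshape(len(w), -1).max(1)  # one scale per output channel

wq = np.clip(np.round(w * s_channel[:, None, None, None]), -127, 127).astype(np.int8)
```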
Quantization Fundamentals: Tensor Quantization
Granularity
Neural Acceleration Study #3
• Integer matrix multiplication
• per-row or per-tensor for activations
• per-column or per-tensor for weights
• The dequantized output is $Y_{ij} \approx \frac{1}{s_{x_i}\, s_{w_j}} \sum_k X_{q,ik}\, W_{q,kj}$ (integer matmul, then per-element rescaling); see the sketch below.
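A sketch of the computation (shapes and scale values are assumptions; int32 accumulation as in real int8 pipelines):

```python
import numpy as np

Xq = np.random.randint(-127, 128, (8, 16)).astype(np.int8)  # activations, per-row scales sx
Wq = np.random.randint(-127, 128, (16, 4)).astype(np.int8)  # weights, per-column scales sw
sx = np.full(8, 12.7, np.float32)
sw = np.full(4, 25.4, np.float32)

acc = Xq.astype(np.int32) @ Wq.astype(np.int32)  # pure integer matmul
Y = acc / (sx[:, None] * sw[None, :])            # dequantize: divide by sx_i * sw_j
```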
Quantization Fundamentals: Tensor Quantization
Granularity
Neural Acceleration Study #3
• For activations, only per-tensor quantization is practical for performance reasons.
• The number of rows depends on the batch and can vary at inference time.
• This prevents per-row scale factors from being computed offline
• (they would not be meaningful across different instances in a mini-batch), whereas
determining them online imposes a compute overhead and in some cases results in
poor accuracy (see the Dynamic Quantization discussion in [58]).
Quantization Fundamentals: Tensor Quantization
Granularity
Neural Acceleration Study #3
• The corresponding granularity to per-column in convolutions is
per-kernel, or equivalently per-output-channel since each
kernel produces a separate output channel [27, 29].
• This is commonly referred to as “per-channel” weight
quantization in literature and we follow that convention [21,
25, 26, 38, 46]. We examine the granularity impact on accuracy
in Section 4.1.
Quantization Fundamentals: Computational Cost of
Affine Quantization
Neural Acceleration Study #3
• Break the product down into three terms:
• the integer dot product, just as in scale quantization (Equation 10);
• a term involving only integer weights and zero-points, which can be computed offline,
adding only an element-wise addition at inference time;
• if the layer has a bias, this term can be folded in without increasing inference cost;
• a term involving the quantized input matrix Xq, which therefore cannot be computed
offline;
• this is overhead, reducing or even eliminating the throughput advantage that integer math
pipelines have over reduced-precision floating-point.
• The extra computation is incurred only if affine quantization is used for the
weight matrix.
• Therefore: use scale quantization for weights;
• affine quantization could be used for activations.
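Expanding the affine dot product (a reconstruction consistent with the bullets above, not copied from the paper, using dequantization $x \approx (x_q - z_x)/s_x$ and length-$K$ vectors) makes the three terms explicit:

$$\sum_k x_k w_k \approx \frac{1}{s_x s_w}\Big(\underbrace{\sum_k x_{q,k}\, w_{q,k}}_{\text{integer dot product}} \;+\; \underbrace{K z_x z_w - z_x \sum_k w_{q,k}}_{\text{weights and zero-points only: offline}} \;-\; \underbrace{z_w \sum_k x_{q,k}}_{\text{involves } X_q\text{: online overhead}}\Big)$$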
Quantization Fundamentals: Calibration
Neural Acceleration Study #3
• Max: Use the maximum absolute value seen during calibration [52].
• Entropy: Use KL divergence to minimize information loss between the original floating-point
values and values that could be represented by the quantized format. This is the default method
used by TensorRT [36].
• Percentile: Set the range to a percentile of the distribution of absolute values seen during
calibration [33]. For example, 99% calibration would clip 1% of the largest magnitude values.
• Clipping the distribution trades off a large clipping error on a few outlier values for smaller
rounding errors on a majority of the values.
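Two of the three methods are easy to sketch (entropy calibration requires a search over candidate clip thresholds, omitted here; names are illustrative):

```python
import numpy as np

def calibrate_max(samples):
    # Max: range = largest absolute value seen during calibration.
    return np.abs(samples).max()

def calibrate_percentile(samples, pct=99.99):
    # Percentile: clip the (100 - pct)% largest-magnitude values.
    return np.percentile(np.abs(samples), pct)

acts = np.random.randn(1_000_000).astype(np.float32)
s = 127.0 / calibrate_percentile(acts)  # symmetric int8 scale from the calibrated range
```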
Post Training Quantization
Neural Acceleration Study #3
• Quantization parameters are calibrated offline by processing the trained model weights and
activations generated by running inference on a sample dataset; no further training is involved.
Post Training Quantization
Neural Acceleration Study #3
• Results refer to the relative accuracy change, computed as $(\mathrm{acc}_{\mathrm{int8}} - \mathrm{acc}_{\mathrm{fp32}}) / \mathrm{acc}_{\mathrm{fp32}}$
Post Training Quantization: Weight Quantization
Neural Acceleration Study #3
• Int8 weights: max calibration
• However, as BN parameters are learned per channel, their folding can result in significantly
different weight value distributions across channels.
• Table 4 reports per-channel (per-column for linear layers) granularity and indicates that max
calibration is sufficient to maintain accuracy when quantizing weights to int8. The rest of the
experiments in this paper use per-channel max calibration for weights.
Post Training Quantization: Activation Quantization
Neural Acceleration Study #3
• For most of the networks, there is at least one activation calibration method that achieves
acceptable accuracy,
• except for MobileNets, EfficientNets, Transformer and BERT where the accuracy drop is larger than 1%.
• no single calibration is best for all networks.
Techniques to Recover Accuracy
Neural Acceleration Study #3
• Partial quantization: leaves the most sensitive layers unquantized
• One also has an option to train networks with quantization
• there are also approaches that jointly learn the model weights and quantization parameters.
Techniques to Recover Accuracy: Partial Quantization
Neural Acceleration Study #3
• Partial quantization: leaves the most sensitive layers unquantized
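A hypothetical sketch of the sensitivity analysis this relies on (all helpers are caller-supplied placeholders, not a real API): quantize one layer at a time, measure the accuracy drop, and keep the worst offenders in floating-point:

```python
def sensitivity_analysis(model, layers, evaluate, quantize_layer, restore_layer):
    # evaluate / quantize_layer / restore_layer are hypothetical callbacks.
    fp32_acc = evaluate(model)
    drops = {}
    for layer in layers:
        quantize_layer(model, layer)               # quantize only this layer
        drops[layer] = fp32_acc - evaluate(model)  # accuracy lost by this layer alone
        restore_layer(model, layer)                # back to fp32
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```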
Techniques to Recover Accuracy: Quantization Aware
Training
Neural Acceleration Study #3
• Intuition for QAT
If the model has converged to a “narrow” minimum, a small change in the weights leads to a large change in the loss. By
training with quantization, we may avoid these narrow minima by computing gradients with respect to the
quantized weights, as shown in Figure 6b. In doing so, narrow minima produce larger gradients, potentially allowing
the model to explore the loss landscape for “wide” [22] or “flat” [16, 30] minima, where quantization points have lower
loss, and thus higher accuracy.
Techniques to Recover Accuracy: Quantization Aware
Training
Neural Acceleration Study #3
• Fake Quantization
• FakeQuant(x) = DQ(Q(x)) = INT8(s*x) / s
• Problem: The quantization function is non-differentiable
• Solution: Straight-Through Estimator (STE)
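A minimal PyTorch sketch of fake quantization with an STE backward (for simplicity it also passes gradients through outside the clip range):

```python
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, s):
        # DQ(Q(x)): round onto the int8 grid, clamp, map back to float.
        return torch.clamp(torch.round(s * x), -127, 127) / s

    @staticmethod
    def backward(ctx, grad_out):
        # STE: treat round() (and here also the clamp) as the identity.
        return grad_out, None  # no gradient for the scale s

x = torch.randn(4, requires_grad=True)
y = FakeQuant.apply(x, 127.0 / x.abs().max().item())
y.sum().backward()  # x.grad is all ones, thanks to the STE
```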
Techniques to Recover Accuracy: Quantization Aware
Training
Neural Acceleration Study #3
• Result
Techniques to Recover Accuracy: Learning Quantization
Parameters
Neural Acceleration Study #3
• Apply the same fine-tuning schedule as before.
• but allow the ranges of each quantized activation tensor to be learned along with the weights, as opposed to keeping
them fixed throughout fine-tuning.
• Use PACT[1]
• Fixed best: best accuracy with PTQ as described in Table 5.
• Use max calibration for activation quantization
• Learning the range results in higher accuracy than keeping it fixed for most networks.
• With the best calibration, learning the ranges generates very similar results.
• Learned ranges offer no extra benefit for int8 QAT if activation ranges are already carefully calibrated.
• PACT could yield better results with longer fine-tuning or a separate fine-tuning process.
[1] PACT: Parameterized clipping activation for quantized neural networks, ICLR 2018
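An illustrative PACT-style module (my sketch, not the paper's or PACT's code): activations are clipped at a learnable α, then fake-quantized with an STE:

```python
import torch
import torch.nn as nn

class PACTQuant(nn.Module):
    def __init__(self, alpha0=6.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha0))  # learnable clip range

    def forward(self, x):
        y = torch.minimum(torch.relu(x), self.alpha)  # clip(x, 0, alpha); grad reaches alpha
        s = 127.0 / self.alpha.detach()               # scale from the current clip range
        return y + (torch.round(s * y) / s - y).detach()  # STE fake quantization
```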
Techniques to Recover Accuracy: Learning Quantization
Parameters
Neural Acceleration Study #3
• Complex fine-tuning procedure
[1] PACT: Parameterized clipping activation for quantized neural networks, ICLR 2018
Recommended Workflow (Recap)
• Weights:
• Per-column (FC), per-channel (conv.)
• Symmetric (high performance)
• Max calibration (static)
• Activations:
• Per tensor, symmetric
• PTQ
• Select the activation calibration method from
max, entropy, 99.99%, and 99.999% percentile.
• If none maintains accuracy, fall back to partial quantization or QAT.
• Partial Quantization:
• Perform sensitivity analysis to identify the most sensitive layers
and leave them in floating-point.
• If accuracy is still insufficient, proceed to QAT.
• QAT:
• Start from the best calibrated quantized model.
• Use QAT to fine-tune for around 10% of the original training schedule,
with an annealing learning-rate schedule starting at
1% of the initial training learning rate.
• For details, refer to Appendix A.2.
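Compressed into pseudocode (a sketch of the workflow above; the `ops` callbacks are hypothetical placeholders, not a real API):

```python
def quantize_workflow(model, calib_data, acc_target, ops):
    # ops supplies the concrete steps: quantize_weights, calibrate_activations,
    # evaluate, partial_quantization, qat_finetune (all hypothetical).
    ops.quantize_weights(model)  # per-channel/per-column, symmetric, max calibration
    for method in ("max", "entropy", "99.99%", "99.999%"):    # PTQ
        ops.calibrate_activations(model, calib_data, method)  # per-tensor, symmetric
        if ops.evaluate(model) >= acc_target:
            return model
    model = ops.partial_quantization(model)  # leave most sensitive layers in fp32
    if ops.evaluate(model) >= acc_target:
        return model
    # QAT: ~10% of the original schedule, LR annealed from 1% of the initial LR
    return ops.qat_finetune(model, epochs_frac=0.10, lr_frac=0.01)
```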
Neural Acceleration Study #3
Conclusion
• Review the mathematical background for integer quantization of neural networks, as
well as some performance-related reasons for choosing quantization parameters.
• Empirically evaluate various choices for int8 quantization of a variety of models, leading
to a quantization workflow proposal.
• MobileNets and BERT are challenging to quantize.
• The workflow combines post-training quantization, partial quantization, and quantization-aware fine-tuning
techniques.
• Some more complex techniques, such as ADMM and distillation, were not required for
int8 quantization of these models.
• However, these techniques should be evaluated when quantizing to even lower-bit
integer representations, which we leave to future work.
Neural Acceleration Study #3
Thank you
Neural Acceleration Study #3