FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer
Sungchul Kim
2022. 07. 28
https://arxiv.org/pdf/2111.13824.pdf
Contents
▪ Introduction
▪ Related Work
▪ Proposed Method
• Power-of-Two Factor for LayerNorm Quantization
• Log-Int-Softmax for Softmax Quantization
▪ Experiments
▪ Conclusions
Introduction
▪ RoBERTa for sentence classification
▪ There are many quantization approaches for CNNs.
• “Post-Training Quantization for Vision Transformer”, NeurIPS 2021.
• Found a significant accuracy degradation when quantizing LayerNorm and Softmax of ViT
• Retained LayerNorm and Softmax as floating-point units
→ high computational cost and low inference speed!
Original model (FP32)                  : 0.8773
PTQ from original model (INT8)         : 0.3648 (-0.5125)
PTQ with ignore_scope=layernorm (INT8) : 0.8551 (-0.0222)
Introduction
▪ In this paper,
• There is serious inter-channel variation in the LayerNorm inputs; some channel ranges even exceed 40x the median.
• Power-of-Two Factor (PTF) to quantize the inputs of LayerNorm
• The values of the attention map have an extremely non-uniform distribution: most values are clustered in 0 ∼ 0.01, and a few high attention values are close to 1.
• Log-Int-Softmax (LIS) to provide higher quantization resolution for small values
• LIS also enables more efficient integer-only inference for Softmax
Related Work
▪ Vision Transformer
• ViT, Swin Transformer, …
• However, these models come with a large number of parameters and high computational overhead.
ViT : https://arxiv.org/pdf/2010.11929.pdf
SwinTransformer : https://arxiv.org/pdf/2103.14030.pdf
Related Work
▪ Vision Transformer w/ Efficient Model Designing
• LeViT
• Downsampling, patch descriptors, and a redesign of the Attention-MLP block
• DynamicViT
• Prune redundant tokens progressively and dynamically
• Evo-ViT
• A slow-fast updating mechanism
LeViT : https://arxiv.org/pdf/2104.01136.pdf
DynamicViT : https://arxiv.org/pdf/2106.02034.pdf
Evo-ViT : https://arxiv.org/pdf/2108.01390.pdf
Related Work
▪ Network Quantization
• Quantization-Aware Training (QAT)
• Trains the network to achieve aggressively low-bit (e.g., 2-bit) quantization with promising performance
• Often requires expert knowledge and huge GPU resources for training or fine-tuning
• Post-Training Quantization (PTQ)
• Training free!
• OMSE : determines the activation value range by minimizing the quantization error
• AdaRound : a rounding mechanism that adapts to the data and the task loss
• PTQ for ViT : quantizes ViT with similarity-aware and ranking-aware strategies,
but does not quantize Softmax and LayerNorm
OMSE : https://arxiv.org/pdf/1902.06822.pdf
AdaRound : https://arxiv.org/pdf/2004.10568v2.pdf
PTQ for ViT : https://arxiv.org/pdf/2106.14156.pdf
Proposed Method
▪ Preliminary
• 𝑏 : the quantization bit-width
• Q(X | 𝑏) : the quantizer (X ∈ ℝ)
• There are various quantizers Q(X | 𝑏), e.g., uniform and log2 quantization (see the sketch below)
• 𝑠 : scale / 𝑧𝑝 : zero-point
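For reference, a sketch of the two quantizers in their usual asymmetric-uniform and log2 forms, written to match the notation 𝑏, 𝑠, 𝑧𝑝 above; the exact clipping and rounding conventions here are my assumption rather than copied from the slide.

```latex
% Uniform quantization (asymmetric, b-bit)
Q(X \mid b) = \mathrm{clip}\!\left(\left\lfloor \tfrac{X}{s} \right\rceil + zp,\ 0,\ 2^{b}-1\right),
\qquad s = \frac{\max(X)-\min(X)}{2^{b}-1},
\qquad zp = \mathrm{clip}\!\left(\left\lfloor -\tfrac{\min(X)}{s} \right\rceil,\ 0,\ 2^{b}-1\right)

% Log2 quantization (for non-negative X, e.g. Softmax outputs in (0, 1])
Q(X \mid b) = \mathrm{clip}\!\left(\left\lfloor -\log_{2} X \right\rceil,\ 0,\ 2^{b}-1\right)
```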
Proposed Method
▪ Power-of-Two Factor for LayerNorm Quantization
• LayerNorm(X) = (X − 𝜇_X) / √(𝜎_X² + 𝜖) ⋅ 𝛾 + 𝛽
• Unlike BatchNorm, LayerNorm cannot be folded into the previous layers, so we have to quantize it separately.
• There is a significant performance degradation while applying PTQ on LayerNorm!
• There is a serious inter-channel variation. (Left)
• I also observed this issue when inspecting RoBERTa. (Right)
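To make the inter-channel variation above concrete, here is a minimal sketch of how per-channel ranges of a LayerNorm input can be measured; the synthetic tensor and the 10x threshold are illustrative assumptions, not the paper's numbers.

```python
import torch

def channel_range_stats(x):
    """x: LayerNorm input of shape (B, L, C).
    Returns the per-channel range (max - min) and its ratio to the median range."""
    flat = x.reshape(-1, x.shape[-1])                            # (B*L, C)
    ch_range = flat.max(dim=0).values - flat.min(dim=0).values   # (C,)
    return ch_range, ch_range / ch_range.median()

# Synthetic input with a strong inter-channel spread, for illustration only.
x = torch.randn(2, 196, 768) * torch.logspace(-1, 1.5, 768)
rng, ratio = channel_range_stats(x)
print((ratio > 10).sum().item(), "of 768 channels exceed 10x the median range")
```

Channels whose range far exceeds the median are exactly the ones a single layer-wise scale cannot resolve well, which is what PTF addresses.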
Proposed Method
▪ Power-of-Two Factor for LayerNorm Quantization
• Power-of-Two Factor (PTF)
• Equip different channels with different factors, rather than different quantization parameters
• X ∈ ℝ^(B×L×C), 𝑠, 𝑧𝑝 ∈ ℝ^1, and the PTF 𝛼 ∈ ℕ^C
• Set 𝐾 = 3 as default
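Reading the bullets above together, the per-channel PTF quantizer can be written roughly as follows; this is my reconstruction from the slide (with 𝛼_c bounded by K), using the notation from the Preliminary.

```latex
X_Q^{(c)} = \mathrm{clip}\!\left(\left\lfloor \frac{X^{(c)}}{2^{\alpha_c}\, s} \right\rceil + zp,\ 0,\ 2^{b}-1\right),
\qquad \alpha_c \in \{0, 1, \ldots, K\},\quad c = 1, \ldots, C
```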
Proposed Method
▪ Power-of-Two Factor for LayerNorm Quantization
• Power-of-Two Factor (PTF)
• During inference, the layer-wise parameters 𝑠 and 𝑧𝑝 can be extracted, so the computation of 𝜇 and 𝜎 can be done in the integer domain rather than in floating point.
• Thanks to the nature of powers of two, the PTF 𝛼 can be efficiently combined with layer-wise quantization via the BitShift operator, as in the sketch below.
ex) 𝑥 ≪ 3 = 𝑥 ⋅ 2³,  𝑥 ≫ 3 = 𝑥 / 2³
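A minimal sketch of the integer-domain statistics described above, under my reading of the slide (not the paper's code): left-shifting by 𝛼 aligns every channel to the shared scale 𝑠, so the LayerNorm mean and variance can be accumulated on integers and the single float scale enters only at the end.

```python
import torch

def ptf_int_statistics(x_q, zp, alpha):
    """x_q:   (B, L, C) quantized LayerNorm input sharing one layer-wise scale s and zero-point zp
       alpha: (C,) Power-of-Two Factors (small non-negative integers, e.g. 0..3)."""
    x_int = (x_q.to(torch.int64) - int(zp)) << alpha.to(torch.int64)   # ~ X / s, integer domain
    sum_int = x_int.sum(dim=-1, keepdim=True)                 # = C * mu / s
    sq_sum_int = (x_int * x_int).sum(dim=-1, keepdim=True)    # = C * E[X^2] / s^2
    # mu      = s * sum_int / C
    # sigma^2 = s^2 * (sq_sum_int / C - (sum_int / C) ** 2)
    return x_int, sum_int, sq_sum_int
```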
Proposed Method
▪ Log-Int-Softmax for Softmax Quantization
• Multi-head Self-Attention (MSA)
• A higher input resolution and a smaller patch size were shown to benefit model performance (in the ViT paper),
• but the storage and computation of attention maps then become the bottleneck, which directly affects throughput and latency.
Proposed Method
▪ Log-Int-Softmax for Softmax Quantization
• Log2 Quantization for Attention Map
• When quantizing the attention maps from 8-bit down to 4-bit with uniform quantization, all vision transformers show a severe performance drop.
• e.g.) DeiT-T top-1 accuracy : 71.74% (8-bit) → 8.69% (4-bit)
Proposed Method
▪ Log-Int-Softmax for Softmax Quantization
• Log2 Quantization for Attention Map
• Log2 quantization proves to be a good fit for MSA in two respects:
1. The fixed output range (0, 1) of Softmax makes the log2 function calibration-free
2. It also converts the MatMul between the quantized attention map (Attn_Q) and values (V_Q) into a BitShift (see the sketch below)
• 𝑁 = 2^𝑏 − 1
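A sketch of point 2 above (my reconstruction, not the paper's kernel): since a log2-quantized attention value represents roughly 2^(−Attn_Q), multiplying by it amounts to right-shifting the integer values; the scale s_v and zero-point zp_v of V are notation I am assuming here.

```python
import torch

def attn_matmul_as_shift(attn_q, v_q, s_v, zp_v):
    """attn_q: (H, L, L) log2-quantized attention map, i.e. Attn ~= 2 ** (-attn_q)
       v_q:    (H, L, D) uniformly quantized values with scale s_v, zero-point zp_v
       The MatMul Attn @ V becomes an accumulation of right-shifted integers
       (arithmetic shifts of negative values round toward -inf, a small approximation)."""
    v_int = v_q.to(torch.int64) - int(zp_v)                               # (H, L, D)
    shifted = v_int.unsqueeze(1) >> attn_q.to(torch.int64).unsqueeze(-1)  # (H, L, L, D)
    return s_v * shifted.sum(dim=2).float()                               # ~= Attn @ V_hat
```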
Proposed Method
▪ Log-Int-Softmax for Softmax Quantization
• Integer-only Inference
• Log-Int-Softmax
• Combine log2 quantization with i-exp, a polynomial approximation of the exponential function (sketched below)
• 𝑁 = 2^𝑏 − 1
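A float prototype of the LIS idea described above, as I understand it: the real kernel works on scaled integers, the i-exp polynomial coefficients follow the I-BERT paper that introduced i-exp, and the rest of the plumbing here is my assumption.

```python
import numpy as np

def i_exp(x):
    """Polynomial approximation of exp(x) for x <= 0, i-exp style:
    decompose x = -z*ln2 + p with p in (-ln2, 0], fit exp(p) with a 2nd-order
    polynomial, then multiply by 2**(-z) (a right shift in the integer kernel)."""
    ln2 = np.log(2.0)
    z = np.floor(-x / ln2)
    p = x + z * ln2
    return (0.3585 * (p + 1.353) ** 2 + 0.344) * 2.0 ** (-z)

def log_int_softmax(scores, b=4):
    """Approximate exp with i_exp, then log2-quantize the softmax output so that
    Attn ~= 2 ** (-attn_q) with attn_q in [0, N], N = 2**b - 1."""
    N = 2 ** b - 1
    e = i_exp(scores - scores.max(axis=-1, keepdims=True))
    attn_q = np.round(np.log2(e.sum(axis=-1, keepdims=True)) - np.log2(e))
    return np.clip(attn_q, 0, N).astype(np.int32)   # dequantize as 2.0 ** (-attn_q)
```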
Experiments
▪ Implementation Details
• Dataset
• Image classification : ImageNet
• Object detection : COCO
• Calibration : randomly sampled 1,000 training images
• Evaluation : validation set
• Quantization
• Weights : symmetric channel-wise quantization
• Fixed as MinMax for a fair comparison
• Activations : asymmetric layer-wise quantization
• K in Power-of-Two Factor = 3
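A sketch of the two quantization settings above, assuming MinMax calibration (illustrative, not the paper's code):

```python
import torch

def quantize_weights_sym_per_channel(w, b=8):
    """Symmetric channel-wise weight quantization (one scale per output channel), MinMax-style."""
    qmax = 2 ** (b - 1) - 1
    s = w.abs().amax(dim=tuple(range(1, w.dim())), keepdim=True) / qmax
    return torch.clamp(torch.round(w / s), -qmax - 1, qmax), s

def quantize_acts_asym_per_layer(x, b=8):
    """Asymmetric layer-wise activation quantization (a single scale and zero-point per tensor)."""
    qmax = 2 ** b - 1
    s = (x.max() - x.min()) / qmax
    zp = torch.clamp(torch.round(-x.min() / s), 0, qmax)
    return torch.clamp(torch.round(x / s) + zp, 0, qmax), s, zp
```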
Experiments
▪ Comparison with State-of-the-art Methods
• MinMax : use the minimum and maximum values observed on the calibration activations as the quantization range
• EMA : Collect [a; b] ranges seen on activations during training and aggregate them via EMA
• Percentile : Discard outlier activation values based on percentiles and clamp quantized activations and gradients
• OMSE : Optimal (Minimum) MSE Quantization
• Bit-Split : Split integers into multiple bits, then optimize each bit, and finally stitch all bits back to integers
• PTQ for ViT : similarity-aware quantization for linear operations and ranking-aware quantization for self-attention
• It is the closest work to this paper; however, it does not quantize LayerNorm and Softmax
EMA : https://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
Percentile : https://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Fully_Quantized_Network_for_Object_Detection_CVPR_2019_paper.pdf
OMSE : https://arxiv.org/pdf/1902.06822.pdf
BitSplit : http://proceedings.mlr.press/v119/wang20c/wang20c.pdf
PTQ for ViT : https://arxiv.org/pdf/2106.14156.pdf
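For intuition on the first two calibration baselines above, a minimal sketch; the 99.9 percentile cutoff is an illustrative assumption, not a value from the papers.

```python
import numpy as np

def minmax_range(acts):
    """MinMax: the quantization range is simply [min, max] of the calibration
    activations, which makes it sensitive to outliers."""
    return acts.min(), acts.max()

def percentile_range(acts, p=99.9):
    """Percentile: discard outlier activations by clipping the range to the
    (100-p)-th and p-th percentiles."""
    return np.percentile(acts, 100.0 - p), np.percentile(acts, p)
```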
Experiments
▪ Comparison with State-of-the-art Methods
• Image Classification on ImageNet
• FQ-ViT significantly outperforms PTQ for ViT
• FQ-ViT achieves 81.20% accuracy on DeiT-B with all modules quantized to 8-bit, and still achieves
80.85% accuracy when the attention maps are compressed to 4-bit
Experiments
▪ Comparison with State-of-the-art Methods
• Object Detection on COCO
• FQ-ViT significantly improves the quantization accuracy and achieves 47.2 mAP on Mask R-CNN (Swin-S)
and 50.8 mAP on Cascade Mask R-CNN (Swin-S) with 8-bit on weights/activations and 4-bit on attention maps
Experiments
▪ Comparison with State-of-the-art Methods
• Ablation Studies
• Baseline #8 : fully quantized by MinMax with 8-bit weights, activations, and attention maps
• Baseline #4 : the same, but with a lower bit-width (4-bit) on the attention maps
Experiments
▪ Comparison with State-of-the-art Methods
• Visualization of Quantized Attention Map
Conclusions
▪ Power-of-Two Factor (PTF)
• to deal with the serious inter-channel variations in the inputs of LayerNorm
▪ Log-Int-Softmax (LIS)
• to implement 4-bit quantization of attention maps and utilize the BitShift operator instead of
MatMul during inference
