FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer
Sungchul Kim
2022. 07. 28
https://arxiv.org/pdf/2111.13824.pdf
Contents
▪ Introduction
▪ Related Work
▪ Proposed Method
• Power-of-Two Factor for LayerNorm Quantization
• Log-Int-Softmax for Softmax Quantization
▪ Experiments
▪ Conclusions
Introduction
▪ RoBERTa for sentence classification
▪ There are many quantization approaches for CNNs.
• “Post-Training Quantization for Vision Transformer”, NeurIPS 2021.
• Found a significant accuracy degradation when quantizing LayerNorm and Softmax of ViT
• Retained LayerNorm and Softmax as floating-point units
→ high computational cost and low inference speed!
Original model (FP32)                  : 0.8773
PTQ from original model (INT8)         : 0.3648 (-0.5125)
PTQ with ignore_scope=layernorm (INT8) : 0.8551 (-0.0222)
Introduction
▪ In this paper,
• There is serious inter-channel variation in the LayerNorm inputs; some channel ranges even exceed 40x the median.
• Power-of-Two Factor (PTF) to quantize the inputs of LayerNorm
• The values of the attention map have an extremely non-uniform distribution: most values are clustered in 0 ∼ 0.01, and a few high attention values are close to 1.
• Log-Int-Softmax (LIS) to provide higher quantization resolution for small values
• LIS also enables more efficient integer-only inference for Softmax
Related Work
▪ Vision Transformer
• ViT, Swin Transformer, …
• However, these models come with a large number of parameters and high computational overhead.
ViT : https://arxiv.org/pdf/2010.11929.pdf
SwinTransformer : https://arxiv.org/pdf/2103.14030.pdf
Related Work
▪ Vision Transformer w/ Efficient Model Designing
• LeViT
• Downsampling, patch descriptors, and a redesign of the Attention-MLP block
• DynamicViT
• Prune redundant tokens progressively and dynamically
• Evo-ViT
• A slow-fast updating mechanism
LeViT : https://arxiv.org/pdf/2104.01136.pdf
DynamicViT : https://arxiv.org/pdf/2106.02034.pdf
Evo-ViT : https://arxiv.org/pdf/2108.01390.pdf
Related Work
▪ Network Quantization
• Quantization-Aware Training (QAT)
• Trains the network to achieve aggressively low-bit (e.g., 2-bit) quantization with promising performance
• Often requires expert knowledge and huge GPU resources for training or fine-tuning
• Post-Training Quantization (PTQ)
• Training free!
• OMSE : determines the activation value range by minimizing the quantization error
• AdaRound : a rounding mechanism that adapts to the data and the task loss
• PTQ for ViT : quantizes ViT with similarity-aware and ranking-aware strategies,
but does not quantize Softmax and LayerNorm
OMSE : https://arxiv.org/pdf/1902.06822.pdf
AdaRound : https://arxiv.org/pdf/2004.10568v2.pdf
PTQ for ViT : https://arxiv.org/pdf/2106.14156.pdf
Proposed Method
▪ Preliminary
• 𝑏 : the quantization bit-width
• Q(X | 𝑏) : the quantizer (X ∈ ℝ)
• There are various quantizers Q(X | 𝑏), e.g., uniform and log2 quantization (see the sketch below)
• 𝑠 : scale / 𝑧𝑝 : zero-point
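For reference, a sketch of the two quantizers in their usual asymmetric-uniform and log2 forms, written to match the notation 𝑏, 𝑠, 𝑧𝑝 above; the exact clipping and rounding conventions here are my assumption rather than copied from the slide.

```latex
% Uniform quantization (asymmetric, b-bit)
Q(X \mid b) = \mathrm{clip}\!\left(\left\lfloor \tfrac{X}{s} \right\rceil + zp,\ 0,\ 2^{b}-1\right),
\qquad s = \frac{\max(X)-\min(X)}{2^{b}-1},
\qquad zp = \mathrm{clip}\!\left(\left\lfloor -\tfrac{\min(X)}{s} \right\rceil,\ 0,\ 2^{b}-1\right)

% Log2 quantization (for non-negative X, e.g. Softmax outputs in (0, 1])
Q(X \mid b) = \mathrm{clip}\!\left(\left\lfloor -\log_{2} X \right\rceil,\ 0,\ 2^{b}-1\right)
```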
Proposed Method
▪ Power-of-Two Factor for LayerNorm Quantization
• LayerNorm(X) = (X − 𝜇_X) / √(𝜎_X² + 𝜖) ⋅ 𝛾 + 𝛽
• Unlike BatchNorm, LayerNorm cannot be folded into the previous layers, so we have to quantize it separately.
• There is a significant performance degradation while applying PTQ on LayerNorm!
• There is a serious inter-channel variation. (Left)
• I also observed this issue when inspecting RoBERTa. (Right)
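To make the inter-channel variation above concrete, here is a minimal sketch of how per-channel ranges of a LayerNorm input can be measured; the synthetic tensor and the 10x threshold are illustrative assumptions, not the paper's numbers.

```python
import torch

def channel_range_stats(x):
    """x: LayerNorm input of shape (B, L, C).
    Returns the per-channel range (max - min) and its ratio to the median range."""
    flat = x.reshape(-1, x.shape[-1])                            # (B*L, C)
    ch_range = flat.max(dim=0).values - flat.min(dim=0).values   # (C,)
    return ch_range, ch_range / ch_range.median()

# Synthetic input with a strong inter-channel spread, for illustration only.
x = torch.randn(2, 196, 768) * torch.logspace(-1, 1.5, 768)
rng, ratio = channel_range_stats(x)
print((ratio > 10).sum().item(), "of 768 channels exceed 10x the median range")
```

Channels whose range far exceeds the median are exactly the ones a single layer-wise scale cannot resolve well, which is what PTF addresses.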
Proposed Method
▪ Power-of-Two Factor for LayerNorm Quantization
• Power-of-Two Factor (PTF)
• Equip different channels with different factors, rather than different quantization parameters
• X ∈ ℝ^(B×L×C), 𝑠, 𝑧𝑝 ∈ ℝ^1, and the PTF 𝛼 ∈ ℕ^C
• Set 𝐾 = 3 as default
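Reading the bullets above together, the per-channel PTF quantizer can be written roughly as follows; this is my reconstruction from the slide (with 𝛼_c bounded by K), using the notation from the Preliminary.

```latex
X_Q^{(c)} = \mathrm{clip}\!\left(\left\lfloor \frac{X^{(c)}}{2^{\alpha_c}\, s} \right\rceil + zp,\ 0,\ 2^{b}-1\right),
\qquad \alpha_c \in \{0, 1, \ldots, K\},\quad c = 1, \ldots, C
```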
Proposed Method
▪ Power-of-Two Factor for LayerNorm Quantization
• Power-of-Two Factor (PTF)
• During inference, the layer-wise parameters 𝑠 and 𝑧𝑝 can be extracted, so the computation of 𝜇 and 𝜎 can be done in the integer domain rather than in floating point.
• Thanks to the nature of powers of two, the PTF 𝛼 can be efficiently combined with layer-wise quantization via the BitShift operator, as in the sketch below.
ex) 𝑥 ≪ 3 = 𝑥 ⋅ 2³,  𝑥 ≫ 3 = 𝑥 / 2³
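A minimal sketch of the integer-domain statistics described above, under my reading of the slide (not the paper's code): left-shifting by 𝛼 aligns every channel to the shared scale 𝑠, so the LayerNorm mean and variance can be accumulated on integers and the single float scale enters only at the end.

```python
import torch

def ptf_int_statistics(x_q, zp, alpha):
    """x_q:   (B, L, C) quantized LayerNorm input sharing one layer-wise scale s and zero-point zp
       alpha: (C,) Power-of-Two Factors (small non-negative integers, e.g. 0..3)."""
    x_int = (x_q.to(torch.int64) - int(zp)) << alpha.to(torch.int64)   # ~ X / s, integer domain
    sum_int = x_int.sum(dim=-1, keepdim=True)                 # = C * mu / s
    sq_sum_int = (x_int * x_int).sum(dim=-1, keepdim=True)    # = C * E[X^2] / s^2
    # mu      = s * sum_int / C
    # sigma^2 = s^2 * (sq_sum_int / C - (sum_int / C) ** 2)
    return x_int, sum_int, sq_sum_int
```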
Proposed Method
▪ Log-Int-Softmax for Softmax Quantization
• Multi-head Self-Attention (MSA)
• A higher input resolution and a smaller patch size were shown to benefit model performance (in the ViT paper),
• but the storage and computation of attention maps then become the bottleneck, which directly affects throughput and latency.
Proposed Method
▪ Log-Int-Softmax for Softmax Quantization
• Log2 Quantization for Attention Map
• When quantizing the attention maps from 8-bit down to 4-bit with uniform quantization, all vision transformers show a severe performance drop.
• e.g.) DeiT-T top-1 accuracy : 71.74% (8-bit) → 8.69% (4-bit)
Proposed Method
▪ Log-Int-Softmax for Softmax Quantization
• Log2 Quantization for Attention Map
• Log2 quantization proves to be a good fit for MSA in two respects:
1. The fixed output range (0, 1) of Softmax makes the log2 function calibration-free
2. It also converts the MatMul between the quantized attention map (Attn_Q) and values (V_Q) into a BitShift (see the sketch below)
• 𝑁 = 2^𝑏 − 1
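A sketch of point 2 above (my reconstruction, not the paper's kernel): since a log2-quantized attention value represents roughly 2^(−Attn_Q), multiplying by it amounts to right-shifting the integer values; the scale s_v and zero-point zp_v of V are notation I am assuming here.

```python
import torch

def attn_matmul_as_shift(attn_q, v_q, s_v, zp_v):
    """attn_q: (H, L, L) log2-quantized attention map, i.e. Attn ~= 2 ** (-attn_q)
       v_q:    (H, L, D) uniformly quantized values with scale s_v, zero-point zp_v
       The MatMul Attn @ V becomes an accumulation of right-shifted integers
       (arithmetic shifts of negative values round toward -inf, a small approximation)."""
    v_int = v_q.to(torch.int64) - int(zp_v)                               # (H, L, D)
    shifted = v_int.unsqueeze(1) >> attn_q.to(torch.int64).unsqueeze(-1)  # (H, L, L, D)
    return s_v * shifted.sum(dim=2).float()                               # ~= Attn @ V_hat
```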
Proposed Method
▪ Log-Int-Softmax for Softmax Quantization
• Integer-only Inference
• Log-Int-Softmax
• Combine log2 quantization with i-exp, a polynomial approximation of the exponential function (sketched below)
• 𝑁 = 2^𝑏 − 1
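A float prototype of the LIS idea described above, as I understand it: the real kernel works on scaled integers, the i-exp polynomial coefficients follow the I-BERT paper that introduced i-exp, and the rest of the plumbing here is my assumption.

```python
import numpy as np

def i_exp(x):
    """Polynomial approximation of exp(x) for x <= 0, i-exp style:
    decompose x = -z*ln2 + p with p in (-ln2, 0], fit exp(p) with a 2nd-order
    polynomial, then multiply by 2**(-z) (a right shift in the integer kernel)."""
    ln2 = np.log(2.0)
    z = np.floor(-x / ln2)
    p = x + z * ln2
    return (0.3585 * (p + 1.353) ** 2 + 0.344) * 2.0 ** (-z)

def log_int_softmax(scores, b=4):
    """Approximate exp with i_exp, then log2-quantize the softmax output so that
    Attn ~= 2 ** (-attn_q) with attn_q in [0, N], N = 2**b - 1."""
    N = 2 ** b - 1
    e = i_exp(scores - scores.max(axis=-1, keepdims=True))
    attn_q = np.round(np.log2(e.sum(axis=-1, keepdims=True)) - np.log2(e))
    return np.clip(attn_q, 0, N).astype(np.int32)   # dequantize as 2.0 ** (-attn_q)
```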
Experiments
▪ Implementation Details
• Dataset
• Image classification : ImageNet
• Object detection : COCO
• Calibration : randomly sampled 1,000 training images
• Evaluation : validation set
• Quantization
• Weights : symmetric channel-wise quantization
• Fixed as MinMax for a fair comparison
• Activations : asymmetric layer-wise quantization
• K in Power-of-Two Factor = 3
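A sketch of the two quantization settings above, assuming MinMax calibration (illustrative, not the paper's code):

```python
import torch

def quantize_weights_sym_per_channel(w, b=8):
    """Symmetric channel-wise weight quantization (one scale per output channel), MinMax-style."""
    qmax = 2 ** (b - 1) - 1
    s = w.abs().amax(dim=tuple(range(1, w.dim())), keepdim=True) / qmax
    return torch.clamp(torch.round(w / s), -qmax - 1, qmax), s

def quantize_acts_asym_per_layer(x, b=8):
    """Asymmetric layer-wise activation quantization (a single scale and zero-point per tensor)."""
    qmax = 2 ** b - 1
    s = (x.max() - x.min()) / qmax
    zp = torch.clamp(torch.round(-x.min() / s), 0, qmax)
    return torch.clamp(torch.round(x / s) + zp, 0, qmax), s, zp
```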
Experiments
▪ Comparison with State-of-the-art Methods
• MinMax : use the minimum and maximum values observed on the calibration activations as the quantization range
• EMA : Collect [a; b] ranges seen on activations during training and aggregate them via EMA
• Percentile : Discard outlier activation values based on percentiles and clamp quantized activations and gradients
• OMSE : Optimal (Minimum) MSE Quantization
• Bit-Split : Split integers into multiple bits, then optimize each bit, and finally stitch all bits back to integers
• PTQ for ViT : similarity-aware quantization for linear operations and ranking-aware quantization for self-attention
• It is the closest work to this paper; however, it does not quantize LayerNorm and Softmax
EMA : https://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
Percentile : https://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Fully_Quantized_Network_for_Object_Detection_CVPR_2019_paper.pdf
OMSE : https://arxiv.org/pdf/1902.06822.pdf
BitSplit : http://proceedings.mlr.press/v119/wang20c/wang20c.pdf
PTQ for ViT : https://arxiv.org/pdf/2106.14156.pdf
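For intuition on the first two calibration baselines above, a minimal sketch; the 99.9 percentile cutoff is an illustrative assumption, not a value from the papers.

```python
import numpy as np

def minmax_range(acts):
    """MinMax: the quantization range is simply [min, max] of the calibration
    activations, which makes it sensitive to outliers."""
    return acts.min(), acts.max()

def percentile_range(acts, p=99.9):
    """Percentile: discard outlier activations by clipping the range to the
    (100-p)-th and p-th percentiles."""
    return np.percentile(acts, 100.0 - p), np.percentile(acts, p)
```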
Experiments
▪ Comparison with State-of-the-art Methods
• Image Classification on ImageNet
• FQ-ViT significantly outperforms PTQ for ViT
• FQ-ViT achieves 81.20% accuracy on DeiT-B with all modules quantized to 8-bit, and still achieves
80.85% accuracy when the attention maps are compressed to 4-bit
Experiments
▪ Comparison with State-of-the-art Methods
• Object Detection on COCO
• FQ-ViT significantly improves the quantization accuracy and achieves 47.2 mAP on Mask R-CNN (Swin-S)
and 50.8 mAP on Cascade Mask R-CNN (Swin-S) with 8-bit on weights/activations and 4-bit on attention maps
Experiments
▪ Comparison with State-of-the-art Methods
• Ablation Studies
• Baseline #8 : fully quantized by MinMax with 8-bit weights, activations, and attention maps
• Baseline #4 : the same, but with a lower bit-width (4-bit) on the attention maps
Experiments
▪ Comparison with State-of-the-art Methods
• Visualization of Quantized Attention Map
Conclusions
▪ Power-of-Two Factor (PTF)
• to deal with the serious inter-channel variations in the inputs of LayerNorm
▪ Log-Int-Softmax (LIS)
• to implement 4-bit quantization of attention maps and utilize the BitShift operator instead of
MatMul during inference
