We propose a mixed-precision quantization strategy based on binary codes to represent Transformer weights with an extremely low number of bits (e.g., under 3 bits). For example, we assign different quantization bits to each word in an embedding block based on its statistical properties. Our quantized Transformer model is 11.8× smaller than the baseline model, with a BLEU drop of less than 0.5. We further achieve an 8.3× reduction in runtime memory footprint and a 3.5× speedup (on a Galaxy N10+), so the proposed compression strategy enables efficient on-device NMT.
Extremely Low Bit Transformer Quantization for On-Device NMT
Insoo Chung*, Byeongwook Kim*, Yoonjung Choi, Se Jung Kwon,
Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee
{insooo.chung, byeonguk.kim, yj0807.choi, sejung0.kwon,
dragwon.jeon, bpbs.park, sangha01.kim, dongsoo3.lee}@samsung.com
* Equal contribution
Problem with Transformer's On-Device Deployment
❏ High computation and memory requirements
❏ The Transformer (Vaswani et al., 2017) requires a huge number of parameters
❏ 60.9M parameters for the base configuration with a 32k vocabulary
❏ Hard-to-parallelize inference
❏ Each recurrent decode step depends on the outputs of all prior steps
(→ hard to parallelize)
❏ During recurrent decode steps, decoder and embedding parameters are repeatedly loaded onto the cache
(→ memory-wall problem) (Jeon et al., 2020)
❏ # params in decoder + embedding ≈ 70% of all parameters (base)
❏ Objective: reduce the computational burden while keeping translation quality high!
Fig. 1. The Transformer architecture
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020).
Background - Quantization
❏ Uniform quantization
❏ Represents FP32 parameters with lower-precision parameters (e.g., INT8, INT4)
❏ Minor accuracy loss (for some NNs) with weight quantization
❏ Efficient integer arithmetic additionally requires activation quantization
❏ Activation quantization leads to severe accuracy loss in softmax and layer normalization (Bhandare et al., 2019) 😢
❏ Non-uniform quantization (codebook-based)
❏ Represents parameters with a selected number of centroids
❏ Enables low-bit quantization (for some NNs) and a reduced runtime memory footprint
❏ Inference requires a mandatory dequantization step (Stock et al., 2020) 😢
Bhandare, Aishwarya, et al. "Efficient 8-bit Quantization of Transformer Neural Machine Language Translation Model." arXiv preprint arXiv:1906.00532 (2019).
Stock, Pierre, et al. "And the Bit Goes Down: Revisiting the Quantization of Neural Networks." Eighth International Conference on Learning Representations. 2020.
Fig. 2. INT8 quantization
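The uniform scheme above fits in a few lines. This is a minimal sketch of generic symmetric INT8 weight quantization, not the exact scheme used in any of the cited works:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric uniform quantization: map FP32 weights to INT8 levels."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.RandomState(0).randn(64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each weight's round-off error is at most half a quantization step.
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Note the `dequantize` step: this is exactly the mandatory dequantization cost that codebook-based schemes also pay at inference time.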
Background - Quantization
❏ Non-uniform quantization (based on binary codes ∈ {−1, +1}) (Rastegari et al., 2016)
❏ q-bit quantization based on binary codes:
w ≈ α₁b₁ + α₂b₂ + ... + α_q b_q
w : an FP vector (e.g., a row in a weight matrix)
αᵢ : the i-th FP scale (a scalar)
bᵢ : the i-th binary code vector
❏ Expanded to matrix quantization: W ≈ Σᵢ αᵢ ∘ Bᵢ, where the per-row scales αᵢ are broadcast over Bᵢ by element-wise multiplication
❏ Enables low-bit quantization (for some NNs)
❏ Approximation algorithms: Guo et al., 2017; Xu et al., 2018
❏ We use the greedy approximation for simplicity
❏ Matmul with FP activations x and quantized weights:
❏ Bᵢ · x can be pre-computed and reused (Jeon et al., 2020)
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." European Conference on Computer Vision. Springer, Cham, 2016.
Guo, Yunhui. "A Survey on Methods and Theories of Quantized Neural Networks." arXiv preprint arXiv:1808.04752 (2018).
Xu, Chen, et al. "Alternating Multi-bit Quantization for Recurrent Neural Networks." Sixth International Conference on Learning Representations. 2018.
Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020).
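The greedy approximation mentioned above admits a short sketch: at each step, take the sign of the residual as the next binary code and the mean absolute residual as its scale. This is our reading of the standard greedy method, not the authors' exact implementation:

```python
import numpy as np

def greedy_binary_quantize(w, q):
    """Greedy q-bit binary-code quantization: w ≈ sum_i alpha_i * b_i,
    with each b_i in {-1, +1}^n and scalar scales alpha_i."""
    residual = np.asarray(w, dtype=np.float64).copy()
    alphas, codes = [], []
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)  # best binary code for residual
        alpha = np.abs(residual).mean()         # least-squares scale for that code
        alphas.append(alpha)
        codes.append(b)
        residual -= alpha * b                   # quantize what is left over
    return np.array(alphas), np.array(codes)

w = np.array([0.9, -1.1, 0.4, -0.2])
alphas, codes = greedy_binary_quantize(w, q=3)
w_hat = (alphas[:, None] * codes).sum(axis=0)   # reconstruction
```

Each extra bit quantizes the remaining residual, so the L2 reconstruction error is non-increasing in q.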
Transformer Quantization for On-Device NMT
I. We explore binary-code-based quantization for on-device NMT
II. Our contributions
A. A mixed-precision quantization strategy for extremely low-bit quantization
B. Under-3-bit quantization with a BLEU drop of less than 0.5
C. Significantly reduced runtime memory and a high speedup on an actual mobile device
Quantization Method for Embedding
❏ Word frequency follows a power-law distribution
❏ 1% of words account for 95% of word occurrences in the WMT2014 data
❏ See also Chen et al. (2018)
❏ Assign a higher number of bits to more frequent words
❏ Algorithm 1: assign {b, b-1, ..., 1} bits to words in frequency-based clusters of size ratio r^0 : r^1 : ... : r^(b-1)
❏ With r > 1, smaller clusters hold more frequent words
❏ We achieve a 1.32-bit embedding that performs on par with a 2-bit embedding (all words at 2 bits)
❏ Algorithm 1 applied with b=4, r=4
❏ Assigns {4, 3, 2, 1} bits to words in frequency clusters of size ratio 1:4:16:64
Chen, Patrick, et al. "GroupReduce: Block-wise Low-rank Approximation for Neural Language Model Shrinking." Advances in Neural Information Processing Systems. 2018.
Fig. 3. Power-law distribution of word frequencies
Fig. 4. BLEU score on En2De newstest2013 with quantized embeddings (no retraining, baseline BLEU 25.4)
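Algorithm 1's cluster-and-assign step can be sketched as follows (the function and variable names are ours, not the paper's):

```python
def assign_embedding_bits(num_words, b, r):
    """Split a frequency-sorted vocabulary into b clusters whose sizes follow
    the ratio r^0 : r^1 : ... : r^(b-1); the i-th cluster (most frequent
    words first) is quantized with b-i bits."""
    weights = [r ** i for i in range(b)]
    total = sum(weights)
    sizes = [round(num_words * wt / total) for wt in weights]
    sizes[-1] = num_words - sum(sizes[:-1])   # absorb rounding in last cluster
    bits = []
    for i, size in enumerate(sizes):
        bits.extend([b - i] * size)
    return bits  # bits[k] = quantization bits for the k-th most frequent word

bits = assign_embedding_bits(32_000, b=4, r=4)
avg_bits = sum(bits) / len(bits)  # ~1.32 bits on average, as on the slide
```

With b=4, r=4 the cluster ratio is 1:4:16:64, and the weighted average (4·1 + 3·4 + 2·16 + 1·64) / 85 ≈ 1.32 bits reproduces the slide's figure; with r=1 the clusters are equal-sized and the average is 2.5 bits.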
Quantization Method for Encoder and Decoder
❏ Bit assignment based on sensitivity
❏ Sublayers show different degrees of BLEU degradation when quantized
❏ The model exhibits severe BLEU degradation when Encffn sublayers or Deced sublayers are quantized
❏ See also Michel et al. (2019)
❏ Within the encoder and decoder blocks, we assign a higher number of bits to more sensitive sublayers
❏ The decoder block is assigned fewer bits than the encoder block
❏ Recap: the decoder is hard to parallelize
Michel, Paul, Omer Levy, and Graham Neubig. "Are Sixteen Heads Really Better than One?" Advances in Neural Information Processing Systems. 2019.
Table 1. BLEU on En2De newstest2013 with one type of sublayer quantized (no retraining, baseline BLEU 25.4)
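A hedged sketch of the sensitivity-driven assignment: measure the BLEU drop when each sublayer type is quantized in isolation (cf. Table 1), then hand more bits to the types that degrade most. The sublayer names and drop values below are illustrative placeholders, not the paper's measurements:

```python
def assign_bits_by_sensitivity(bleu_drop, bit_choices=(1, 2, 3, 4)):
    """Rank sublayer types by BLEU degradation when quantized alone;
    the most sensitive type gets the widest bit width."""
    ranked = sorted(bleu_drop, key=bleu_drop.get, reverse=True)
    bits = {}
    for rank, name in enumerate(ranked):
        # walk down the bit widths as sensitivity decreases; floor at the
        # smallest available width once the choices run out
        idx = max(len(bit_choices) - 1 - rank, 0)
        bits[name] = bit_choices[idx]
    return bits

# Illustrative drops only (larger = more sensitive to quantization)
drops = {"enc_ffn": 3.1, "dec_ed": 2.4, "enc_ee": 1.0, "dec_dd": 0.8, "dec_ffn": 0.3}
print(assign_bits_by_sensitivity(drops))
```

In practice one would also cap the decoder-side widths, since the decoder dominates on-device latency regardless of its sensitivity ranking.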
Implementation Details - Retraining
❏ Quantization method
① Train an FP baseline (base configuration)
② Retrain and quantize Emb.
③ Retrain and quantize Emb. + Dec.
④ Retrain and quantize Emb. + Dec. + Enc.
❏ Details
❏ We adopt pNR in our retraining (Fig. 5)
❏ to modulate the strong regularization strength (a pNR of 1000 to 2000 works well 👍)
❏ We set the quantization bits as follows:
❏ Decdd, Deced, Decffn: 2, 3, and 1 bits (1.8-bit decoder)
❏ Encee and Encffn: 3 and 4 bits (3.7-bit encoder) → 2.6-bit Transformer
❏ Emb: Algorithm 1 with b=4, r=1 (2.5-bit embedding)
❏ Bias and layer-normalization parameters are not quantized (more details in the paper)
Fig. 5. Illustration of the pNR variable in action. In our case, the weight regularization applied is binary-code-based quantization.
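As a rough sketch of how quantization-aware retraining with a regularization period can look: full-precision weights take ordinary gradient steps, and only every `pnr` steps are they pulled onto their binary-code quantized values. This is a simplification of the pNR scheme of Lee et al. (2019), with a hard projection standing in for their regularizer; all names here are ours:

```python
import numpy as np

def binary_quantize(w, q=2):
    """Greedy q-bit binary-code quantization of a weight vector."""
    residual = w.astype(np.float64).copy()
    w_hat = np.zeros_like(residual)
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)
        alpha = np.abs(residual).mean()
        w_hat += alpha * b
        residual -= alpha * b
    return w_hat

def retrain_step(w, grad, lr, step, pnr=1000):
    """One retraining step: SGD on FP weights; every `pnr` steps the weights
    are snapped to their quantized approximation (hard stand-in for pNR)."""
    w = w - lr * grad
    if step % pnr == 0:
        w = binary_quantize(w)
    return w
```

A larger `pnr` lets the full-precision weights drift further between projections, which is one way to read the slide's note that 1000 to 2000 steps modulates the otherwise strong regularization.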
Results
❏ We achieve 2.6-bit Transformer models
❏ Within 0.5 BLEU of the baseline
❏ 11.8× compression in actual model size
← 1st retraining (Emb.)
← 2nd retraining (Emb. + Dec.)
← Last retraining (Emb. + Dec. + Enc.)
Table 2. Average quantization bits, BLEU scores, and model sizes of our quantized models. Note that the 2.5-bit embeddings are quantized using Algorithm 1 with b=4, r=1.
Results
❏ We achieve a 3.5× speedup and 8.3× runtime memory compression
❏ Most of the speedup comes from embedding and decoder quantization
❏ Recap: they are involved in decode steps with limited parallelization
❏ Encoder quantization contributes little to the speedup (the encoder is highly parallelizable)
Table 3. Speedup and runtime memory reduction achieved with quantization. On-device inference tested on a Galaxy N10+ with the En2De model translating newstest2013 using our C++ code.
More Analysis
❏ Memory inefficiency in the FP Transformer
Table 4. FLOPs required by a translation run, and the actual on-device latency attributable to each of the Transformer blocks. A translation run denotes a 30-word-input to 30-word-output translation. Tested on a Galaxy N10+.
❏ Comparison with other quantization strategies
Table 5. Comparison with other quantization strategies. Bhandare et al. (2019) report a 1.5× speedup from quantization, and Prato et al. (2019) report no speed measurements.
More results and analysis are available in our paper!
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
Bhandare, Aishwarya, et al. "Efficient 8-bit Quantization of Transformer Neural Machine Language Translation Model." arXiv preprint arXiv:1906.00532 (2019).
Prato, Gabriele, Ella Charlaix, and Mehdi Rezagholizadeh. "Fully Quantized Transformer for Improved Translation." arXiv preprint arXiv:1910.10485 (2019).
References - Figures and Tables
❏ Fig. 1. The Transformer architecture from:
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
❏ Fig. 2. INT8 quantization from:
"Quantization - Neural Network Distiller", https://nervanasystems.github.io/distiller/algo_quantization.html.
❏ Fig. 5. Illustration of the pNR variable in action from:
Lee, Dongsoo, et al. "Decoupling Weight Regularization from Batch Size for Model Compression." Sep. 2019, openreview.net, https://openreview.net/forum?id=BJlaG0VFDH.
❏ All other tables and figures from:
Chung, Insoo, Byeongwook Kim, et al. "Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation." Findings of the Association for Computational Linguistics: EMNLP 2020.