We propose a mixed-precision quantization strategy based on binary codes to represent Transformer weights with an extremely low number of bits (e.g., under 3 bits). For example, we assign different quantization bits to each word in an embedding block based on its statistical properties. Our quantized Transformer model is 11.8× smaller than the baseline model, with a BLEU drop of less than 0.5. We further achieve an 8.3× reduction in runtime memory footprint and a 3.5× speedup (on a Galaxy N10+), so the proposed compression strategy enables efficient on-device NMT.
Extremely Low Bit Transformer Quantization for On-Device NMT
Insoo Chung*, Byeongwook Kim*, Yoonjung Choi, Se Jung Kwon,
Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee
{insooo.chung, byeonguk.kim, yj0807.choi, sejung0.kwon,
dragwon.jeon, bpbs.park, sangha01.kim, dongsoo3.lee}@samsung.com
* Equal contribution
Problem with Transformer's On-Device Deployment
❏ High computation and memory requirements
❏ The Transformer (Vaswani et al., 2017) requires a huge number of parameters
❏ 60.9M parameters for the base configuration with a 32k vocabulary
❏ Hard-to-parallelize inference
❏ Each recurrent decode step depends on the outputs of all prior steps
(→ hard to parallelize)
❏ During recurrent decode steps, decoder and embedding parameters are repeatedly loaded onto the cache
(→ memory-wall problem) (Jeon et al., 2020)
❏ # params in decoder + embedding ≈ 70% of all parameters (base)
❏ Objective: reduce the computational burden while keeping translation quality high!
Fig. 1. The Transformer architecture
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020).
Background - Quantization
❏ Uniform quantization
❏ Represents FP32 parameters with lower-precision parameters (e.g., INT8, INT4)
❏ Minor accuracy loss (for some NNs) with weight quantization
❏ Efficient integer arithmetic additionally requires activation quantization
❏ Activation quantization leads to severe accuracy loss in softmax and layer normalization (Bhandare et al., 2019) 😢
❏ Non-uniform quantization (codebook-based)
❏ Represents parameters with a selected number of centroids
❏ Enables low-bit quantization (for some NNs) and a reduced runtime memory footprint
❏ Inference requires a mandatory dequantization step (Stock et al., 2020) 😢
Bhandare, Aishwarya, et al. "Efficient 8-bit Quantization of Transformer Neural Machine Language Translation Model." arXiv preprint arXiv:1906.00532 (2019).
Stock, Pierre, et al. "And the Bit Goes Down: Revisiting the Quantization of Neural Networks." Eighth International Conference on Learning Representations. 2020.
Fig. 2. INT8 quantization
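The uniform scheme above fits in a few lines. This is a minimal sketch of generic symmetric INT8 weight quantization, not the exact scheme used in any of the cited works:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric uniform quantization: map FP32 weights to INT8 levels."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.RandomState(0).randn(64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each weight's round-off error is at most half a quantization step.
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Note the `dequantize` step: this is exactly the mandatory dequantization cost that codebook-based schemes also pay at inference time.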
Background - Quantization
❏ Non-uniform quantization (based on binary codes ∈ {−1, +1}) (Rastegari et al., 2016)
❏ q-bit quantization based on binary codes:
w ≈ α₁b₁ + α₂b₂ + ... + α_q b_q
w : an FP vector (e.g., a row in a weight matrix)
αᵢ : the i-th FP scale (a scalar)
bᵢ : the i-th binary code vector
❏ Expanded to matrix quantization: W ≈ Σᵢ αᵢ ∘ Bᵢ, where the per-row scales αᵢ are broadcast over Bᵢ by element-wise multiplication
❏ Enables low-bit quantization (for some NNs)
❏ Approximation algorithms: Guo et al., 2017; Xu et al., 2018
❏ We use the greedy approximation for simplicity
❏ Matmul with FP activations x and quantized weights:
❏ Bᵢ · x can be pre-computed and reused (Jeon et al., 2020)
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." European Conference on Computer Vision. Springer, Cham, 2016.
Guo, Yunhui. "A Survey on Methods and Theories of Quantized Neural Networks." arXiv preprint arXiv:1808.04752 (2018).
Xu, Chen, et al. "Alternating Multi-bit Quantization for Recurrent Neural Networks." Sixth International Conference on Learning Representations. 2018.
Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020).
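The greedy approximation mentioned above admits a short sketch: at each step, take the sign of the residual as the next binary code and the mean absolute residual as its scale. This is our reading of the standard greedy method, not the authors' exact implementation:

```python
import numpy as np

def greedy_binary_quantize(w, q):
    """Greedy q-bit binary-code quantization: w ≈ sum_i alpha_i * b_i,
    with each b_i in {-1, +1}^n and scalar scales alpha_i."""
    residual = np.asarray(w, dtype=np.float64).copy()
    alphas, codes = [], []
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)  # best binary code for residual
        alpha = np.abs(residual).mean()         # least-squares scale for that code
        alphas.append(alpha)
        codes.append(b)
        residual -= alpha * b                   # quantize what is left over
    return np.array(alphas), np.array(codes)

w = np.array([0.9, -1.1, 0.4, -0.2])
alphas, codes = greedy_binary_quantize(w, q=3)
w_hat = (alphas[:, None] * codes).sum(axis=0)   # reconstruction
```

Each extra bit quantizes the remaining residual, so the L2 reconstruction error is non-increasing in q.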
Transformer Quantization for On-Device NMT
I. We explore binary-code-based quantization for on-device NMT
II. Our contributions
A. A mixed-precision quantization strategy for extremely low-bit quantization
B. Under-3-bit quantization with a BLEU drop of less than 0.5
C. Significantly reduced runtime memory and a high speedup on an actual mobile device
Quantization Method for Embedding
❏ Word frequency follows a power-law distribution
❏ 1% of words account for 95% of word occurrences in the WMT2014 data
❏ See also Chen et al. (2018)
❏ Assign a higher number of bits to more frequent words
❏ Algorithm 1: assign {b, b-1, ..., 1} bits to words in frequency-based clusters of size ratio r^0 : r^1 : ... : r^(b-1)
❏ With r > 1, smaller clusters hold more frequent words
❏ We achieve a 1.32-bit embedding that performs on par with a 2-bit embedding (all words at 2 bits)
❏ Algorithm 1 applied with b=4, r=4
❏ Assigns {4, 3, 2, 1} bits to words in frequency clusters of size ratio 1:4:16:64
Chen, Patrick, et al. "GroupReduce: Block-wise Low-rank Approximation for Neural Language Model Shrinking." Advances in Neural Information Processing Systems. 2018.
Fig. 3. Power-law distribution of word frequencies
Fig. 4. BLEU score on En2De newstest2013 with quantized embeddings (no retraining, baseline BLEU 25.4)
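Algorithm 1's cluster-and-assign step can be sketched as follows (the function and variable names are ours, not the paper's):

```python
def assign_embedding_bits(num_words, b, r):
    """Split a frequency-sorted vocabulary into b clusters whose sizes follow
    the ratio r^0 : r^1 : ... : r^(b-1); the i-th cluster (most frequent
    words first) is quantized with b-i bits."""
    weights = [r ** i for i in range(b)]
    total = sum(weights)
    sizes = [round(num_words * wt / total) for wt in weights]
    sizes[-1] = num_words - sum(sizes[:-1])   # absorb rounding in last cluster
    bits = []
    for i, size in enumerate(sizes):
        bits.extend([b - i] * size)
    return bits  # bits[k] = quantization bits for the k-th most frequent word

bits = assign_embedding_bits(32_000, b=4, r=4)
avg_bits = sum(bits) / len(bits)  # ~1.32 bits on average, as on the slide
```

With b=4, r=4 the cluster ratio is 1:4:16:64, and the weighted average (4·1 + 3·4 + 2·16 + 1·64) / 85 ≈ 1.32 bits reproduces the slide's figure; with r=1 the clusters are equal-sized and the average is 2.5 bits.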
Quantization Method for Encoder and Decoder
❏ Bit assignment based on sensitivity
❏ Sublayers show different degrees of BLEU degradation when quantized
❏ The model exhibits severe BLEU degradation when Encffn sublayers or Deced sublayers are quantized
❏ See also Michel et al. (2019)
❏ Within the encoder and decoder blocks, we assign a higher number of bits to more sensitive sublayers
❏ The decoder block is assigned fewer bits than the encoder block
❏ Recap: the decoder is hard to parallelize
Michel, Paul, Omer Levy, and Graham Neubig. "Are Sixteen Heads Really Better than One?" Advances in Neural Information Processing Systems. 2019.
Table 1. BLEU on En2De newstest2013 with one type of sublayer quantized (no retraining, baseline BLEU 25.4)
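A hedged sketch of the sensitivity-driven assignment: measure the BLEU drop when each sublayer type is quantized in isolation (cf. Table 1), then hand more bits to the types that degrade most. The sublayer names and drop values below are illustrative placeholders, not the paper's measurements:

```python
def assign_bits_by_sensitivity(bleu_drop, bit_choices=(1, 2, 3, 4)):
    """Rank sublayer types by BLEU degradation when quantized alone;
    the most sensitive type gets the widest bit width."""
    ranked = sorted(bleu_drop, key=bleu_drop.get, reverse=True)
    bits = {}
    for rank, name in enumerate(ranked):
        # walk down the bit widths as sensitivity decreases; floor at the
        # smallest available width once the choices run out
        idx = max(len(bit_choices) - 1 - rank, 0)
        bits[name] = bit_choices[idx]
    return bits

# Illustrative drops only (larger = more sensitive to quantization)
drops = {"enc_ffn": 3.1, "dec_ed": 2.4, "enc_ee": 1.0, "dec_dd": 0.8, "dec_ffn": 0.3}
print(assign_bits_by_sensitivity(drops))
```

In practice one would also cap the decoder-side widths, since the decoder dominates on-device latency regardless of its sensitivity ranking.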
Implementation Details - Retraining
❏ Quantization method
① Train an FP baseline (base configuration)
② Retrain and quantize Emb.
③ Retrain and quantize Emb. + Dec.
④ Retrain and quantize Emb. + Dec. + Enc.
❏ Details
❏ We adopt pNR in our retraining (Fig. 5)
❏ to modulate the strong regularization strength (a pNR of 1000 to 2000 works well 👍)
❏ We set the quantization bits as follows:
❏ Decdd, Deced, Decffn: 2, 3, and 1 bits (1.8-bit decoder)
❏ Encee and Encffn: 3 and 4 bits (3.7-bit encoder) → 2.6-bit Transformer
❏ Emb: Algorithm 1 with b=4, r=1 (2.5-bit embedding)
❏ Bias and layer-normalization parameters are not quantized (more details in the paper)
Fig. 5. Illustration of the pNR variable in action. In our case, the weight regularization applied is binary-code-based quantization.
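As a rough sketch of how quantization-aware retraining with a regularization period can look: full-precision weights take ordinary gradient steps, and only every `pnr` steps are they pulled onto their binary-code quantized values. This is a simplification of the pNR scheme of Lee et al. (2019), with a hard projection standing in for their regularizer; all names here are ours:

```python
import numpy as np

def binary_quantize(w, q=2):
    """Greedy q-bit binary-code quantization of a weight vector."""
    residual = w.astype(np.float64).copy()
    w_hat = np.zeros_like(residual)
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)
        alpha = np.abs(residual).mean()
        w_hat += alpha * b
        residual -= alpha * b
    return w_hat

def retrain_step(w, grad, lr, step, pnr=1000):
    """One retraining step: SGD on FP weights; every `pnr` steps the weights
    are snapped to their quantized approximation (hard stand-in for pNR)."""
    w = w - lr * grad
    if step % pnr == 0:
        w = binary_quantize(w)
    return w
```

A larger `pnr` lets the full-precision weights drift further between projections, which is one way to read the slide's note that 1000 to 2000 steps modulates the otherwise strong regularization.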
Results
❏ We achieve 2.6-bit Transformer models
❏ Within 0.5 BLEU of the baseline
❏ 11.8× compression in actual model size
← 1st retraining (Emb.)
← 2nd retraining (Emb. + Dec.)
← Last retraining (Emb. + Dec. + Enc.)
Table 2. Average quantization bits, BLEU scores, and model sizes of our quantized models. Note that the 2.5-bit embeddings are quantized using Algorithm 1 with b=4, r=1.
Results
❏ We achieve a 3.5× speedup and 8.3× runtime memory compression
❏ Most of the speedup comes from embedding and decoder quantization
❏ Recap: they are involved in decode steps with limited parallelization
❏ Encoder quantization contributes little to the speedup (the encoder is highly parallelizable)
Table 3. Speedup and runtime memory reduction achieved with quantization. On-device inference tested on a Galaxy N10+ with the En2De model translating newstest2013 using our C++ code.
More Analysis
❏ Memory inefficiency in the FP Transformer
Table 4. FLOPs required by a translation run, and the actual on-device latency attributable to each of the Transformer blocks. A translation run denotes a 30-word-input to 30-word-output translation. Tested on a Galaxy N10+.
❏ Comparison with other quantization strategies
Table 5. Comparison with other quantization strategies. Bhandare et al. (2019) report a 1.5× speedup from quantization, and Prato et al. (2019) report no speed measurements.
More results and analysis are available in our paper!
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
Bhandare, Aishwarya, et al. "Efficient 8-bit Quantization of Transformer Neural Machine Language Translation Model." arXiv preprint arXiv:1906.00532 (2019).
Prato, Gabriele, Ella Charlaix, and Mehdi Rezagholizadeh. "Fully Quantized Transformer for Improved Translation." arXiv preprint arXiv:1910.10485 (2019).
References - Figures and Tables
❏ Fig. 1. The Transformer architecture from:
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
❏ Fig. 2. INT8 quantization from:
"Quantization - Neural Network Distiller", https://nervanasystems.github.io/distiller/algo_quantization.html.
❏ Fig. 5. Illustration of the pNR variable in action from:
Lee, Dongsoo, et al. "Decoupling Weight Regularization from Batch Size for Model Compression." Sep. 2019, openreview.net, https://openreview.net/forum?id=BJlaG0VFDH.
❏ All other tables and figures from:
Chung, Insoo, Byeongwook Kim, et al. "Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation." Findings of the Association for Computational Linguistics: EMNLP 2020.