Extremely Low Bit
Transformer Quantization
for On-Device NMT
Insoo Chung*, Byeongwook Kim*, Yoonjung Choi, Se Jung Kwon,
Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee
{insooo.chung, byeonguk.kim, yj0807.choi, sejung0.kwon,
dragwon.jeon, bpbs.park, sangha01.kim, dongsoo3.lee}@samsung.com
* Equal contribution
Fig. 1. The Transformer architecture
❏ High computation and memory requirements
❏ Transformer (Vaswani et al. 2017) requires a huge number of parameters
❏ 60.9M parameters for the base configuration with vocab. size 32k
❏ Hard-to-parallelize inference
❏ Each recurrent decode step depends on the outputs of all prior steps
(→ hard to parallelize; see the decode-loop sketch after this slide)
❏ During recurrent decode steps, decoder and embedding
parameters are repeatedly loaded into the cache
(→ memory-wall problem) (Jeon et al. 2020)
❏ # params in decoder + embedding ≈ 70% of all parameters (base)
❏ Objective: Reduce computational burden,
but keep translation quality high!
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020).
Problem with Transformer’s On-Device Deployment
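To make the hard-to-parallelize point concrete, here is a minimal sketch of token-by-token greedy decoding. The hook step_fn is a hypothetical stand-in for one full decoder pass (an illustration of the decoding structure, not code from the paper): because step t consumes tokens 0..t-1, steps cannot run in parallel, and the decoder and output embedding weights are re-read on every generated token.

```python
def greedy_decode(encoder_states, step_fn, bos_id, eos_id, max_len=30):
    """Token-by-token greedy decoding.

    step_fn(encoder_states, tokens) runs one full decoder pass and returns a
    list of next-token logits; it is a hypothetical stand-in for the
    framework-specific decoder call.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(encoder_states, tokens)   # touches all decoder + embedding weights
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax over the vocabulary
        tokens.append(next_id)
        if next_id == eos_id:                      # stop once end-of-sentence is emitted
            break
    return tokens
```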
Background - Quantization
❏ Uniform quantization
❏ Representing FP32 parameters with lower-precision values (e.g., INT8, INT4; a toy sketch follows this slide)
❏ Minor accuracy loss (for some NNs) with weight quantization
❏ For efficient integer arithmetic, activation quantization is also required
❏ Activation quantization leads to severe accuracy loss
in softmax and layer normalization (Bhandare et al., 2019) 😢
❏ Non-uniform quantization (codebook-based)
❏ Representing parameters with a selected number of centroids
❏ Low-bit quantization (for some NNs) and reduced runtime memory footprint
❏ Inference requires a mandatory dequantization step (Stock et al., 2020) 😢
Bhandare, Aishwarya, et al. "Efficient 8-bit quantization of transformer neural machine language translation model." arXiv preprint arXiv:1906.00532 (2019).
Stock, Pierre, et al. "And the Bit Goes Down: Revisiting the Quantization of Neural Networks." Eighth International Conference on Learning Representations. 2020.
Fig. 2. INT8 quantization
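A minimal numpy sketch of symmetric per-tensor INT8 weight quantization, just to make the uniform-quantization bullet concrete; the exact scheme in Fig. 2 and in Bhandare et al. (2019) may differ (e.g., zero-point or per-channel handling), so treat this as an illustration rather than their method.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * w_int8."""
    scale = max(np.abs(w).max(), 1e-8) / 127.0     # map the largest magnitude to 127
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int8, scale

def dequantize(w_int8, scale):
    return w_int8.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
w_q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(w_q, s)).max())
```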
Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European conference on computer vision. Springer, Cham, 2016.
Guo, Yunhui. "A survey on methods and theories of quantized neural networks." arXiv preprint arXiv:1808.04752 (2018).
Xu, Chen, et al. "Alternating multi-bit quantization for recurrent neural networks." Sixth International Conference on Learning Representations. 2018.
Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020).
Background - Quantization
w : an FP vector (e.g., a row in a weight matrix)
αi : i-th FP scale (scalar)
bi : i-th binary code vector
q-bit approximation: w ≈ Σ_{i=1..q} αi bi
Expand to matrix quantization: W ≈ Σ_{i=1..q} αi ⊙ Bi
(the scales αi become per-row vectors; ⊙ denotes element-wise multiplication broadcast over the rows of Bi)
❏ Non-uniform quantization (based on binary codes ∈{−1, +1}) (Rastegari et al., 2016)
❏ q-bit quantization based on binary codes
❏ Low bit quantization (for some NN)
❏ Approximation algorithms: Guo et al., 2017; Xu et al., 2018
❏ We use greedy approximation for simplicity (sketched after this slide)
❏ Matmul with FP activations and quantized weights
❏ Bi · x can be pre-computed and reused (Jeon et al., 2020)
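A minimal numpy sketch of the greedy q-bit approximation referred to above: each pass fits one scale αi and one sign vector bi to the current residual. This is the textbook greedy scheme, shown as an illustration rather than the authors' exact implementation.

```python
import numpy as np

def greedy_binary_quantize(w, q):
    """Greedy q-bit approximation w ~= sum_i alpha_i * b_i with b_i in {-1, +1}."""
    residual = w.astype(np.float32).copy()
    alphas, codes = [], []
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign of the current residual
        alpha = np.abs(residual).mean()          # least-squares optimal scale for this b:
        alphas.append(alpha)                     # (r.b)/(b.b) = mean(|r|)
        codes.append(b)
        residual -= alpha * b                    # quantize the remaining error in the next pass
    return np.array(alphas), np.stack(codes)

w = np.random.randn(512).astype(np.float32)      # e.g., one row of a weight matrix
alphas, codes = greedy_binary_quantize(w, q=3)
w_hat = (alphas[:, None] * codes).sum(axis=0)
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```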
I. We explore binary-code-based quantization for on-device NMT
II. Our contribution
A. A mixed precision quantization strategy for extremely low bit quantization
B. Under-3-bit quantization with less than a 0.5 BLEU drop
C. Significantly reduced runtime memory and high speedup on an actual mobile device
Transformer Quantization for On-Device NMT
Quantization Method for Embedding
❏ Word frequency follows a power-law distribution
❏ 1% of words account for 95% of word occurrences
in WMT2014 data
❏ See also, Chen et al. (2018)
❏ Assign higher number of bits to more frequent words
❏ Algorithm 1: assign {b, b-1, ..., 1} bits to words in
frequency-based clusters of size ratio r^0:r^1:...:r^(b-1)
❏ With r > 1, smaller clusters hold more frequent words
❏ We achieve a 1.32-bit embedding that performs
on par with a 2-bit embedding (all words at 2 bits; Algorithm 1 is sketched after this slide)
❏ Algorithm 1 with b=4, r=4 applied
❏ Assigned {4, 3, 2, 1} bits to words in freq. clusters of
size ratio 1:4:16:64
Chen, Patrick, et al. "Groupreduce: Block-wise low-rank approximation for neural language model shrinking." Advances in neural information processing systems. 2018.
Fig. 3. Power law distribution of word frequencies
Fig. 4. BLEU score on En2De newstest2013
with quantized embeddings (no retraining,
baseline BLEU 25.4)
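A rough sketch of the Algorithm 1 idea (the function name and the rounding of cluster sizes are our own illustrative choices): sort words by frequency, split the vocabulary into b clusters whose sizes follow the ratio r^0 : r^1 : ... : r^(b-1), and give the smallest (most frequent) cluster b bits, down to 1 bit for the largest. With b=4, r=4 the cluster ratio is 1:4:16:64, so the average is (4·1 + 3·4 + 2·16 + 1·64) / 85 ≈ 1.32 bits per word, matching the 1.32-bit embedding above.

```python
import numpy as np

def assign_embedding_bits(word_freqs, b=4, r=4):
    """Assign {b, b-1, ..., 1} bits to words grouped into frequency clusters
    whose sizes follow the ratio r^0 : r^1 : ... : r^(b-1)."""
    vocab = len(word_freqs)
    order = np.argsort(-np.asarray(word_freqs))            # most frequent words first
    ratios = np.array([r ** i for i in range(b)], dtype=np.float64)
    sizes = np.round(vocab * ratios / ratios.sum()).astype(int)
    sizes[-1] = vocab - sizes[:-1].sum()                   # absorb rounding error in the last cluster
    bits = np.empty(vocab, dtype=int)
    start = 0
    for cluster, size in enumerate(sizes):
        bits[order[start:start + size]] = b - cluster      # frequent words get more bits
        start += size
    return bits

# Hypothetical power-law word counts standing in for the WMT2014 statistics
freqs = np.random.zipf(1.5, size=32000)
bits = assign_embedding_bits(freqs, b=4, r=4)
print("average bits per word:", bits.mean())               # ~1.32 for b=4, r=4
```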
Quantization Method for Encoder and Decoder
❏ Bit assignment based on sensitivity
❏ Different sublayer types show different degrees of BLEU degradation when quantized
❏ The model exhibits severe BLEU degradation when
Encffn sublayers or Deced sublayers are quantized
❏ See also Michel et al. (2019)
❏ Within encoder and decoder blocks, we assign more
sensitive sublayers a higher number of bits (probe loop sketched after this slide)
❏ The decoder block is assigned fewer bits than the encoder block
❏ Recap - Hard-to-parallelize decoder
Michel, Paul, Omer Levy, and Graham Neubig. "Are sixteen heads really better than one?" Advances in Neural Information Processing Systems. 2019.
Table 1. BLEU on En2De newstest2013 with
a type of sublayers quantized
(no retraining, baseline BLEU 25.4)
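The sensitivity study behind Table 1 can be read as the following loop (a hedged sketch: quantize_sublayer and evaluate_bleu are caller-supplied hooks, not APIs from the paper's code). Quantize one sublayer type at a time without retraining, measure the BLEU drop, and give the sublayer types with the largest drops more bits.

```python
import copy

def sensitivity_probe(model, sublayer_types, quantize_sublayer, evaluate_bleu, dev_set):
    """Quantize one sublayer type at a time (no retraining) and record the BLEU drop.

    quantize_sublayer(model, sublayer_type) and evaluate_bleu(model, dev_set) are
    caller-supplied hooks standing in for the framework-specific pieces.
    """
    baseline = evaluate_bleu(model, dev_set)        # e.g., 25.4 on En2De newstest2013
    drops = {}
    for sub in sublayer_types:                      # e.g., Encee, Encffn, Decdd, Deced, Decffn
        probe = copy.deepcopy(model)
        quantize_sublayer(probe, sub)               # binary-code quantize only that sublayer type
        drops[sub] = baseline - evaluate_bleu(probe, dev_set)
    return drops                                    # larger drop => more sensitive => assign more bits
```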
❏ Quantization method
① Train an FP baseline (base configuration)
② Retrain and quantize Emb.
③ Retrain and quantize Emb. + Dec.
④ Retrain and quantize Emb. + Dec. + Enc.
❏ Details
❏ We adopt the pNR variable (Lee et al. 2019) in our retraining (Fig. 5)
❏ to modulate the strong regularization strength (values of 1000 to 2000 work well 👍)
❏ We set quantization bits as follows (average-bit arithmetic sketched after this slide):
❏ For Decdd, Deced, and Decffn: 2, 3, and 1 bits (1.8-bit decoder)
❏ For Encee and Encffn: 3 and 4 bits (3.7-bit encoder) → 2.6-bit Transformer
❏ For Emb: apply Alg. 1 with b=4, r=1 (2.5-bit embedding)
❏ Bias and layer normalization parameters are not quantized
More details in the paper
Implementation Details - Retraining
Fig. 5. Illustration of pNR variable in action.
In our case, the weight regularization applied
is binary coding based quantization.
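A small sketch of how the per-sublayer bits above average out, weighting each sublayer's bits by its parameter share; we assume the standard base-configuration breakdown of 4·d² weights per attention sublayer and 8·d² per feed-forward sublayer (this arithmetic is our reading of the 1.8-bit / 3.7-bit figures, not taken from the paper).

```python
D = 512                                            # base configuration model dimension

# Parameters per sublayer (per layer) in the base Transformer:
# 4*D^2 for an attention sublayer, 8*D^2 for an FFN sublayer (assumed breakdown).
params = {"Dec_dd": 4 * D**2, "Dec_ed": 4 * D**2, "Dec_ffn": 8 * D**2,
          "Enc_ee": 4 * D**2, "Enc_ffn": 8 * D**2}
bits   = {"Dec_dd": 2, "Dec_ed": 3, "Dec_ffn": 1,  # -> ~1.8-bit decoder
          "Enc_ee": 3, "Enc_ffn": 4}               # -> ~3.7-bit encoder

def avg_bits(names):
    total = sum(params[n] for n in names)
    return sum(bits[n] * params[n] for n in names) / total

print("decoder:", avg_bits(["Dec_dd", "Dec_ed", "Dec_ffn"]))   # 1.75 ~= 1.8
print("encoder:", avg_bits(["Enc_ee", "Enc_ffn"]))             # 3.67 ~= 3.7
```

For the embedding, Algorithm 1 with b=4, r=1 gives four equal-size clusters at 4, 3, 2, and 1 bits, i.e., (4+3+2+1)/4 = 2.5 bits on average, which matches the 2.5-bit embedding above.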
Results
❏ We achieve 2.6-bit Transformer models
❏ Within 0.5 BLEU of the baseline
❏ 11.8X compression in actual model size
← 1st retraining (Emb.)
← 2nd retraining (Emb. + Dec.)
← Last retraining (Emb. + Dec. + Enc.)
Table 2. Average quantization bits, BLEU scores,
and model sizes of our quantized models.
Note that the 2.5-bit embeddings are quantized using
Algorithm 1 with b=4, r=1
Results
❏ We achieve 3.5X speedup and 8.3X runtime memory compression
❏ Most of the speedup is achieved with embedding and decoder quantization
❏ Recap - they are involved in the decode steps, which have limited parallelization
❏ Encoder quantization contributes little to the speedup (high parallelizability)
Table 3. Speedup and runtime memory reduction
achieved with quantization. Tested on-device
inference on Galaxy N10+ with En2De model
translating newstest2013 using our C++ code
More Analysis
❏ Memory inefficiency in the FP Transformer
❏ Comparison with other quantization strategies
Table 4. FLOPs required by a translation run, and
actual on-device latency accounted for by each of the
Transformer blocks. A translation run denotes
a 30-word-input to 30-word-output translation.
Tested with Galaxy N10+
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Bhandare, Aishwarya, et al. "Efficient 8-bit quantization of transformer neural machine language translation model." arXiv preprint arXiv:1906.00532 (2019).
Prato, Gabriele, Ella Charlaix, and Mehdi Rezagholizadeh. "Fully quantized transformer for improved translation." arXiv preprint arXiv:1910.10485 (2019).
Table 5. Comparison with other quantization strategies.
Bhandare et al. (2019) reports 1.5X speedup from
quantization and Prato et al. (2019) reports no
measurements of speed.
More results and analysis available in our paper!
References - Figures and Tables
❏ Fig. 1. The Transformer architecture from:
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
❏ Fig. 2. INT8 quantization from:
"Quantization - Neural Network Distiller", https://nervanasystems.github.io/distiller/algo_quantization.html.
❏ Fig. 5. Illustration of pNR variable in action from:
Lee, Dongsoo, et al. "Decoupling Weight Regularization from Batch Size for Model Compression", Sep. 2019, openreview.net,
https://openreview.net/forum?id=BJlaG0VFDH.
❏ All other tables and figures from:
Chung, Insoo, Byeongwook Kim, et al. "Extremely Low Bit Transformer Quantization for On-Device Neural Machine
Translation." Findings of the Association for Computational Linguistics: EMNLP 2020.
Thank you 😊

More Related Content

What's hot

IMPROVING OF ARTIFICIAL NEURAL NETWORKS PERFORMANCE BY USING GPU’S: A SURVEY
IMPROVING OF ARTIFICIAL NEURAL NETWORKS PERFORMANCE BY USING GPU’S: A SURVEYIMPROVING OF ARTIFICIAL NEURAL NETWORKS PERFORMANCE BY USING GPU’S: A SURVEY
IMPROVING OF ARTIFICIAL NEURAL NETWORKS PERFORMANCE BY USING GPU’S: A SURVEYcsandit
 
Improving of artifical neural networks performance by using gpu's a survey
Improving of artifical neural networks performance by using gpu's  a surveyImproving of artifical neural networks performance by using gpu's  a survey
Improving of artifical neural networks performance by using gpu's a surveycsandit
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsDevansh16
 
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Bioinformatics kernels relations
Bioinformatics kernels relationsBioinformatics kernels relations
Bioinformatics kernels relationsMichiel Stock
 
A R T I F I C I A L N E U R A L N E T W O R K S J N T U M O D E L P A P ...
A R T I F I C I A L  N E U R A L  N E T W O R K S  J N T U  M O D E L  P A P ...A R T I F I C I A L  N E U R A L  N E T W O R K S  J N T U  M O D E L  P A P ...
A R T I F I C I A L N E U R A L N E T W O R K S J N T U M O D E L P A P ...guest3f9c6b
 
Compressive Sensing Based Simultaneous Data Compression and Convergent Encryp...
Compressive Sensing Based Simultaneous Data Compression and Convergent Encryp...Compressive Sensing Based Simultaneous Data Compression and Convergent Encryp...
Compressive Sensing Based Simultaneous Data Compression and Convergent Encryp...IJCSIS Research Publications
 
Compressed sensing techniques for sensor data using unsupervised learning
Compressed sensing techniques for sensor data using unsupervised learningCompressed sensing techniques for sensor data using unsupervised learning
Compressed sensing techniques for sensor data using unsupervised learningSong Cui, Ph.D
 
Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biology Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biology tuxette
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksHakky St
 
Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biologyKernel methods for data integration in systems biology
Kernel methods for data integration in systems biologytuxette
 
A new partial image encryption method for document images using variance base...
A new partial image encryption method for document images using variance base...A new partial image encryption method for document images using variance base...
A new partial image encryption method for document images using variance base...IJECEIAES
 
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...csandit
 
Non-Separable Histogram Based Reversible Data Hiding Approach Using Inverse S...
Non-Separable Histogram Based Reversible Data Hiding Approach Using Inverse S...Non-Separable Histogram Based Reversible Data Hiding Approach Using Inverse S...
Non-Separable Histogram Based Reversible Data Hiding Approach Using Inverse S...IJCSIS Research Publications
 
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...ijaia
 
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 

What's hot (17)

IMPROVING OF ARTIFICIAL NEURAL NETWORKS PERFORMANCE BY USING GPU’S: A SURVEY
IMPROVING OF ARTIFICIAL NEURAL NETWORKS PERFORMANCE BY USING GPU’S: A SURVEYIMPROVING OF ARTIFICIAL NEURAL NETWORKS PERFORMANCE BY USING GPU’S: A SURVEY
IMPROVING OF ARTIFICIAL NEURAL NETWORKS PERFORMANCE BY USING GPU’S: A SURVEY
 
Improving of artifical neural networks performance by using gpu's a survey
Improving of artifical neural networks performance by using gpu's  a surveyImproving of artifical neural networks performance by using gpu's  a survey
Improving of artifical neural networks performance by using gpu's a survey
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
 
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
Unsupervised Learning (D2L6 2017 UPC Deep Learning for Computer Vision)
 
Neural Networks: Introducton
Neural Networks: IntroductonNeural Networks: Introducton
Neural Networks: Introducton
 
Bioinformatics kernels relations
Bioinformatics kernels relationsBioinformatics kernels relations
Bioinformatics kernels relations
 
A R T I F I C I A L N E U R A L N E T W O R K S J N T U M O D E L P A P ...
A R T I F I C I A L  N E U R A L  N E T W O R K S  J N T U  M O D E L  P A P ...A R T I F I C I A L  N E U R A L  N E T W O R K S  J N T U  M O D E L  P A P ...
A R T I F I C I A L N E U R A L N E T W O R K S J N T U M O D E L P A P ...
 
Compressive Sensing Based Simultaneous Data Compression and Convergent Encryp...
Compressive Sensing Based Simultaneous Data Compression and Convergent Encryp...Compressive Sensing Based Simultaneous Data Compression and Convergent Encryp...
Compressive Sensing Based Simultaneous Data Compression and Convergent Encryp...
 
Compressed sensing techniques for sensor data using unsupervised learning
Compressed sensing techniques for sensor data using unsupervised learningCompressed sensing techniques for sensor data using unsupervised learning
Compressed sensing techniques for sensor data using unsupervised learning
 
Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biology Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biology
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networks
 
Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biologyKernel methods for data integration in systems biology
Kernel methods for data integration in systems biology
 
A new partial image encryption method for document images using variance base...
A new partial image encryption method for document images using variance base...A new partial image encryption method for document images using variance base...
A new partial image encryption method for document images using variance base...
 
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
 
Non-Separable Histogram Based Reversible Data Hiding Approach Using Inverse S...
Non-Separable Histogram Based Reversible Data Hiding Approach Using Inverse S...Non-Separable Histogram Based Reversible Data Hiding Approach Using Inverse S...
Non-Separable Histogram Based Reversible Data Hiding Approach Using Inverse S...
 
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
NETWORK LEARNING AND TRAINING OF A CASCADED LINK-BASED FEED FORWARD NEURAL NE...
 
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
 

Similar to Extremely Low Bit Transformer Quantization for On-Device NMT

HANDWRITTEN DIGIT RECOGNITION
HANDWRITTEN DIGIT RECOGNITIONHANDWRITTEN DIGIT RECOGNITION
HANDWRITTEN DIGIT RECOGNITIONIRJET Journal
 
HANDWRITTEN DIGIT RECOGNITION
HANDWRITTEN DIGIT RECOGNITIONHANDWRITTEN DIGIT RECOGNITION
HANDWRITTEN DIGIT RECOGNITIONIRJET Journal
 
A White Paper On Neural Network Quantization
A White Paper On Neural Network QuantizationA White Paper On Neural Network Quantization
A White Paper On Neural Network QuantizationApril Knyff
 
IBM Cloud Paris Meetup 20180517 - Deep Learning Challenges
IBM Cloud Paris Meetup 20180517 - Deep Learning ChallengesIBM Cloud Paris Meetup 20180517 - Deep Learning Challenges
IBM Cloud Paris Meetup 20180517 - Deep Learning ChallengesIBM France Lab
 
a paper reading of table recognition
a paper reading of table recognitiona paper reading of table recognition
a paper reading of table recognitionNing Lu
 
Artificial Intelligence Applications in Petroleum Engineering - Part I
Artificial Intelligence Applications in Petroleum Engineering - Part IArtificial Intelligence Applications in Petroleum Engineering - Part I
Artificial Intelligence Applications in Petroleum Engineering - Part IRamez Abdalla, M.Sc
 
Accelerating algorithmic and hardware advancements for power efficient on-dev...
Accelerating algorithmic and hardware advancements for power efficient on-dev...Accelerating algorithmic and hardware advancements for power efficient on-dev...
Accelerating algorithmic and hardware advancements for power efficient on-dev...Qualcomm Research
 
Federated Learning of Neural Network Models with Heterogeneous Structures.pdf
Federated Learning of Neural Network Models with Heterogeneous Structures.pdfFederated Learning of Neural Network Models with Heterogeneous Structures.pdf
Federated Learning of Neural Network Models with Heterogeneous Structures.pdfKundjanasith Thonglek
 
DLD_WeightSharing_Slide
DLD_WeightSharing_SlideDLD_WeightSharing_Slide
DLD_WeightSharing_SlideKang-Ho Lee
 
SANN: Programming Code Representation Using Attention Neural Network with Opt...
SANN: Programming Code Representation Using Attention Neural Network with Opt...SANN: Programming Code Representation Using Attention Neural Network with Opt...
SANN: Programming Code Representation Using Attention Neural Network with Opt...Peter Brusilovsky
 
Hand Written Digit Classification
Hand Written Digit ClassificationHand Written Digit Classification
Hand Written Digit Classificationijtsrd
 
Neural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep LearningNeural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep Learningcomifa7406
 
DLD meetup 2017, Efficient Deep Learning
DLD meetup 2017, Efficient Deep LearningDLD meetup 2017, Efficient Deep Learning
DLD meetup 2017, Efficient Deep LearningBrodmann17
 
Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...
Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...
Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...miyurud
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesmustafa sarac
 
Application of Hybrid Genetic Algorithm Using Artificial Neural Network in Da...
Application of Hybrid Genetic Algorithm Using Artificial Neural Network in Da...Application of Hybrid Genetic Algorithm Using Artificial Neural Network in Da...
Application of Hybrid Genetic Algorithm Using Artificial Neural Network in Da...IOSRjournaljce
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Handwritten Digit Recognition Using CNN
Handwritten Digit Recognition Using CNNHandwritten Digit Recognition Using CNN
Handwritten Digit Recognition Using CNNIRJET Journal
 
Introduction To Machine Learning and Neural Networks
Introduction To Machine Learning and Neural NetworksIntroduction To Machine Learning and Neural Networks
Introduction To Machine Learning and Neural Networks德平 黄
 

Similar to Extremely Low Bit Transformer Quantization for On-Device NMT (20)

HANDWRITTEN DIGIT RECOGNITION
HANDWRITTEN DIGIT RECOGNITIONHANDWRITTEN DIGIT RECOGNITION
HANDWRITTEN DIGIT RECOGNITION
 
HANDWRITTEN DIGIT RECOGNITION
HANDWRITTEN DIGIT RECOGNITIONHANDWRITTEN DIGIT RECOGNITION
HANDWRITTEN DIGIT RECOGNITION
 
A White Paper On Neural Network Quantization
A White Paper On Neural Network QuantizationA White Paper On Neural Network Quantization
A White Paper On Neural Network Quantization
 
IBM Cloud Paris Meetup 20180517 - Deep Learning Challenges
IBM Cloud Paris Meetup 20180517 - Deep Learning ChallengesIBM Cloud Paris Meetup 20180517 - Deep Learning Challenges
IBM Cloud Paris Meetup 20180517 - Deep Learning Challenges
 
a paper reading of table recognition
a paper reading of table recognitiona paper reading of table recognition
a paper reading of table recognition
 
Artificial Intelligence Applications in Petroleum Engineering - Part I
Artificial Intelligence Applications in Petroleum Engineering - Part IArtificial Intelligence Applications in Petroleum Engineering - Part I
Artificial Intelligence Applications in Petroleum Engineering - Part I
 
Accelerating algorithmic and hardware advancements for power efficient on-dev...
Accelerating algorithmic and hardware advancements for power efficient on-dev...Accelerating algorithmic and hardware advancements for power efficient on-dev...
Accelerating algorithmic and hardware advancements for power efficient on-dev...
 
Federated Learning of Neural Network Models with Heterogeneous Structures.pdf
Federated Learning of Neural Network Models with Heterogeneous Structures.pdfFederated Learning of Neural Network Models with Heterogeneous Structures.pdf
Federated Learning of Neural Network Models with Heterogeneous Structures.pdf
 
DLD_WeightSharing_Slide
DLD_WeightSharing_SlideDLD_WeightSharing_Slide
DLD_WeightSharing_Slide
 
SANN: Programming Code Representation Using Attention Neural Network with Opt...
SANN: Programming Code Representation Using Attention Neural Network with Opt...SANN: Programming Code Representation Using Attention Neural Network with Opt...
SANN: Programming Code Representation Using Attention Neural Network with Opt...
 
Hand Written Digit Classification
Hand Written Digit ClassificationHand Written Digit Classification
Hand Written Digit Classification
 
Neural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep LearningNeural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep Learning
 
DLD meetup 2017, Efficient Deep Learning
DLD meetup 2017, Efficient Deep LearningDLD meetup 2017, Efficient Deep Learning
DLD meetup 2017, Efficient Deep Learning
 
Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...
Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...
Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
 
Application of Hybrid Genetic Algorithm Using Artificial Neural Network in Da...
Application of Hybrid Genetic Algorithm Using Artificial Neural Network in Da...Application of Hybrid Genetic Algorithm Using Artificial Neural Network in Da...
Application of Hybrid Genetic Algorithm Using Artificial Neural Network in Da...
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Handwritten Digit Recognition Using CNN
Handwritten Digit Recognition Using CNNHandwritten Digit Recognition Using CNN
Handwritten Digit Recognition Using CNN
 
Survey of recent deep learning with low precision
Survey of recent deep learning with low precisionSurvey of recent deep learning with low precision
Survey of recent deep learning with low precision
 
Introduction To Machine Learning and Neural Networks
Introduction To Machine Learning and Neural NetworksIntroduction To Machine Learning and Neural Networks
Introduction To Machine Learning and Neural Networks
 

Recently uploaded

Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 

Recently uploaded (20)

Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 

Extremely Low Bit Transformer Quantization for On-Device NMT

  • 1. Extremely Low Bit Transformer Quantization for On-Device NMT Insoo Chung*, Byeongwook Kim*, Yoonjung Choi, Se Jung Kwon, Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee {insooo.chung, byeonguk.kim, yj0807.choi, sejung0.kwon, dragwon.jeon, bpbs.park, sangha01.kim, dongsoo3.lee}@samsung.com * Equal contribution
  • 2. Fig. 1. The Transformer architecture ❏ High computation and memory requirement ❏ Transformer (Vaswani et al. 2017) requires huge # of parameters ❏ 60.9M parameters for base configuration with vocab. size 32k Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020). Problem with Transformer’s On-Device Deployment
  • 3. Fig. 1. The Transformer architecture ❏ High computation and memory requirement ❏ Transformer (Vaswani et al. 2017) requires huge # of parameters ❏ 60.9M parameters for base configuration with vocab. size 32k ❏ Hard-to-parallelize inference ❏ Recurrent decode steps dependent on outputs of all prior steps (→ hard to parallelize) ❏ During recurrent decode steps, decoder and embedding parameters are repeatedly loaded onto the cache (→ memory-wall problem) (Jeon et al. 2020) ❏ # params in decoder + embedding ≈ 70% of all parameters (base) Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020). Problem with Transformer’s On-Device Deployment
  • 4. Fig. 1. The Transformer architecture ❏ High computation and memory requirement ❏ Transformer (Vaswani et al. 2017) requires huge # of parameters ❏ 60.9M parameters for base configuration with vocab. size 32k ❏ Hard-to-parallelize inference ❏ Recurrent decode steps dependent on outputs of all prior steps (→ hard to parallelize) ❏ During recurrent decode steps, decoder and embedding parameters are repeatedly loaded onto the cache (→ memory-wall problem) (Jeon et al. 2020) ❏ # params in decoder + embedding ≈ 70% of all parameters (base) ❏ Objective: Reduce computational burden, but keep translation quality high! Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020). Problem with Transformer’s On-Device Deployment
  • 5. Background - Quantization ❏ Uniform quantization ❏ Representing FP32 parameters with lower precision parameters (e.g., INT8, INT4) ❏ Minor accuracy loss (for some NN) with weight quantization Bhandare, Aishwarya, et al. "Efficient 8-bit quantization of transformer neural machine language translation model." arXiv preprint arXiv:1906.00532 (2019). Stock, Pierre, et al. "And the Bit Goes Down: Revisiting the Quantization of Neural Networks." Eighth International Conference on Learning Representations. 2020. Fig. 2. INT8 quantization
  • 6. Background - Quantization ❏ Uniform quantization ❏ Representing FP32 parameters with lower precision parameters (e.g., INT8, INT4) ❏ Minor accuracy loss (for some NN) with weight quantization ❏ For efficient integer arithmetic, activation quantization is required ❏ Activation quantization lead to severe accuracy loss in softmax and layer normalization (Bhandare et al., 2019) 😢 Bhandare, Aishwarya, et al. "Efficient 8-bit quantization of transformer neural machine language translation model." arXiv preprint arXiv:1906.00532 (2019). Stock, Pierre, et al. "And the Bit Goes Down: Revisiting the Quantization of Neural Networks." Eighth International Conference on Learning Representations. 2020. Fig. 2. INT8 quantization
  • 7. Background - Quantization ❏ Uniform quantization ❏ Representing FP32 parameters with lower precision parameters (e.g., INT8, INT4) ❏ Minor accuracy loss (for some NN) with weight quantization ❏ For efficient integer arithmetic, activation quantization is required ❏ Activation quantization lead to severe accuracy loss in softmax and layer normalization (Bhandare et al., 2019) 😢 ❏ Non-uniform quantization (codebook-based) ❏ Representing parameters with selected number of centroids ❏ Low bit quantization (for some NN) and reduced runtime memory footprint Bhandare, Aishwarya, et al. "Efficient 8-bit quantization of transformer neural machine language translation model." arXiv preprint arXiv:1906.00532 (2019). Stock, Pierre, et al. "And the Bit Goes Down: Revisiting the Quantization of Neural Networks." Eighth International Conference on Learning Representations. 2020. Fig. 2. INT8 quantization
  • 8. Background - Quantization ❏ Uniform quantization ❏ Representing FP32 parameters with lower precision parameters (e.g., INT8, INT4) ❏ Minor accuracy loss (for some NN) with weight quantization ❏ For efficient integer arithmetic, activation quantization is required ❏ Activation quantization lead to severe accuracy loss in softmax and layer normalization (Bhandare et al., 2019) 😢 ❏ Non-uniform quantization (codebook-based) ❏ Representing parameters with selected number of centroids ❏ Low bit quantization (for some NN) and reduced runtime memory footprint ❏ Inference requires mandatory dequantization process (Stock et al., 2020) 😢 Bhandare, Aishwarya, et al. "Efficient 8-bit quantization of transformer neural machine language translation model." arXiv preprint arXiv:1906.00532 (2019). Stock, Pierre, et al. "And the Bit Goes Down: Revisiting the Quantization of Neural Networks." Eighth International Conference on Learning Representations. 2020. Fig. 2. INT8 quantization
  • 9. ❏ Non-uniform quantization (based on binary codes ∈{−1, +1}) (Rastegari et al., 2016) ❏ q-bit quantization based on binary codes Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European conference on computer vision. Springer, Cham, 2016. Guo, Yunhui. "A survey on methods and theories of quantized neural networks." arXiv preprint arXiv:1808.04752 (2018). Xu, Chen, et al. "Alternating multi-bit quantization for recurrent neural networks." Sixth International Conference on Learning Representations. 2018. Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020). Background - Quantization
  • 10. Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European conference on computer vision. Springer, Cham, 2016. Guo, Yunhui. "A survey on methods and theories of quantized neural networks." arXiv preprint arXiv:1808.04752 (2018). Xu, Chen, et al. "Alternating multi-bit quantization for recurrent neural networks." Sixth International Conference on Learning Representations. 2018. Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020). Background - Quantization w : a FP vector (e.g., a row in a weight matrix) αi : i-th FP scale (scalar) bi : i-th binary code vector ❏ Non-uniform quantization (based on binary codes ∈{−1, +1}) (Rastegari et al., 2016) ❏ q-bit quantization based on binary codes ❏ Low bit quantization (for some NN) ❏ Approximation algorithms: Guo et al., 2017; Xu et al., 2018 ❏ We use greedy approximation for simplicity
  • 11. Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European conference on computer vision. Springer, Cham, 2016. Guo, Yunhui. "A survey on methods and theories of quantized neural networks." arXiv preprint arXiv:1808.04752 (2018). Xu, Chen, et al. "Alternating multi-bit quantization for recurrent neural networks." Sixth International Conference on Learning Representations. 2018. Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020). Background - Quantization w : a FP vector (e.g., a row in a weight matrix) αi : i-th FP scale (scalar) bi : i-th binary code vector Expand to matrix quantization ❏ Non-uniform quantization (based on binary codes ∈{−1, +1}) (Rastegari et al., 2016) ❏ q-bit quantization based on binary codes ❏ Low bit quantization (for some NN) ❏ Approximation algorithms: Guo et al., 2017; Xu et al., 2018 ❏ We use greedy approximation for simplicity
  • 12. Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European conference on computer vision. Springer, Cham, 2016. Guo, Yunhui. "A survey on methods and theories of quantized neural networks." arXiv preprint arXiv:1808.04752 (2018). Xu, Chen, et al. "Alternating multi-bit quantization for recurrent neural networks." Sixth International Conference on Learning Representations. 2018. Jeon, Yongkweon, et al. "BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs." arXiv preprint arXiv:2005.09904 (2020). Background - Quantization w : a FP vector (e.g., a row in a weight matrix) αi : i-th FP scale (scalar) bi : i-th binary code vector element-wise multiplication = Expand to matrix quantization ❏ Non-uniform quantization (based on binary codes ∈{−1, +1}) (Rastegari et al., 2016) ❏ q-bit quantization based on binary codes ❏ Low bit quantization (for some NN) ❏ Approximation algorithms: Guo et al., 2017; Xu et al., 2018 ❏ We use greedy approximation for simplicity ❏ Matmul with FP activations and quantized weights ❏ Bi · x can be pre-computed and reused (Jeon et al., 2020)
  • 13. I. We explore binary-code-based quantization for on-device NMT Transformer Quantization for On-Device NMT
  • 14. I. We explore binary-code-based quantization for on-device NMT II. Our contribution A. A mixed precision quantization strategy for extremely low bit quantization Transformer Quantization for On-Device NMT
  • 15. I. We explore binary-code-based quantization for on-device NMT II. Our contribution A. A mixed precision quantization strategy for extremely low bit quantization B. Under-3-bit quantization with less than 0.5 BLEU degradation Transformer Quantization for On-Device NMT
  • 16. I. We explore binary-code-based quantization for on-device NMT II. Our contribution A. A mixed precision quantization strategy for extremely low bit quantization B. Under-3-bit quantization with less than 0.5 BLEU degradation C. Significantly reduced runtime memory and high speedup on an actual mobile device Transformer Quantization for On-Device NMT
  • 17. Quantization Method for Embedding ❏ Word frequency follows a power-law distribution ❏ 1% of words account for 95% of word occurrences in WMT2014 data ❏ See also Chen et al. (2018) Chen, Patrick, et al. "Groupreduce: Block-wise low-rank approximation for neural language model shrinking." Advances in neural information processing systems. 2018. Fig. 3. Power law distribution of word frequencies
  • 18. Quantization Method for Embedding ❏ Word frequency follows a power-law distribution ❏ 1% of words account for 95% of word occurrences in WMT2014 data ❏ See also Chen et al. (2018) ❏ Assign a higher number of bits to more frequent words Chen, Patrick, et al. "Groupreduce: Block-wise low-rank approximation for neural language model shrinking." Advances in neural information processing systems. 2018. Fig. 3. Power law distribution of word frequencies
  • 19. Quantization Method for Embedding ❏ Word frequency follows a power-law distribution ❏ 1% of words account for 95% of word occurrences in WMT2014 data ❏ See also Chen et al. (2018) ❏ Assign a higher number of bits to more frequent words ❏ Algorithm 1: assign {b, b-1, ..., 1} bits to words in frequency-based clusters of size ratio r^0:r^1:...:r^(b-1) ❏ With r > 1, smaller clusters hold more frequent words Chen, Patrick, et al. "Groupreduce: Block-wise low-rank approximation for neural language model shrinking." Advances in neural information processing systems. 2018. Fig. 3. Power law distribution of word frequencies
  • 20. Quantization Method for Embedding ❏ Word frequency follows a power-law distribution ❏ 1% of words account for 95% of word occurrences in WMT2014 data ❏ See also Chen et al. (2018) ❏ Assign a higher number of bits to more frequent words ❏ Algorithm 1: assign {b, b-1, ..., 1} bits to words in frequency-based clusters of size ratio r^0:r^1:...:r^(b-1) (sketched below) ❏ With r > 1, smaller clusters hold more frequent words ❏ We achieve a 1.32-bit embedding that performs on par with a uniform 2-bit embedding (all words at 2 bits) ❏ Algorithm 1 applied with b=4, r=4 ❏ Assigned {4, 3, 2, 1} bits to words in freq. clusters of size ratio 1:4:16:64 Chen, Patrick, et al. "Groupreduce: Block-wise low-rank approximation for neural language model shrinking." Advances in neural information processing systems. 2018. Fig. 3. Power law distribution of word frequencies Fig. 4. BLEU scores on En2De newstest2013 with quantized embeddings (no retraining, baseline BLEU 25.4)
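Below is a small NumPy sketch of the bit-assignment rule in Algorithm 1 as summarized on this slide; the function name and the exact handling of cluster-size rounding are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def assign_embedding_bits(word_freqs, b=4, r=4):
    """Assign {b, b-1, ..., 1} bits to words, using frequency-based clusters whose
    sizes follow the ratio r^0 : r^1 : ... : r^(b-1) (most frequent words get b bits)."""
    order = np.argsort(-np.asarray(word_freqs))            # word ids, most frequent first
    ratios = np.array([float(r) ** i for i in range(b)])
    sizes = np.floor(ratios / ratios.sum() * len(order)).astype(int)
    sizes[-1] = len(order) - sizes[:-1].sum()              # absorb rounding in the last cluster
    bits = np.empty(len(order), dtype=int)
    start = 0
    for i, size in enumerate(sizes):
        bits[order[start:start + size]] = b - i            # i-th cluster gets (b - i) bits
        start += size
    return bits                                            # bits[word_id] = #bits for that embedding row
```

With b=4 and r=4 the cluster sizes follow 1:4:16:64, so the average is (1·4 + 4·3 + 16·2 + 64·1) / 85 ≈ 1.32 bits per embedding row, which is where the 1.32-bit figure on this slide comes from; with b=4 and r=1 (used later) all four clusters are equal in size and the average is 2.5 bits.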
  • 21. Quantization Method for Encoder and Decoder ❏ Bit assignment based on sensitivity ❏ Different sublayer types show different degrees of BLEU degradation when quantized ❏ The model exhibits severe BLEU degradation when Encffn sublayers or Deced sublayers are quantized ❏ See also Michel et al. (2019) ❏ Within encoder and decoder blocks, we assign more sensitive sublayers a higher number of bits Michel, Paul, Omer Levy, and Graham Neubig. "Are sixteen heads really better than one?" Advances in Neural Information Processing Systems. 2019. Table 1. BLEU on En2De newstest2013 with one type of sublayer quantized (no retraining, baseline BLEU 25.4)
  • 22. Quantization Method for Encoder and Decoder ❏ Bit assignment based on sensitivity ❏ Different sublayer types show different degrees of BLEU degradation when quantized ❏ The model exhibits severe BLEU degradation when Encffn sublayers or Deced sublayers are quantized ❏ See also Michel et al. (2019) ❏ Within encoder and decoder blocks, we assign more sensitive sublayers a higher number of bits (see the sketch below) ❏ The decoder block is assigned fewer bits than the encoder block ❏ Recap - the decoder is hard to parallelize Michel, Paul, Omer Levy, and Graham Neubig. "Are sixteen heads really better than one?" Advances in Neural Information Processing Systems. 2019. Table 1. BLEU on En2De newstest2013 with one type of sublayer quantized (no retraining, baseline BLEU 25.4)
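A hedged sketch of such a sensitivity probe is shown below. Here quantize_sublayers() and evaluate_bleu() are hypothetical placeholders standing in for "quantize only one sublayer type, without retraining, and score newstest2013"; the sublayer names are shorthand for the types listed in Table 1, and none of this is the paper's released code.

```python
BASELINE_BLEU = 25.4  # FP En2De baseline on newstest2013 (Table 1 caption)

# Sublayer types: encoder self-attention / FFN, decoder self-attention /
# encoder-decoder attention / FFN.
SUBLAYER_TYPES = ["enc_ee", "enc_ffn", "dec_dd", "dec_ed", "dec_ffn"]

def probe_sensitivity(model, bits=2):
    """Quantize one sublayer type at a time (no retraining) and record the BLEU drop."""
    drops = {}
    for name in SUBLAYER_TYPES:
        quantized = quantize_sublayers(model, only=name, bits=bits)  # hypothetical helper
        drops[name] = BASELINE_BLEU - evaluate_bleu(quantized)       # hypothetical helper
    return drops  # larger drop => more sensitive => assign more bits
```

Sublayer types with the largest drops (Encffn and Deced on this slide) then receive the larger bit budgets in the mixed-precision scheme.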
  • 23. ❏ Quantization method ① Train a FP baseline (base configuration) Implementation Details - Retraining
  • 24. ❏ Quantization method ① Train a FP baseline (base configuration) ② Retrain and quantize Emb. ③ Retrain and quantize Emb. + Dec. ④ Retrain and quantize Emb. + Dec. + Enc. Implementation Details - Retraining
  • 25. ❏ Quantization method ① Train a FP baseline (base configuration) ② Retrain and quantize Emb. ③ Retrain and quantize Emb. + Dec. ④ Retrain and quantize Emb. + Dec. + Enc. ❏ Details ❏ We adopt pNR in our retraining (Fig. 5) ❏ pNR modulates the otherwise strong regularization strength (pNR of 1000 to 2000 works 👍) Implementation Details - Retraining Fig. 5. Illustration of the pNR variable in action. In our case, the weight regularization applied is binary-coding-based quantization.
  • 26. ❏ Quantization method ① Train a FP baseline (base configuration) ② Retrain and quantize Emb. ③ Retrain and quantize Emb. + Dec. ④ Retrain and quantize Emb. + Dec. + Enc. ❏ Details ❏ We adopt pNR in our retraining (Fig. 5; sketched below) ❏ pNR modulates the otherwise strong regularization strength (pNR of 1000 to 2000 works 👍) ❏ We set quantization bits as follows: ❏ Decdd, Deced, Decffn: 2, 3, 1 bits (1.8-bit decoder) ❏ Encee, Encffn: 3, 4 bits (3.7-bit encoder) → 2.6-bit Transformer overall ❏ Emb: Algorithm 1 with b=4, r=1 (2.5-bit embedding) ❏ Bias and layer normalization parameters are not quantized More details in the paper Implementation Details - Retraining Fig. 5. Illustration of the pNR variable in action. In our case, the weight regularization applied is binary-coding-based quantization.
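A minimal sketch of one retraining stage under the pNR scheme described on this slide: full-precision shadow weights are updated every step and snapped to their binary-coding approximation only every pNR mini-batches. train_step(), quantize_weights(), and train_loader are hypothetical placeholders rather than the paper's code, and the bit-budget dictionary simply restates the numbers on this slide.

```python
# Per-sublayer bit budget from this slide (1.8-bit decoder, 3.7-bit encoder).
BITS = {"dec_dd": 2, "dec_ed": 3, "dec_ffn": 1,
        "enc_ee": 3, "enc_ffn": 4}
PNR = 1500  # 1000-2000 mini-batches between quantizations is reported to work well

def retrain_stage(model, train_loader, num_steps, scope):
    """One retraining stage, e.g., scope = Emb., then Emb. + Dec., then Emb. + Dec. + Enc."""
    for step, batch in zip(range(num_steps), train_loader):
        train_step(model, batch)                              # hypothetical FP training step
        if (step + 1) % PNR == 0:
            quantize_weights(model, scope=scope, bits=BITS)   # hypothetical: snap to sum_i alpha_i * B_i
    quantize_weights(model, scope=scope, bits=BITS)           # final snap before the next stage
    return model
```

Staging the retraining (embedding first, then decoder, then encoder) matches steps ②-④ above; each stage widens the scope of quantized parameters while the rest stay in FP.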
  • 27. Results ❏ We achieve 2.6-bit Transformer models ❏ Within 0.5 BLEU of the FP baseline ❏ 11.8X compression in actual model size (Table 2 row annotations: 1st retraining = Emb. only; 2nd retraining = Emb. + Dec.; last retraining = Emb. + Dec. + Enc.) Table 2. Average quantization bits, BLEU scores, and model sizes of our quantized models. Note that the 2.5-bit embeddings are quantized using Algorithm 1 with b=4, r=1
  • 28. Results ❏ We achieve a 3.5X speedup and 8.3X runtime memory compression with the 2.6-bit model ❏ Most of the speedup comes from embedding and decoder quantization ❏ Recap - they are repeatedly loaded during the decode steps, which have limited parallelism ❏ Encoder quantization contributes little to the speedup (high parallelizability) Table 3. Speedup and runtime memory reduction achieved with quantization. Tested on-device inference on a Galaxy N10+ with the En2De model translating newstest2013 using our C++ code
  • 29. More Analysis ❏ Memory inefficiency in the FP Transformer ❏ Comparison with other quantization strategies Table 4. FLOPs required by a translation run, and the actual on-device latency attributable to each of the Transformer blocks. A translation run denotes a 30-word-input to 30-word-output translation. Tested on a Galaxy N10+ Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. Bhandare, Aishwarya, et al. "Efficient 8-bit quantization of transformer neural machine language translation model." arXiv preprint arXiv:1906.00532 (2019). Prato, Gabriele, Ella Charlaix, and Mehdi Rezagholizadeh. "Fully quantized transformer for improved translation." arXiv preprint arXiv:1910.10485 (2019). Table 5. Comparison with other quantization strategies. Bhandare et al. (2019) report a 1.5X speedup from quantization and Prato et al. (2019) report no speed measurements. More results and analysis available in our paper!
  • 30. References - Figures and Tables ❏ Fig. 1. The Transformer architecture from: Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. ❏ Fig. 2. INT8 quantization from: "Quantization - Neural Network Distiller", https://nervanasystems.github.io/distiller/algo_quantization.html. ❏ Fig. 5. Illustration of the pNR variable in action from: Lee, Dongsoo, et al. "Decoupling Weight Regularization from Batch Size for Model Compression", Sep. 2019, openreview.net, https://openreview.net/forum?id=BJlaG0VFDH. ❏ All other tables and figures from: Chung, Insoo, Byeongwook Kim, et al. "Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation." Findings of the Association for Computational Linguistics: EMNLP 2020.