© 2019 Synopsys
5+ Techniques for Efficient Implementation of Neural Networks
Bert Moons, Synopsys
May 2019
Introduction – Challenges of Embedding Deep Learning
3 Major challenges
Introduction – Challenges of embedded deep learning
1. Neural network accuracy comes at a high cost in terms of model storage and operations per input feature
2. Many embedded applications require real-time operation on high-dimensional, large input data from various input sources
3. Many embedded applications require support for a variety of networks: CNNs in feature extraction, RNNs in sequence modeling
3 Major challenges
Introduction – Challenges of embedded deep learning
1. Many operations per pixel
2. Many pixels to process in real time
3. A wide variety of algorithms to support
Classification accuracy comes at a cost
Introduction – Challenges of embedded deep learning
[Chart: best reported top-5 accuracy on ImageNet-1000 [%] for conventional machine learning, deep learning, and human performance]
Neural network accuracy comes at the cost of a high workload per input pixel and huge model sizes and bandwidth requirements
Computing on large input data
Introduction – Challenges of embedded deep learning
[Figure: relative input size: ImageNet (1X), Full HD (40X), 4K (160X)]
Embedded applications require real-time operation on large input frames
Massive workload in real-time applications
Introduction – Challenges of embedded deep learning
[Left chart: Top-1 ImageNet accuracy [%] (65-75) vs. # operations per ImageNet image for MobileNet V2, GoogleNet, ResNet-50 and VGG-16: 1 GOP to 10 GOP per ImageNet image]
[Right chart: Top-1 ImageNet accuracy [%] (65-75) vs. # operations per second for 6 cameras, 30 fps, Full HD images: 5 to 180 TOPS @ 30 fps, FHD, ADAS]
5+ Techniques to reduce the DNN workload
Introduction – Challenges of embedded deep learning
A. Neural Networks are error-tolerant
   1. Linear post-training 8/12/16b quantization
   2. Linear trained 2/4/8 bit quantization
   3. Non-linear trained 2/4/8 bit quantization through clustering
B. Neural Networks have redundancies and are over-dimensioned
   4. Network pruning and compression
   5. Network decomposition: low-rank network approximations
C. Neural Networks have sparse and correlated intermediate results
   6. Sparsity- and correlation-based feature map compression
A. Neural Networks Are Error-Tolerant
The benefits of quantized number representations
5 Techniques – A. Neural Networks Are Error-Tolerant
8 bit fixed is 3-4x faster, 2-4x more efficient than 16b floating point

                                      16b float        8b fixed         4b fixed
Energy consumption per unit           ~16              ~6-8             ~2-4
Processing units per chip /
classification time per chip          O(1) rel. fps    O(16) rel. fps   O(256) rel. fps
Relative accuracy                     100% (no loss)   99%              50-95%

* [Choi,2019]
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Convert floating point pretrained models to Dynamic Fixed Point
[Figure: plain fixed point shares one system-wide exponent across all values; dynamic fixed point splits the values into groups, each with its own shared exponent (Group 1 Exponent, Group 2 Exponent)]
* [Courbariaux,2014]
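To make the idea concrete, here is a minimal NumPy sketch (illustrative, not code from the talk): each exponent group shares one power-of-two scale derived from its dynamic range, so values are stored as small integers plus one exponent per group. Function names and the choice of one layer per group are assumptions for the example.

```python
import numpy as np

def quantize_group(x, n_bits=8):
    """Quantize one exponent group (e.g. one layer's weights) to n-bit
    dynamic fixed point: integers plus one shared fractional-bit count."""
    max_abs = float(np.max(np.abs(x))) + 1e-12
    int_bits = int(np.ceil(np.log2(max_abs)))      # bits needed left of the binary point
    frac_bits = (n_bits - 1) - int_bits            # the group's shared exponent is -frac_bits
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax).astype(np.int32)
    return q, frac_bits

def dequantize_group(q, frac_bits):
    return q.astype(np.float32) * 2.0 ** -frac_bits

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # one group: a conv layer's weights
q, frac_bits = quantize_group(w, n_bits=8)
err = np.max(np.abs(w - dequantize_group(q, frac_bits)))
print(f"shared frac bits: {frac_bits}, max quantization error: {err:.4f}")
```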
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Dynamic Fixed-Point Quantization allows running neural networks with 8-bit weights and activations across the board
[Table: 32-bit float baseline vs. 8-bit fixed point accuracy across networks]
* [Nvidia,2017]
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
How to optimally choose dynamic exponent groups, saturation thresholds, and weight and activation exponents?
Min-max scaling throws away small values; a saturation threshold better represents small values, but clips large values.
* [Nvidia,2017]
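As a hedged illustration of why a saturation threshold can beat min-max scaling, the sketch below compares the two on synthetic long-tailed activations. The percentile heuristic is a stand-in of my own choosing, not the KL-divergence calibration used in the TensorRT talk referenced as [Nvidia,2017]; names and constants are illustrative.

```python
import numpy as np

def symmetric_quantize(x, threshold, n_bits=8):
    """Symmetric linear quantizer clipped to +/- threshold, then dequantized,
    so that the representation error can be measured."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = threshold / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

acts = np.random.laplace(scale=1.0, size=200_000)            # long-tailed activations
thresholds = {
    "min-max":           np.max(np.abs(acts)),                # no clipping, coarse resolution
    "99.9th percentile": np.percentile(np.abs(acts), 99.9),   # clips rare outliers
}
for name, thr in thresholds.items():
    mse = np.mean((acts - symmetric_quantize(acts, thr)) ** 2)
    print(f"{name:>18}: threshold = {thr:6.2f}, MSE = {mse:.6f}")
```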
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Floating point models are a bad initializer for low-precision fixed point.
Trained quantization from scratch automates heuristic-based optimization.
Quantize weights and activations with straight-through estimators, allowing back-propagation, and train the saturation range for activations.
[Figure: forward and backward passes of the quantized layer]
* PACT, Parametrized Clipping Activation [Choi,2019]
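A minimal PyTorch sketch of the two ingredients named here, simplified relative to the PACT paper and assuming PyTorch is available: a straight-through estimator for the quantizer, and a learnable clipping level alpha for the activations. Class and parameter names are mine.

```python
import torch
import torch.nn as nn

class STERound(torch.autograd.Function):
    """k-bit uniform quantizer with a straight-through estimator:
    round in the forward pass, pass gradients through unchanged."""
    @staticmethod
    def forward(ctx, x, k):
        n = 2 ** k - 1
        return torch.round(x * n) / n          # x is expected in [0, 1]

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None               # identity gradient w.r.t. x, none for k

class PACTActivation(nn.Module):
    """PACT-style activation: clip to a trainable range [0, alpha], then quantize."""
    def __init__(self, k=4, alpha_init=6.0):
        super().__init__()
        self.k = k
        self.alpha = nn.Parameter(torch.tensor(alpha_init))   # trained by back-prop

    def forward(self, x):
        # 0.5 * (|x| - |x - alpha| + alpha) equals clamp(x, 0, alpha) but keeps
        # a well-defined gradient w.r.t. alpha for inputs above the clipping level.
        y = 0.5 * (x.abs() - (x - self.alpha).abs() + self.alpha)
        return STERound.apply(y / self.alpha, self.k) * self.alpha

act = PACTActivation(k=4)
x = torch.randn(16, requires_grad=True)
act(x).sum().backward()
print("grad w.r.t. alpha:", float(act.alpha.grad),
      "nonzero input grads:", int((x.grad != 0).sum()))
```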
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Good accuracy down to 2b, with graceful performance degradation
[Chart: relative benchmark accuracy vs. float baseline (0.85-1.05) for CIFAR10, SVHN, AlexNet, ResNet18 and ResNet50 at full precision, 5b, 4b, 3b and 2b]
* [Choi,2018]
Non-linear trained quantization – codebook clustering
5 Techniques – A. Neural Networks Are Error-Tolerant
Clustered, codebook quantization can be optimally trained.
This only reduces bandwidth; computations are still performed in floating point.
* [Han,2015]
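A minimal sketch of codebook (k-means) weight clustering in the spirit of [Han,2015], assuming NumPy and scikit-learn are available. The retraining step that fine-tunes the codebook entries by accumulating gradients per cluster is omitted, and the function names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(w, n_clusters=16):
    """Replace each weight by the index of its nearest codebook entry.
    Storage drops to log2(n_clusters) bits per weight plus a tiny codebook."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(w.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel().astype(np.float32)   # n_clusters float values
    indices = km.labels_.astype(np.uint8)                       # 4-bit indices for 16 clusters
    return codebook, indices

def reconstruct(codebook, indices, shape):
    return codebook[indices].reshape(shape)

w = np.random.randn(256, 64).astype(np.float32)
codebook, idx = cluster_weights(w, n_clusters=16)
w_hat = reconstruct(codebook, idx, w.shape)
print("mean |error|:", float(np.abs(w - w_hat).mean()))
```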
B. Neural Networks Are Over-Dimensioned & Redundant
Pruning Neural Networks
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Pruning removes unnecessary connections in the neural network. Accuracy is recovered through retraining the pruned network.
* [Han,2015]
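A minimal NumPy sketch of magnitude-based, unstructured pruning (the specific criterion and names are illustrative assumptions): the smallest-magnitude weights are zeroed and the resulting mask is kept so retraining cannot revive them.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero the smallest |w| until the requested fraction of weights is removed."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = (np.abs(w) > threshold).astype(w.dtype)
    return w * mask, mask

w = np.random.randn(512, 512).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.8)
print(f"kept {int(mask.sum())} of {mask.size} weights")

# During retraining, pruned connections are kept at zero by masking the update:
#     w -= learning_rate * gradient * mask
```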
Low Rank Singular Value Decomposition (SVD) in DNNs
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Many singular values are small and can be discarded
A = U Σ Vᵀ   →   A ≅ U′ Σ′ V′ᵀ = U′ N, the product of two much smaller matrices (with N = Σ′ V′ᵀ)
* [Xue,2013]
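A minimal NumPy sketch of the low-rank replacement: one m×n layer becomes two layers with r(m + n) parameters instead of mn. The rank, layer size and the synthetic decaying spectrum are illustrative assumptions.

```python
import numpy as np

def svd_compress(W, rank):
    """Factor W (m x n) into A = U' S' (m x r) and N = V'^T (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]

# Build a weight matrix with a fast-decaying spectrum, as trained layers often have.
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((1024, 1024)))
V0, _ = np.linalg.qr(rng.standard_normal((1024, 1024)))
W = (U0 * (1.0 / (1.0 + np.arange(1024)))) @ V0.T

A, N = svd_compress(W, rank=64)
x = rng.standard_normal(1024)
rel_err = np.linalg.norm(W @ x - A @ (N @ x)) / np.linalg.norm(W @ x)
print(f"parameters: {W.size} -> {A.size + N.size}, relative output error: {rel_err:.3f}")
```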
Low Rank Canonical Polyadic (CP) decomposition in CNNs
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Convert a large convolutional filter into a triplet of smaller filters
* [Astrid,2017]
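To show where the savings come from, here is a small cost model in pure Python. The triplet structure (1x1 pointwise, KxK depthwise on R channels, 1x1 pointwise) and the chosen rank and layer sizes are illustrative assumptions, not the exact factorization from [Astrid,2017].

```python
def conv_cost(c_in, c_out, k, h, w):
    """Parameters and MACs of a standard k x k convolution on an h x w output map."""
    params = c_in * c_out * k * k
    return params, params * h * w

def cp_triplet_cost(c_in, c_out, k, h, w, rank):
    """Rank-R triplet: 1x1 (c_in -> R), k x k depthwise on R channels, 1x1 (R -> c_out)."""
    params = c_in * rank + rank * k * k + rank * c_out
    return params, params * h * w

p0, m0 = conv_cost(256, 256, 3, 56, 56)
p1, m1 = cp_triplet_cost(256, 256, 3, 56, 56, rank=64)
print(f"params: {p0:,} -> {p1:,} ({p0 / p1:.1f}x), MACs: {m0:,} -> {m1:,} ({m0 / m1:.1f}x)")
```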
Basic example: Combining SVD, pruning and clustering
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
11x model compression in a phone-recognition LSTM
[Chart: LSTM compression rate (0-12x) for Base, P, SVD, SVD+P, P+C and SVD+P+C]
P = Pruning; SVD = Singular Value Decomposition; C = Clustering / Codebook Compression
* [Goetschalckx,2018]
C. Neural Networks Have Sparse & Correlated Intermediate Results
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Feature map bandwidth dominates in modern CNNs
[Chart: bandwidth in MobileNet-V1 [MB] (0-12), coefficient BW vs. feature map BW, broken down per layer type (1x1, 3x3 DW)]
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
ReLU activation introduces 50-90% zero-valued numbers in intermediate feature maps

8b features before ReLU:        8b features after ReLU:
  -5    4   12                     0    4   12
 -10    0   17                     0    0   17
  -1    3    2                     0    3    2
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Hardware support for multi-bit Huffman encoding allows up to 2x bandwidth reduction in typical networks.
Zero-runlength encoding as in [Chen, 2016], Huffman encoding as in [Moons, 2017]

Example: the 3x3 post-ReLU feature map above (9 x 8b = 72b uncompressed) encoded with the codebook
  zero     -> 2'b00
  < 16     -> 2'b01, 4'b WORD
  nonzero  -> 1'b1, 8'b WORD
takes only 41b < 72b.
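A minimal Python sketch of the codebook shown above; the string-of-bits representation is for clarity only (real hardware would pack bits), and the function name is mine. Zeros cost 2 bits, small values 6 bits, everything else 9 bits.

```python
def encode_features(values):
    """Encode non-negative 8b post-ReLU features with the slide's codebook:
    zero -> '00', value < 16 -> '01' + 4-bit word, otherwise '1' + 8-bit word."""
    out = []
    for v in values:
        if v == 0:
            out.append("00")
        elif v < 16:
            out.append("01" + format(v, "04b"))
        else:
            out.append("1" + format(v, "08b"))
    return "".join(out)

feature_map = [0, 4, 12, 0, 0, 17, 0, 3, 2]        # the 3x3 example above, row by row
bits = encode_features(feature_map)
print(f"{len(bits)}b compressed vs {8 * len(feature_map)}b uncompressed")   # 41b vs 72b
```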
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Intermediate features in the same channel plane are highly correlated
[Figure: intermediate feature maps in ReLU-less YOLOv2; example feature maps at scale 1 and scale 9]
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Super-linear, correlation-based extended bit-plane compression allows feature-map compression even on non-sparse data
Correlated values: 16, 20, 20, 20, 28, 28, 28, 99
Delta values:       0,  4,  0,  0,  8,  0,  0, 71  (mostly zeros and small non-zero values)
* [Cavigelli,2018]
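The full extended bit-plane scheme of [Cavigelli,2018] is more involved, but the core prediction step can be sketched as simple delta coding along a row (NumPy; keeping the first value and predicting from the left neighbour is an illustrative choice):

```python
import numpy as np

def delta_encode(row):
    """Keep the first value, then store differences from the left neighbour.
    Correlated rows become mostly zero / small deltas, which a short-word
    code like the one on the previous slide compresses well."""
    deltas = row.copy()
    deltas[1:] = row[1:] - row[:-1]
    return deltas

def delta_decode(deltas):
    return np.cumsum(deltas)

row = np.array([16, 20, 20, 20, 28, 28, 28, 99])   # the correlated values from the slide
deltas = delta_encode(row)
print("deltas:", deltas.tolist())                  # [16, 4, 0, 0, 8, 0, 0, 71]
assert np.array_equal(delta_decode(deltas), row)
```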
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Correlated compression outperforms sparsity-based compression
[Chart: compression rate (0-3x), sparsity-based vs. correlation-based, for MobileNet, ResNet-50, YOLO V2 VOC and VGG-16]
Conclusion – Bringing it All Together
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
A first-order energy model for Neural Network Inference (a code sketch follows below). Assume:
• Quadratic energy scaling per MAC when going from 32 to 8 bit
• Linear energy saving per read/write in DDR/SRAM when going from 32 to 8 bit
• 50% of coefficients zero when pruning
• 50% compute reduction under decomposition
• 50% of activations can be compressed
* [Han,2015]
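Purely to make the arithmetic concrete, here is a minimal sketch of such a first-order model in Python. The energy constants and the mapping of pruning and compression onto DDR/SRAM traffic are illustrative assumptions, not numbers from the talk.

```python
# Placeholder energy constants, chosen only to make relative trends visible.
E_MAC_32B  = 4.0      # relative energy per 32-bit MAC
E_DDR_32B  = 400.0    # relative energy per 32-bit word moved to/from DDR
E_SRAM_32B = 20.0     # relative energy per 32-bit word moved to/from SRAM

def frame_energy(macs, ddr_words, sram_words, bits=32,
                 prune=False, decompose=False, compress_fmaps=False):
    mac_e  = E_MAC_32B  * (bits / 32) ** 2   # quadratic energy scaling per MAC
    ddr_e  = E_DDR_32B  * (bits / 32)        # linear scaling per memory access
    sram_e = E_SRAM_32B * (bits / 32)
    if prune:           # 50% of coefficients zero; assume DDR traffic is coefficient-dominated
        macs, ddr_words = 0.5 * macs, 0.5 * ddr_words
    if decompose:       # 50% compute reduction
        macs = 0.5 * macs
    if compress_fmaps:  # 50% of activations compressible by 2x, i.e. 25% less traffic overall
        sram_words = 0.75 * sram_words
    return macs * mac_e + ddr_words * ddr_e + sram_words * sram_e

# Order-of-magnitude ResNet-50 traffic from the next slides:
# ~3.6G MACs, ~65M words of DDR and ~1G words of SRAM traffic per frame.
baseline = frame_energy(3.6e9, 65e6, 1e9, bits=32)
steps = [("A. 8b fixed", {}),
         ("B. + decomposition and pruning", dict(prune=True, decompose=True)),
         ("C. + feature-map compression", dict(prune=True, decompose=True, compress_fmaps=True))]
for label, kwargs in steps:
    e = frame_energy(3.6e9, 65e6, 1e9, bits=8, **kwargs)
    print(f"{label:32s} {e / baseline:5.0%} of the 32b-float energy")
```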
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
When all model data is stored in DRAM, optimized ResNet-50 is 10x more efficient than its plain 32b counterpart
O(65MB) DDR / frame, O(1GB) SRAM / frame, O(3.6G) MACs / frame
[Chart: relative energy consumption; 32b float: 100%, A. 8b fixed: 22%, B. Decomposition + Pruning: 16%, C. Featuremap Compression: 11% (about 10x)]
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
In a system with sufficient on-chip SRAM, optimized ResNet-50 is 12.5x more efficient than its plain 32b counterpart
O(0MB) DDR / frame, O(1GB) SRAM / frame, O(3.6G) MACs / frame
[Chart: relative energy consumption; 32b float: 100%, A. 8b fixed: 15%, B. Decomposition + Pruning: 9%, C. Featuremap Compression: 8% (about 12.5x)]
For More Information
Visit the Synopsys booth for demos on Automotive ADAS, Virtual Reality & More
EV6x Embedded Vision Processor IP with Safety Enhancement Package ("Best Processor")
Join Synopsys' EV Seminar on Thursday: Navigating Embedded Vision at the Edge
• Thursday, May 23
• Santa Clara Convention Center
• Doors open 8 AM
• Sessions on EV6x Vision Processor IP, Functional Safety, Security, OpenVX…
• Register via the EV Alliance website or at Synopsys Booth
References
[Han, 2015, 2016] https://arxiv.org/abs/1510.00149 ; https://arxiv.org/abs/1602.01528
[Xue, 2013] https://www.microsoft.com/en-us/research/wp-content/uploads/2013/01/svd_v2.pdf
[Nvidia, 2017] http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
[Choi, 2018, 2019] https://arxiv.org/abs/1805.06085 ; https://www.ibm.com/blogs/research/2019/04/2-bit-precision/
[Goetschalckx, 2018] https://www.sigmobile.org/mobisys/2018/workshops/deepmobile18/papers/Efficiently_Combining_SVD_Pruning_Clustering_Retraining.pdf
[Astrid, 2017] https://arxiv.org/abs/1701.07148
[Moons, 2017] https://ieeexplore.ieee.org/abstract/document/7870353
[Chen, 2016] http://eyeriss.mit.edu/
[Cavigelli, 2018] https://arxiv.org/abs/1810.03979
[Courbariaux, 2014] https://arxiv.org/pdf/1412.7024.pdf
Embedded Vision Summit
Bert Moons – 5+ Techniques for Efficient Implementation of Neural Networks
May 2019
THANK YOU
© 2019 Synopsys
5+ Techniques for Efficient
Implementation of Neural
Networks
Bert Moons
Synopsys
May 2019
© 2019 Synopsys
Introduction --
Challenges of Embedding
Deep Learning
© 2019 Synopsys
Neural Network accuracy comes at a high cost in terms of model
storage and operations per input feature
3 Major challenges
Introduction – Challenges of embedded deep learning
© 2019 Synopsys
Neural Network accuracy comes at a high cost in terms of model
storage and operations per input feature
3 Major challenges
Introduction – Challenges of embedded deep learning
Many embedded applications require real-time operation on high-
dimensional, large input data from various input sources
© 2019 Synopsys
Neural Network accuracy comes at a high cost in terms of model
storage and operations per input feature
3 Major challenges
Introduction – Challenges of embedded deep learning
Many embedded applications require real-time operation on high-
dimensional, large input data from various input sources
Many embedded applications require support for a variety of
networks: CNN’s in feature extraction, RNN’s in sequence modeling
© 2019 Synopsys
Neural Network accuracy comes at a high cost in terms of model
storage and operations per input feature
3 Major challenges
Introduction – Challenges of embedded deep learning
Many embedded applications require real-time operation on high-
dimensional, large input data from various input sources
Many embedded applications require support for a variety of
networks: CNN’s in feature extraction, RNN’s in sequence modeling
1. Many operations per pixel
2. Process a lot of pixels in real-time
© 2019 Synopsys
Neural Network accuracy comes at a high cost in terms of model
storage and operations per input feature
3 Major challenges
Introduction – Challenges of embedded deep learning
Many embedded applications require real-time operation on high-
dimensional, large input data from various input sources
Many embedded applications require support for a variety of
networks: CNN’s in feature extraction, RNN’s in sequence modeling
1. Many operations per pixel
2. Process a lot of pixels in real-time
3. A large variation of different algorithms
© 2019 Synopsys
Neural Network accuracy comes at a high cost in terms of model
storage and operations per input feature
3 Major challenges
Introduction – Challenges of embedded deep learning
Many embedded applications require real-time operation on high-
dimensional, large input data from various input sources
Many embedded applications require support for a variety of
networks: CNN’s in feature extraction, RNN’s in sequence modeling
1. Many operations per pixel
2. Process a lot of pixels in real-time
3. A large variation of different algorithms
© 2019 Synopsys
Classification accuracy comes at a cost
Introduction – Challenges of embedded deep learning
Conventional Machine
Learning
Deep
Learning
Human
Bestreportedtop-5accuracy
onIMAGENET-1000[%]
Neural network accuracy comes at a cost of a high workload per input pixel
and huge model sizes and bandwidth requirements
© 2019 Synopsys
Computing on large input data
Introduction – Challenges of embedded deep learning
4KFHDIMAGENET
1X 40X 160X
Embedded applications require
real-time operation on large input frames
© 2019 Synopsys
Massive workload in real-time applications
Introduction – Challenges of embedded deep learning
1GOP 1TOP
Top-1IMAGENETaccuracy[%]
70
75
65
Single
ImageNet
Image
1GOP to 10GOP per IMAGENET image
# operations / ImageNet image
MobileNet V2
ResNet-50
GoogleNet
VGG-16
© 2019 Synopsys
Massive workload in real-time applications
Introduction – Challenges of embedded deep learning
1GOP 1TOP
Top-1IMAGENETaccuracy[%]
70
75
65
Single
ImageNet
Image
1GOP to 10GOP per IMAGENET image
# operations / ImageNet image1GOP/s 1TOP/s
Top-1IMAGENETaccuracy[%]
70
75
65
6 Cameras
30fps
Full HD
Image
5-to-180 TOPS @ 30 fps, FHD, ADAS
# operations / second
MobileNet V2
ResNet-50
GoogleNet
VGG-16
© 2019 Synopsys
5+ Techniques to reduce the DNN workload
A. Neural Networks are
error-tolerant
Introduction – Challenges of embedded deep learning
C. Neural Networks have
sparse and correlated
intermediate results
B. Neural Networks have
redundancies and
are over-dimensioned
© 2019 Synopsys
5+ Techniques to reduce the DNN workload
A. Neural Networks are
error-tolerant
Introduction – Challenges of embedded deep learning
1. Linear post-training 8/12/16b quantization
2. Linear trained 2/4/8 bit quantization
3. Non-linear trained 2/4/8 bit quantization
through clustering
C. Neural Networks have
sparse and correlated
intermediate results
B. Neural Networks have
redundancies and
are over-dimensioned
© 2019 Synopsys
5+ Techniques to reduce the DNN workload
A. Neural Networks are
error-tolerant
Introduction – Challenges of embedded deep learning
1. Linear post-training 8/12/16b quantization
2. Linear trained 2/4/8 bit quantization
3. Non-linear trained 2/4/8 bit quantization
through clustering
C. Neural Networks have
sparse and correlated
intermediate results
B. Neural Networks have
redundancies and
are over-dimensioned
4. Network pruning and compression
5. Network decomposition: low-rank network
approximations
© 2019 Synopsys
5+ Techniques to reduce the DNN workload
A. Neural Networks are
error-tolerant
Introduction – Challenges of embedded deep learning
1. Linear post-training 8/12/16b quantization
2. Linear trained 2/4/8 bit quantization
3. Non-linear trained 2/4/8 bit quantization
through clustering
C. Neural Networks have
sparse and correlated
intermediate results
B. Neural Networks have
redundancies and
are over-dimensioned
4. Network pruning and compression
5. Network decomposition: low-rank network
approximations
6. Sparsity and correlation based feature map
compression
© 2019 Synopsys
A. Neural Networks Are Error-
Tolerant
52
© 2019 Synopsys
The benefits of quantized number representations
5 Techniques – A. Neural Networks Are Error-Tolerant
8 bit fixed is 3-4x faster, 2-4x more efficient than 16b floating point
Energy consumption
per unit
Processing
units per chip
Classification
time per chip
* [Choi,2019]
16b float 8b fixed
O(1)
relative fps
O(16)
relative fps
4b fixed
O(256)
relative fps
~ 16 ~ 6-8 ~ 2-4
Relative accuracy 100%
no loss
99% 50-95%
© 2019 Synopsys
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Convert floating point pretrained models to Dynamic Fixed Point
0 1 1 0 1
1 0 0 1 1 1 0 1 0 1
0 1 1 1 0 0 0 0 1 0
0 0 1 0 1 1 0 0 1 1
1 1 1 0 0 1 0 1 0 0
0 1 1 0 1
1 0 0 1 1 1 0 1 0 1
0 1 1 1 0 0 0 0 1 0
0 0 1 0 1 1 0 0 1 1
1 1 1 0 0 1 0 1 0 0
Fixed Point Dynamic Fixed Point
0 1 1 0 1System Exponent Group 1 Exponent
Group 2 Exponent
* [Courbariaux,2019]
© 2019 Synopsys
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Dynamic Fixed-Point Quantization allows running neural networks with 8
bit weights and activations across the board
32 bit float baseline 8 bit fixed point
* [Nvidia,2017]
© 2019 Synopsys
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
How to optimally choose: dynamic exponent groups, saturation
thresholds, weight and activation exponents?
Min-max scaling throws away small values A saturation threshold better represents
small values, but clips large values
* [Nvidia,2017]
© 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Floating point models are a bad initializer for low-precision fixed-point.
Trained quantization from scratch automates heuristic-based optimization.
Quantizing weights and activations with straight-
through estimators, allowing back-prop
Forward Backward
* PACT, Parametrized Clipping
Activation [Choi,2019]
© 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Floating point models are a bad initializer for low-precision fixed-point.
Trained quantization from scratch automates heuristic-based optimization.
Quantizing weights and activations with straight-
through estimators, allowing back-prop +
Train saturation range
for activations
Forward Backward
* PACT, Parametrized Clipping
Activation [Choi,2019]
© 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Floating point models are a bad initializer for low-precision fixed-point.
Trained quantization from scratch automates heuristic-based optimization.
Quantizing weights and activations with straight-
through estimators, allowing back-prop +
Train saturation range
for activations
Forward Backward
* PACT, Parametrized Clipping
Activation [Choi,2019]
© 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Floating point models are a bad initializer for low-precision fixed-point.
Trained quantization from scratch automates heuristic-based optimization.
Quantizing weights and activations with straight-
through estimators, allowing back-prop +
Train saturation range
for activations
Forward Backward
* PACT, Parametrized Clipping
Activation [Choi,2019]
© 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Floating point models are a bad initializer for low-precision fixed-point.
Trained quantization from scratch automates heuristic-based optimization.
Quantizing weights and activations with straight-
through estimators, allowing back-prop +
Train saturation range
for activations
Forward Backward
* PACT, Parametrized Clipping
Activation [Choi,2019]
*
© 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Good accuracy down to 2b
Graceful performance degradation
0.85
0.9
0.95
1
1.05
CIFAR10 SVHN AlexNet ResNet18 ResNet50
full precision 5b 4b 3b 2b
* [Choi,2018]
Relativebenchmark
accuracyvsfloatbaseline*
© 2019 Synopsys
Non-linear trained quantization – codebook clustering
5 Techniques – A. Neural Networks Are Error-Tolerant
Clustered, codebook quantization can be optimally trained.
This only reduces bandwidth, computations are still in floating point.
* [Han,2015]
© 2019 Synopsys
B. Neural Networks Are Over-
Dimensioned & Redundant
64
© 2019 Synopsys
Pruning Neural Networks
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Pruning removes unnecessary connections in the neural network. Accuracy
is recovered through retraining the pruned network
* [Han,2015]
© 2019 Synopsys
Low Rank Singular Value Decomposition (SVD) in DNNs
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Many singular values are small and can be discarded
* [Xue,2013]
𝑨 = 𝑼 𝜮 𝑽 𝑻
𝑨 ≅ 𝑼′𝜮′𝑽′ 𝑻 = 𝑵𝑼
© 2019 Synopsys
Low Rank Canonical Polyadic (CP) decomp. in CNNs
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Convert a large convolutional filter in a triplet of smaller filters
* [Astrid,2017]
© 2019 Synopsys
Basic example: Combining SVD, pruning and clustering
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
11x model compression in a phone-recognition LSTM
0
2
4
6
8
10
12
Base P SVD SVD+P P+C SVD+P+C
LSTMCompressionRate
(P)
(C)
* [Goetschalckx,2018]
P = Pruning
SVD = Singular Value Decomposition
C = Clustering / Codebook Compression
© 2019 Synopsys
C. Neural Networks have
Sparse & Correlated
Intermediate Results
69
© 2019 Synopsys
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Feature map bandwidth dominates in modern CNNs
0
2
4
6
8
10
12
Coefficient
BW
Feature Map
BW
BWinMObileNet-V1[MB]
1x1, 32
3x3DW, 32
1x1, 64
32
© 2019 Synopsys
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
ReLU activation introduces 50-90% zero-valued numbers in intermediate
feature maps
-5 4 12
-10 0 17
-1 3 2
0 4 12
0 0 17
0 3 2
ReLU activation
8b Features 8b Features
© 2019 Synopsys
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Hardware support for multi-bit Huffman encoding allows up to 2x
bandwidth reduction in typical networks.
Zero-runlength encoding as in [Chen, 2016],
Huffman-encoding as in [Moons, 2017]
0 4 12
0 0 17
0 3 2
8b Features
72b
Huffman Features
zero 2’b00
<16 2’b01, 4’b WORD
nonzero 1’b1, 8’b WORD
41b < 72b
© 2019 Synopsys
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Intermediate features in the same channel-plane are highly correlated
Intermediate featuremaps in ReLU-less YOLOV2
An example featuremap
scale1
An example featuremap
scale9
© 2019 Synopsys
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Super-linear correlation based extended bit-plane compression allows
feature-map compression even on non-sparse data
* [Cavigelli,2018]
0,4,0,0,0,8,0,0,71
Zero Values
Non-Zero
Values
Correlated Values: 16, 20, 20, 20, 28, 28, 28, 99
Delta Values: 0, 4, 0, 0, 8, 0, 0, 71
© 2019 Synopsys
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Correlated compression outperforms sparsity-based compression
0
0.5
1
1.5
2
2.5
3
Mobilenet ResNet-50 Yolo V2 VOC VGG-16
CompressionRate
Sparsity-based Correlation-based
© 2019 Synopsys
Conclusion –
Bringing it All Together
© 2019 Synopsys
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
A first-order energy model for Neural Network Inference
Assume:
• Quadratic energy scaling / MAC
when going from 32 to 8 bit.
• Linear energy saving / read-write in DDR/SRAM
when going from 32 to 8 bit
• 50% of coefficients zero when pruning
• 50% compute reduction under decomposition
• 50% of activations can be compressed
*[Han,2015]
*
© 2019 Synopsys
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
When all model data is stored in DRAM optimized ResNet-50 is 10x
more efficient than its plain 32b counterpart
O(65MB) DDR / frame O(1GB) SRAM / frame O(3.6G) MACS / frame
10x
100%
22%
16%
11%
0%
20%
40%
60%
80%
100%
120%
32b float A. 8b fixed B. Decomposition
+ Pruning
C. Featuremap
Compression
RelativeEnergy
Consumption
© 2019 Synopsys
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
In a system with sufficient on-chip SRAM, optimized ResNet-50 is 12.5x
more efficient than its plain 32b counterpart
O(0MB) DDR / frame O(1GB) SRAM / frame O(3.6G) MACS / frame
100%
15%
9% 8%
0%
20%
40%
60%
80%
100%
120%
32b float A. 8b fixed B. Decomposition
+ Pruning
C. Featuremap
Compression
RelativeEnergy
Consumption
12.5x
© 2019 Synopsys
For More Information
Visit the Synopsys booth for
demos on Automotive ADAS,
Virtual Reality & More
80
EV6x Embedded Vision
Processor IP with Safety
Enhancement Package
• Thursday, May 23
• Santa Clara Convention Center
• Doors open 8 AM
• Sessions on EV6x Vision Processor IP, Functional Safety, Security, OpenVX…
• Register via the EV Alliance website or at Synopsys Booth
Join Synopsys’ EV Seminar on Thursday
Navigating Embedded Vision at the Edge
B E S T P R O C E S S O R
© 2019 Synopsys
References
81
[Han,2015,2016]
https://arxiv.org/abs/1510.00149
https://arxiv.org/abs/1602.01528
[Xue, 2013]
https://www.microsoft.com/en-us/research/wp-
content/uploads/2013/01/svd_v2.pdf
[Nvidia, 2017]
http://on-demand.gputechconf.com/gtc
/2017/presentation/s7310-8-bit-inference-with-
tensorrt.pdf
[Choi, 2018, 2019]
https://arxiv.org/abs/1805.06085
https://www.ibm.com/blogs/research/2019/04/2-bit-
precision/
[Goetschalckx, 2018]
https://www.sigmobile.org/mobisys/2018/workshops/
deepmobile18/papers/Efficiently_Combining_SVD_Pru
ning_Clustering_Retraining.pdf
[Astrid,2017]
https://arxiv.org/abs/1701.07148
[Moons,2017]
https://ieeexplore.ieee.org/abstract/document/78703
53
[Chen,2016]
http://eyeriss.mit.edu/
[Cavigelli, 2018]
https://arxiv.org/abs/1810.03979
[Courbariaux, 2014]
https://arxiv.org/pdf/1412.7024.pdf
Embedded Vision Summit
Bert Moons --
5+ Techniques for Efficient Implementations
of Neural Networks
May 2019
© 2019 Synopsys
THANK YOU
82

More Related Content

More from Edge AI and Vision Alliance

“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...Edge AI and Vision Alliance
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...Edge AI and Vision Alliance
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...Edge AI and Vision Alliance
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...Edge AI and Vision Alliance
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...Edge AI and Vision Alliance
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...Edge AI and Vision Alliance
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...Edge AI and Vision Alliance
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from SamsaraEdge AI and Vision Alliance
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...Edge AI and Vision Alliance
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...Edge AI and Vision Alliance
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...Edge AI and Vision Alliance
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...Edge AI and Vision Alliance
 
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...Edge AI and Vision Alliance
 
“MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedde...
“MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedde...“MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedde...
“MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedde...Edge AI and Vision Alliance
 
“Introduction to the CSI-2 Image Sensor Interface Standard,” a Presentation f...
“Introduction to the CSI-2 Image Sensor Interface Standard,” a Presentation f...“Introduction to the CSI-2 Image Sensor Interface Standard,” a Presentation f...
“Introduction to the CSI-2 Image Sensor Interface Standard,” a Presentation f...Edge AI and Vision Alliance
 
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic LeapEdge AI and Vision Alliance
 
"Optimizing Image Quality and Stereo Depth at the Edge," a Presentation from ...
"Optimizing Image Quality and Stereo Depth at the Edge," a Presentation from ..."Optimizing Image Quality and Stereo Depth at the Edge," a Presentation from ...
"Optimizing Image Quality and Stereo Depth at the Edge," a Presentation from ...Edge AI and Vision Alliance
 
“Using a Collaborative Network of Distributed Cameras for Object Tracking,” a...
“Using a Collaborative Network of Distributed Cameras for Object Tracking,” a...“Using a Collaborative Network of Distributed Cameras for Object Tracking,” a...
“Using a Collaborative Network of Distributed Cameras for Object Tracking,” a...Edge AI and Vision Alliance
 
“A Survey of Model Compression Methods,” a Presentation from Instrumental
“A Survey of Model Compression Methods,” a Presentation from Instrumental“A Survey of Model Compression Methods,” a Presentation from Instrumental
“A Survey of Model Compression Methods,” a Presentation from InstrumentalEdge AI and Vision Alliance
 
“Reinventing Smart Cities with Computer Vision,” a Presentation from Hayden AI
“Reinventing Smart Cities with Computer Vision,” a Presentation from Hayden AI“Reinventing Smart Cities with Computer Vision,” a Presentation from Hayden AI
“Reinventing Smart Cities with Computer Vision,” a Presentation from Hayden AIEdge AI and Vision Alliance
 

More from Edge AI and Vision Alliance (20)

“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
 
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
 
“MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedde...
“MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedde...“MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedde...
“MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedde...
 
“Introduction to the CSI-2 Image Sensor Interface Standard,” a Presentation f...
“Introduction to the CSI-2 Image Sensor Interface Standard,” a Presentation f...“Introduction to the CSI-2 Image Sensor Interface Standard,” a Presentation f...
“Introduction to the CSI-2 Image Sensor Interface Standard,” a Presentation f...
 
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
 
"Optimizing Image Quality and Stereo Depth at the Edge," a Presentation from ...
"Optimizing Image Quality and Stereo Depth at the Edge," a Presentation from ..."Optimizing Image Quality and Stereo Depth at the Edge," a Presentation from ...
"Optimizing Image Quality and Stereo Depth at the Edge," a Presentation from ...
 
“Using a Collaborative Network of Distributed Cameras for Object Tracking,” a...
“Using a Collaborative Network of Distributed Cameras for Object Tracking,” a...“Using a Collaborative Network of Distributed Cameras for Object Tracking,” a...
“Using a Collaborative Network of Distributed Cameras for Object Tracking,” a...
 
“A Survey of Model Compression Methods,” a Presentation from Instrumental
“A Survey of Model Compression Methods,” a Presentation from Instrumental“A Survey of Model Compression Methods,” a Presentation from Instrumental
“A Survey of Model Compression Methods,” a Presentation from Instrumental
 
“Reinventing Smart Cities with Computer Vision,” a Presentation from Hayden AI
“Reinventing Smart Cities with Computer Vision,” a Presentation from Hayden AI“Reinventing Smart Cities with Computer Vision,” a Presentation from Hayden AI
“Reinventing Smart Cities with Computer Vision,” a Presentation from Hayden AI
 

Recently uploaded

Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Skynet Technologies
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Paige Cruz
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuidePixlogix Infotech
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...ScyllaDB
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfalexjohnson7307
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideStefan Dietze
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdfMuhammad Subhan
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGDSC PJATK
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctBrainSell Technologies
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 

Recently uploaded (20)

Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdf
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 

"Five+ Techniques for Efficient Implementation of Neural Networks," a Presentation from Synopsys

  • 1. © 2019 Synopsys 5+ Techniques for Efficient Implementation of Neural Networks Bert Moons Synopsys May 2019
  • 2. © 2019 Synopsys Introduction -- Challenges of Embedding Deep Learning
  • 3. © 2019 Synopsys Neural Network accuracy comes at a high cost in terms of model storage and operations per input feature 3 Major challenges Introduction – Challenges of embedded deep learning Many embedded applications require real-time operation on high- dimensional, large input data from various input sources Many embedded applications require support for a variety of networks: CNN’s in feature extraction, RNN’s in sequence modeling
  • 4. © 2019 Synopsys 3 Major challenges Introduction – Challenges of embedded deep learning 1. Many operations per pixel 2. Process a lot of pixels in real-time 3. A large variation of different algorithms
  • 5. © 2019 Synopsys Classification accuracy comes at a cost Introduction – Challenges of embedded deep learning Conventional Machine Learning Deep Learning Human Bestreportedtop-5accuracy onIMAGENET-1000[%] Neural network accuracy comes at a cost of a high workload per input pixel and huge model sizes and bandwidth requirements
  • 6. © 2019 Synopsys Computing on large input data Introduction – Challenges of embedded deep learning 4KFHDIMAGENET 1X 40X 160X Embedded applications require real-time operation on large input frames
  • 7. © 2019 Synopsys Massive workload in real-time applications Introduction – Challenges of embedded deep learning 1GOP 1TOP Top-1IMAGENETaccuracy[%] 70 75 65 Single ImageNet Image 1GOP to 10GOP per IMAGENET image # operations / ImageNet image1GOP/s 1TOP/s Top-1IMAGENETaccuracy[%] 70 75 65 6 Cameras 30fps Full HD Image 5-to-180 TOPS @ 30 fps, FHD, ADAS # operations / second MobileNet V2 ResNet-50 GoogleNet VGG-16
  • 8. © 2019 Synopsys 5+ Techniques to reduce the DNN workload A. Neural Networks are error-tolerant Introduction – Challenges of embedded deep learning 1. Linear post-training 8/12/16b quantization 2. Linear trained 2/4/8 bit quantization 3. Non-linear trained 2/4/8 bit quantization through clustering C. Neural Networks have sparse and correlated intermediate results B. Neural Networks have redundancies and are over-dimensioned 4. Network pruning and compression 5. Network decomposition: low-rank network approximations 6. Sparsity and correlation based feature map compression
  • 9. © 2019 Synopsys A. Neural Networks Are Error- Tolerant 9
  • 10. © 2019 Synopsys The benefits of quantized number representations 5 Techniques – A. Neural Networks Are Error-Tolerant 8 bit fixed is 3-4x faster, 2-4x more efficient than 16b floating point Energy consumption per unit Processing units per chip Classification time per chip * [Choi,2019] 16b float 8b fixed O(1) relative fps O(16) relative fps 4b fixed O(256) relative fps ~ 16 ~ 6-8 ~ 2-4 Relative accuracy 100% no loss 99% 50-95%
  • 11. © 2019 Synopsys Linear post-training quantization 5 Techniques – A. Neural Networks Are Error-Tolerant Convert floating point pretrained models to Dynamic Fixed Point 0 1 1 0 1 1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 Fixed Point Dynamic Fixed Point 0 1 1 0 1System Exponent Group 1 Exponent Group 2 Exponent * [Courbariaux,2019]
  • 12. © 2019 Synopsys Linear post-training quantization 5 Techniques – A. Neural Networks Are Error-Tolerant Dynamic Fixed-Point Quantization allows running neural networks with 8 bit weights and activations across the board 32 bit float baseline 8 bit fixed point * [Nvidia,2017]
  • 13. © 2019 Synopsys Linear post-training quantization 5 Techniques – A. Neural Networks Are Error-Tolerant How to optimally choose: dynamic exponent groups, saturation thresholds, weight and activation exponents? Min-max scaling throws away small values A saturation threshold better represents small values, but clips large values * [Nvidia,2017]
  • 14. © 2019 Synopsys Linear trained quantization 5 Techniques – A. Neural Networks Are Error-Tolerant Floating point models are a bad initializer for low-precision fixed-point. Trained quantization from scratch automates heuristic-based optimization. Quantizing weights and activations with straight- through estimators, allowing back-prop + Train saturation range for activations Forward Backward * PACT, Parametrized Clipping Activation [Choi,2019] *
  • 15. © 2019 Synopsys Linear trained quantization 5 Techniques – A. Neural Networks Are Error-Tolerant Good accuracy down to 2b Graceful performance degradation 0.85 0.9 0.95 1 1.05 CIFAR10 SVHN AlexNet ResNet18 ResNet50 full precision 5b 4b 3b 2b * [Choi,2018] Relativebenchmark accuracyvsfloatbaseline*
  • 16. © 2019 Synopsys Non-linear trained quantization – codebook clustering 5 Techniques – A. Neural Networks Are Error-Tolerant Clustered, codebook quantization can be optimally trained. This only reduces bandwidth, computations are still in floating point. * [Han,2015]
  • 17. © 2019 Synopsys B. Neural Networks Are Over- Dimensioned & Redundant 17
  • 18. © 2019 Synopsys Pruning Neural Networks 5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant Pruning removes unnecessary connections in the neural network. Accuracy is recovered through retraining the pruned network * [Han,2015]
  • 19. © 2019 Synopsys Low Rank Singular Value Decomposition (SVD) in DNNs 5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant Many singular values are small and can be discarded * [Xue,2013] 𝑨 = 𝑼 𝜮 𝑽 𝑻 𝑨 ≅ 𝑼′𝜮′𝑽′ 𝑻 = 𝑵𝑼
  • 20. © 2019 Synopsys Low Rank Canonical Polyadic (CP) decomp. in CNNs 5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant Convert a large convolutional filter in a triplet of smaller filters * [Astrid,2017]
  • 21. © 2019 Synopsys Basic example: Combining SVD, pruning and clustering 5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant 11x model compression in a phone-recognition LSTM 0 2 4 6 8 10 12 Base P SVD SVD+P P+C SVD+P+C LSTMCompressionRate (P) (C) * [Goetschalckx,2018] P = Pruning SVD = Singular Value Decomposition C = Clustering / Codebook Compression
  • 22. © 2019 Synopsys C. Neural Networks have Sparse & Correlated Intermediate Results 22
  • 23. © 2019 Synopsys Sparse feature-map compression 5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps Feature map bandwidth dominates in modern CNNs 0 2 4 6 8 10 12 Coefficient BW Feature Map BW BWinMObileNet-V1[MB] 1x1, 32 3x3DW, 32 1x1, 64 32
  • 24. © 2019 Synopsys Sparse feature-map compression 5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps ReLU activation introduces 50-90% zero-valued numbers in intermediate feature maps -5 4 12 -10 0 17 -1 3 2 0 4 12 0 0 17 0 3 2 ReLU activation 8b Features 8b Features
  • 25. © 2019 Synopsys Sparse feature-map compression 5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps Hardware support for multi-bit Huffman encoding allows up to 2x bandwidth reduction in typical networks. Zero-runlength encoding as in [Chen, 2016], Huffman encoding as in [Moons, 2017]. Code table: zero → 2'b00; nonzero < 16 → 2'b01 + 4'b WORD; other nonzero → 1'b1 + 8'b WORD. The nine post-ReLU 8b features above (0 4 12 0 0 17 0 3 2, 72b raw) encode into 41b.
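The slide's code table is easy to check numerically; the sketch below just reproduces that arithmetic (the actual hardware schemes in [Chen,2016] and [Moons,2017] differ in the details of run-length grouping and codeword assignment).

```python
def encoded_bits(values):
    """Bit cost under the slide's simple variable-length code:
    zero -> 2 bits ('00'); nonzero < 16 -> 2 + 4 bits ('01' + 4b word);
    any other value -> 1 + 8 bits ('1' + full 8b word)."""
    total = 0
    for v in values:
        if v == 0:
            total += 2
        elif 0 < v < 16:
            total += 6
        else:
            total += 9
    return total

fmap = [0, 4, 12, 0, 0, 17, 0, 3, 2]   # post-ReLU feature values from the slide
print(f"{encoded_bits(fmap)}b compressed vs {len(fmap) * 8}b uncompressed")
# -> 41b compressed vs 72b uncompressed
```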
  • 26. © 2019 Synopsys Correlated feature-map compression 5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps Intermediate features in the same channel-plane are highly correlated. [Figures: example intermediate feature maps at scale 1 and scale 9 in a ReLU-less YOLOv2.]
  • 27. © 2019 Synopsys Correlated feature-map compression 5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps Super-linear, correlation-based extended bit-plane compression allows feature-map compression even on non-sparse data. Example: correlated values 16, 20, 20, 20, 28, 28, 28, 99 become delta values 0, 4, 0, 0, 8, 0, 0, 71, i.e. mostly zero values plus a few non-zero ones. * [Cavigelli,2018]
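A sketch of the delta step only, assuming each value is predicted by its left neighbour; the extended bit-plane compression in [Cavigelli,2018] adds bit-plane coding on top of these deltas and handles the first value differently than this simplified version.

```python
import numpy as np

def delta_encode(row):
    """Difference each value against its left neighbour; spatially correlated rows
    turn into mostly-zero deltas, giving compressible data even without ReLU sparsity."""
    deltas = np.empty_like(row)
    deltas[0] = row[0]                 # first value kept as-is in this sketch
    deltas[1:] = row[1:] - row[:-1]
    return deltas

row = np.array([16, 20, 20, 20, 28, 28, 28, 99])   # correlated values from the slide
deltas = delta_encode(row)
print("deltas:", deltas.tolist())
print(f"zero fraction: raw {np.mean(row == 0):.0%} -> delta {np.mean(deltas == 0):.0%}")
```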
  • 28. © 2019 Synopsys Correlated feature-map compression 5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps Correlated compression outperforms sparsity-based compression. [Chart: compression rate (roughly 1x–3x) for MobileNet, ResNet-50, YOLOv2 VOC and VGG-16, sparsity-based vs. correlation-based.]
  • 29. © 2019 Synopsys Conclusion – Bringing it All Together
  • 30. © 2019 Synopsys A first-order analysis on ResNet-50 5 Techniques – Conclusion: bringing it all together A first-order energy model for Neural Network Inference Assume: • Quadratic energy scaling / MAC when going from 32 to 8 bit. • Linear energy saving / read-write in DDR/SRAM when going from 32 to 8 bit • 50% of coefficients zero when pruning • 50% compute reduction under decomposition • 50% of activations can be compressed *[Han,2015] *
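A toy version of such a first-order model. The per-operation energies and the flat traffic-reduction factors below are placeholders (the slide's own accounting splits weight and activation traffic), so the printed percentages illustrate how the assumptions compose rather than reproducing the chart values on the following slides.

```python
# Placeholder per-operation energies for 32-bit operations (pJ); illustrative only.
E_MAC32, E_SRAM32, E_DDR32 = 4.0, 5.0, 640.0
# Per-frame workload, roughly the ResNet-50 numbers quoted on the next slides.
MACS, SRAM_WORDS, DDR_WORDS = 3.6e9, 250e6, 16e6

def frame_energy(bits, compute_scale=1.0, traffic_scale=1.0):
    mac_e = E_MAC32 * (bits / 32.0) ** 2            # assumed quadratic MAC scaling
    mem_scale = bits / 32.0                         # assumed linear memory scaling
    return (MACS * compute_scale * mac_e
            + (SRAM_WORDS * E_SRAM32 + DDR_WORDS * E_DDR32) * traffic_scale * mem_scale)

baseline = frame_energy(bits=32)
steps = {
    "32b float":                    frame_energy(32),
    "A. 8b fixed":                  frame_energy(8),
    "B. + decomposition & pruning": frame_energy(8, compute_scale=0.25, traffic_scale=0.5),
    "C. + feature-map compression": frame_energy(8, compute_scale=0.25, traffic_scale=0.375),
}
for name, e in steps.items():
    print(f"{name:30s} {e / baseline:6.1%} of baseline energy")
```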
  • 31. © 2019 Synopsys A first-order analysis on ResNet-50 5 Techniques – Conclusion: bringing it all together When all model data is stored in DRAM, optimized ResNet-50 is 10x more efficient than its plain 32b counterpart. Workload: O(65MB) DDR / frame, O(1GB) SRAM / frame, O(3.6G) MACs / frame. [Chart: relative energy consumption 100% (32b float) → 22% (A. 8b fixed) → 16% (B. decomposition + pruning) → 11% (C. feature-map compression), i.e. roughly 10x overall.]
  • 32. © 2019 Synopsys A first-order analysis on ResNet-50 5 Techniques – Conclusion: bringing it all together In a system with sufficient on-chip SRAM, optimized ResNet-50 is 12.5x more efficient than its plain 32b counterpart. Workload: O(0MB) DDR / frame, O(1GB) SRAM / frame, O(3.6G) MACs / frame. [Chart: relative energy consumption 100% (32b float) → 15% (A. 8b fixed) → 9% (B. decomposition + pruning) → 8% (C. feature-map compression), i.e. roughly 12.5x overall.]
  • 33. © 2019 Synopsys For More Information Visit the Synopsys booth for demos on Automotive ADAS, Virtual Reality & more. EV6x Embedded Vision Processor IP with Safety Enhancement Package. Join Synopsys' EV Seminar on Thursday, "Navigating Embedded Vision at the Edge": • Thursday, May 23 • Santa Clara Convention Center • Doors open 8 AM • Sessions on EV6x Vision Processor IP, Functional Safety, Security, OpenVX… • Register via the EV Alliance website or at the Synopsys booth. [Badge: Best Processor award]
  • 34. © 2019 Synopsys References [Han, 2015, 2016] https://arxiv.org/abs/1510.00149 ; https://arxiv.org/abs/1602.01528 [Xue, 2013] https://www.microsoft.com/en-us/research/wp-content/uploads/2013/01/svd_v2.pdf [Nvidia, 2017] http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf [Choi, 2018, 2019] https://arxiv.org/abs/1805.06085 ; https://www.ibm.com/blogs/research/2019/04/2-bit-precision/ [Goetschalckx, 2018] https://www.sigmobile.org/mobisys/2018/workshops/deepmobile18/papers/Efficiently_Combining_SVD_Pruning_Clustering_Retraining.pdf [Astrid, 2017] https://arxiv.org/abs/1701.07148 [Moons, 2017] https://ieeexplore.ieee.org/abstract/document/7870353 [Chen, 2016] http://eyeriss.mit.edu/ [Cavigelli, 2018] https://arxiv.org/abs/1810.03979 [Courbariaux, 2014] https://arxiv.org/pdf/1412.7024.pdf Embedded Vision Summit -- Bert Moons -- 5+ Techniques for Efficient Implementations of Neural Networks, May 2019