"Techniques to Reduce Power Consumption in Embedded DNN Implementations," a Presentation from Cadence
1. Copyright © 2017 Cadence Design Systems 1
Samer Hijazi
May 2017
Techniques to Reduce Power Consumption in Embedded DNN Implementations
2. Copyright © 2017 Cadence Design Systems 2
Cadence’s mission
• Enable better, faster, cooler silicon systems sooner
Imaging/video recognition
• Strong driver for creating advanced SoCs
Neural networks are a crucial innovation
• But they need a breakthrough in efficiency
Motivation
3. Copyright © 2017 Cadence Design Systems 3
Typical Computer Vision Problems
• Classification (e.g., ImageNet and GTSRB)
• Detection
  • Draw a bounding box around each ROI
• Pixel-by-pixel segmentation
  • For a 1080p image, we have 2M pixels in and out!
4. Copyright © 2017 Cadence Design Systems 4
Why Haven’t DNNs Gone Embedded Yet?
5. Copyright © 2017 Cadence Design Systems 5
CNN Evolution
• Today’s deep learning industry motto is “Deeper is Better”
Network                  | Application                   | Conv Layers
LeNet-5 for MNIST (1998) | Handwritten Digit Recognition | 7
AlexNet (2012)           | ImageNet                      | 8
Deepface (2014)          | Face recognition              | 7
VGG-19 (2014)            | ImageNet                      | 19
GoogLeNet (2015)         | ImageNet                      | 22
ResNet (2015)            | ImageNet                      | 152
Inception-ResNet (2016)  | ImageNet                      | 246
6. Copyright © 2017 Cadence Design Systems 6
• Today’s state-of-the-art hardware consumes ~40 W/TMAC
• ~4 TMACs are needed for many DNN real-time applications
• e.g., glass surround analysis, gesture HMI
• This means ~160 W!
• Even an O(10^2) improvement in efficiency is not enough
The DNN Power Question
Courtesy of Dr. Stephen Hicks, Nuffield
Department of Clinical Neurosciences,
University of Oxford
Embedded device power budgets and form-factors
cannot accommodate the current trend of DNNs!
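The power estimate on this slide is simple arithmetic, sketched below. The slide's numbers are used as given; reading "TMAC" as tera-MACs per second and the helper name `power_watts` are assumptions made for illustration.

```python
# Power needed for a real-time DNN workload, using the slide's round numbers.
WATTS_PER_TMAC = 40.0  # ~40 W per TMAC (assumed to mean TMAC/s), from the slide
REQUIRED_TMACS = 4.0   # ~4 TMACs for many real-time applications, from the slide


def power_watts(tmacs, watts_per_tmac, efficiency_gain=1.0):
    """Power draw after improving efficiency by a factor of `efficiency_gain`."""
    return tmacs * watts_per_tmac / efficiency_gain


baseline = power_watts(REQUIRED_TMACS, WATTS_PER_TMAC)

# Show what each order-of-magnitude efficiency gain would leave us with.
for gain in (1, 10, 100, 1_000, 10_000):
    watts = power_watts(REQUIRED_TMACS, WATTS_PER_TMAC, gain)
    print(f"{gain:>6}x more efficient -> {watts:.3f} W")
```

With a 100x gain the workload still draws 1.6 W, which is the sense in which an O(10^2) improvement falls short of typical embedded budgets.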
7. Copyright © 2017 Cadence Design Systems 7
How to Save Power?
CNNs use an excessive number of multiplies and data moves per pixel!
• To solve this problem we can do four things
1. Optimize network architecture
2. Optimize the problem definition
3. Minimize the number of bits per computation
4. Optimize CNN hardware (not covered in this talk)
8. Copyright © 2017 Cadence Design Systems 8
1. Optimize network architecture
2. Optimize the problem definition
3. Minimize the number of bits per computation
9. Copyright © 2017 Cadence Design Systems 9
Complexity vs. Performance
[Figure: accuracy vs. complexity plot showing the Starter Network, the target recognition rate, the Embedded Device Budget, and the Cloud Budget.]
10. Copyright © 2017 Cadence Design Systems 10
Automatic Optimizations of Network Structure
• The ingredients
• A superset network architecture with many knobs to dial
• CactusNet
• Measure redundancy vs. accuracy
• Gradually trim CactusNet
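The slides do not spell out how the trimming is done. The sketch below is only one plausible reading of "measure redundancy vs. accuracy, then gradually trim": a greedy loop over hypothetical trim candidates. The function name, the candidate names, every number, and the assumption that accuracy losses add up are all illustrative and not Cadence's procedure.

```python
def greedy_trim(base_accuracy, candidates, target_accuracy):
    """Greedily remove network parts while staying at or above the target.

    candidates: list of (name, macs_saved, accuracy_loss) tuples, where the
    loss values would come from a sensitivity analysis (hypothetical here).
    Assumes losses are additive, which is a simplification.
    Returns (removed_names, final_accuracy, total_macs_saved).
    """
    # Try the candidates that cost the least accuracy per MAC saved first.
    ranked = sorted(candidates, key=lambda c: c[2] / c[1])
    removed, accuracy, saved = [], base_accuracy, 0.0
    for name, macs_saved, accuracy_loss in ranked:
        if accuracy - accuracy_loss < target_accuracy:
            continue  # removing this part would miss the target
        removed.append(name)
        accuracy -= accuracy_loss
        saved += macs_saved
    return removed, accuracy, saved


# Hypothetical sensitivity results: (part, million MACs saved, accuracy loss in %).
candidates = [
    ("branch_7x7", 40.0, 0.30),
    ("branch_5x5", 25.0, 0.05),
    ("branch_3x3", 10.0, 0.40),
    ("branch_1x1", 2.0, 0.01),
]
removed, accuracy, saved = greedy_trim(99.5, candidates, target_accuracy=99.2)
print(removed, round(accuracy, 2), saved)
```

In practice the network would be retrained after each trim, so the per-part losses would be re-measured rather than summed.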
11. Copyright © 2017 Cadence Design Systems 11
CactusNet
• A general CNN reference architecture with lots of control knobs
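[Figure: CactusNet drawn as a chain of Cactus Modules (M) and concatenations (C): C M M M M M C M M M M C ... Legend: pooling layer, C = Concatenate, M = Cactus Module.]
[Figure: Cactus Module. Stages labelled 1x1 BN ReLU, 1x1 BN ReLU, 7/3 BN ReLU, 5/3 BN ReLU, 3 BN ReLU, 1 BN ReLU, Concat, 1x1 BN ReLU, Concat, 1x1 BN ReLU, 1x1 BN ReLU.]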
12. Copyright © 2017 Cadence Design Systems 12
Cactus Network Compression Procedure
[Figure: flow diagram with four numbered steps (1 to 4). Blocks shown: Labelled Dataset, Convolutional Neural Network Training, Learned Weights, Sensitivity Analysis, Accuracy vs. Complexity Model, Optimization, Compressed Network Architecture, Transfer Learning, Compressed Network Initialization.]
13. Copyright © 2017 Cadence Design Systems 13
“Replicants”
• Propagate learned knowledge between networks with different
architectures. Why?
• Accelerate the creation of a family of networks
[Figure: performance vs. complexity, showing the Starter Network and a series of Replicants derived from it.]
14. Copyright © 2017 Cadence Design Systems 14
German Traffic Sign Recognition Benchmark (GTSRB)
• 51,840 images of German road signs in 43 classes
• CactusNet outperforms every other known network on GTSRB
[Figure: Performance vs. Complexity. x-axis: million MACs (log scale, 1 to 10,000); y-axis: % correct classification rate (98.5 to 100.1). Labels: CNN Optimization, Baseline (Multi-Scale Replica), Hinge Loss CNN, Committee of CNNs, CactusNet. Annotations: 37x and 30x.]
Multi-Scale: Pierre Sermanet and Yann LeCun, "Traffic Sign Recognition with Multi-Scale Convolutional Networks", IEEE IJCNN, 2011
Committee of CNNs: Ciresan, D.; Meier, U.; Schmidhuber, J., "Multi-column deep neural networks for image classification," IEEE CVPR, 2012
Hinge Loss CNN: Jin, Junqi, Kun Fu, and Changshui Zhang. "Traffic sign recognition with hinge loss trained convolutional neural networks." IEEE Transactions on Intelligent Transportation Systems 15.5 (2014): 1991-2000
15. Copyright © 2017 Cadence Design Systems 15
Results ImageNet (2012)
• ResNet-50 has the best accuracy/complexity ratio on ImageNet
• CactusNet outperforms ResNet-50 on accuracy and complexity
Set        | Num of images | Max size  | Min size
train      | 1281167       | 3456x2304 | 60x60
validation | 50000         | 3657x2357 | 80x60
test       | 100000        | 3464x2880 | 63x84
[Figure: Cactus Optimization*]
*CNN accuracies for pre-trained networks from http://www.vlfeat.org/matconvnet/pretrained/
16. Copyright © 2017 Cadence Design Systems 16
1. Optimize network architecture
2. Optimize the problem definition
3. Minimize the number of bits per computation
17. Copyright © 2017 Cadence Design Systems 17
How to Reduce the Problem Size
for Pixel Segmentation?
18. Copyright © 2017 Cadence Design Systems 18
KITTI Road Segmentation Dataset
• 289 training and 290 test images of size 375x1242 pixels
• Segmenting one image means solving ~466K per-pixel classification problems
• By exploiting correlations, we made the problem 22x smaller
Network | Precision | Recall | MaxF  | Road Accuracy | Overall Accuracy | GMACs¹
Cadence | 95.3%     | 94.4%  | 94.8% | 96.3%         | 98.9%            | 10.6
FCN²    | 94.0%     | 93.7%  | 93.8% | x             | x                | 105.3
SegNet³ | x         | x      | x     | 97.4%         | 89.7%            | 112.5
RPP⁴    | 95.9%     | 96.9%  | 96.4% |               |                  | 404.5
1. GMACs to process input of size 256x1280
2. http://lmb.informatik.uni-freiburg.de/Publications/2016/OB16b/oliveira16iros.pdf
3. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labeling; Vijay Badrinarayanan, Ankur Handa, Roberto Cipolla; Machine Intelligence Lab, Department of Engineering, University of Cambridge, UK.
4. Anonymous submission on KITTI server, currently at number 3
http://www.cvlibs.net/datasets/kitti/eval_road_detail.php?result=581d48d10c399e61fe19182cd3483e628c46e893
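The problem-size and compute figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only numbers from the slide; the variable names are illustrative.

```python
# One classification problem per pixel of a KITTI road image.
HEIGHT, WIDTH = 375, 1242
per_pixel_problems = HEIGHT * WIDTH  # 465,750, i.e. the slide's "466K"

# The slide reports a 22x reduction from exploiting correlations.
reduced_problems = per_pixel_problems / 22

# GMACs per network, copied from the table on this slide.
gmacs = {"Cadence": 10.6, "FCN": 105.3, "SegNet": 112.5, "RPP": 404.5}

# How many times more compute each network needs than the Cadence network.
ratios = {name: value / gmacs["Cadence"] for name, value in gmacs.items()}

print(f"per-pixel problems: {per_pixel_problems}")
print(f"after 22x reduction: about {reduced_problems:.0f}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the Cadence GMACs")
```

On these numbers the other networks need roughly 10x to 38x the compute of the Cadence network for the same input size.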
19. Copyright © 2017 Cadence Design Systems 19
1. Optimize network architecture
2. Optimize the problem definition
3. Minimize the number of bits per computation
20. Copyright © 2017 Cadence Design Systems 20
Minimizing Number Formats in DNN
• Two quantization approaches:
1. Post-training quantization
2. Quantization during training
• Quantization during training requires changing the training infrastructure and process
AlexNet Top-1 Error [%]
System             | Methodology                                                                                                            | 32b Float | 8b Fixed
UC Davis Ristretto | Dynamic FXP; Minifloat (16b floating-point); Multiplier-free (shifts only); Fine tuning (during training quantization) | 43.1      | 43.8
Google TensorFlow  | Static FXP                                                                                                             | 42.2      | 49.4
Cadence            | Post training quantization based on Dynamic FXP; fully fixed-point C-modeling, 32b accumulators                        | 42.2      | 42.5
21. Copyright © 2017 Cadence Design Systems 21
Fine-Tuning Quantization Benefit
From: Gysel, Philipp. "Ristretto: Hardware-oriented approximation of
convolutional neural networks." arXiv preprint arXiv:1605.06402 (2016).
[Figure: accuracy with fine tuning vs. post training only, for MNIST, CIFAR, and ImageNet.]
Average 0.73% accuracy improvement after fine tuning
22. Copyright © 2017 Cadence Design Systems 22
“Static” Fixed-Point
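4D Quantization
[Figure: input feature maps $x_1, x_2, \dots, x_j, \dots, x_M$; output feature maps $y_1, y_2, \dots, y_i, \dots, y_N$; 4D convolutional kernel made of 2D filters $F_{ij}$, drawn as rows $F_{11}, F_{12}, \dots, F_{1j}, \dots, F_{1M}$ through $F_{i1}, F_{i2}, \dots, F_{ij}, \dots, F_{iM}$.]
8b split: 1 sign bit, 3 int bits, 4 frac bits
Bit layout as drawn: $s + b_1 b_0 . b_{-1} b_{-2} b_{-3} b_{-4} b_{-5}$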
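A minimal sketch of what a single static 8b format does to a value, using the 1 sign / 3 int / 4 frac split stated on this slide. The function name, the two's-complement range, the saturation behaviour, and Python's round-half-to-even rounding are assumptions; the slides do not specify them.

```python
def quantize_fixed(value, int_bits, frac_bits):
    """Round `value` to signed fixed point and return it as a float.

    Uses 1 sign bit plus `int_bits` integer bits and `frac_bits` fractional
    bits, assumes a two's-complement code range, and saturates at the limits.
    """
    scale = 1 << frac_bits                 # codes per unit, 16 for 4 frac bits
    magnitude_bits = int_bits + frac_bits  # all bits except the sign
    code_min = -(1 << magnitude_bits)      # -128 for an 8b word
    code_max = (1 << magnitude_bits) - 1   # +127 for an 8b word
    code = round(value * scale)            # nearest representable code
    code = max(code_min, min(code_max, code))
    return code / scale


# Static split from the slide: 3 int bits, 4 frac bits, used everywhere.
for v in (0.3, 0.01, 10.0, -9.5):
    print(v, "->", quantize_fixed(v, int_bits=3, frac_bits=4))
```

Small values lose most of their precision (0.01 becomes 0.0) and large values saturate (10.0 becomes 7.9375), which is the weakness of using one format for every weight and feature map.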
23. Copyright © 2017 Cadence Design Systems 23
“Dynamic” Fixed-Point
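Optimize the definition of int/frac bits over finer subsets of the processing chain
3D Quantization
[Figure: input feature maps $x_1, x_2, \dots, x_j, \dots, x_M$; output feature maps $y_1, y_2, \dots, y_i, \dots, y_N$; each 3D convolutional kernel (one row of filters $F_{i1}, F_{i2}, \dots, F_{ij}, \dots, F_{iM}$) has its own 8b split.]
3D Conv Kernel #i: 1 sign bit, 2 int bits, 5 frac bits: $s + b_1 b_0 . b_{-1} b_{-2} b_{-3} b_{-4} b_{-5}$
3D Conv Kernel #1: 1 sign bit, 4 int bits, 3 frac bits: $s + b_3 b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3}$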
24. Copyright © 2017 Cadence Design Systems 24
“Dynamic” Fixed-Point
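There is a trade-off between performance and overhead.
2D Quantization
[Figure: input feature maps $x_1, x_2, \dots, x_j, \dots, x_M$; output feature maps $y_1, y_2, \dots, y_i, \dots, y_N$; each 2D filter $F_{ij}$ has its own int / frac split.]
int / frac splits shown: 4 / 3, 3 / 4, 2 / 5, 1 / 6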
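The idea behind "dynamic" fixed point can be sketched as choosing the int/frac split per group of weights instead of once for the whole network. Everything below is illustrative: the helper names, the rule of picking the smallest integer width that covers the largest magnitude, and the weight values are assumptions, not the method used in the slides.

```python
def quantize_fixed(value, int_bits, frac_bits):
    """Signed fixed-point rounding with saturation (two's-complement range)."""
    scale = 1 << frac_bits
    limit = 1 << (int_bits + frac_bits)
    code = max(-limit, min(limit - 1, round(value * scale)))
    return code / scale


def choose_int_bits(values, total_bits=8):
    """Smallest integer-bit count whose range covers the largest magnitude."""
    peak = max(abs(v) for v in values)
    int_bits = 0
    while int_bits < total_bits - 1 and peak >= (1 << int_bits):
        int_bits += 1
    return int_bits


def mean_abs_error(values, int_bits, total_bits=8):
    """Average quantization error for one group under a given split."""
    frac_bits = total_bits - 1 - int_bits  # one bit is reserved for the sign
    errors = [abs(v - quantize_fixed(v, int_bits, frac_bits)) for v in values]
    return sum(errors) / len(errors)


# Hypothetical weights for two kernels with very different ranges.
kernels = {
    "kernel_1": [5.2, -3.7, 1.1, 0.4],
    "kernel_i": [0.21, -0.13, 0.05, 0.3],
}

# Static: one split for all weights. Dynamic: one split per kernel.
all_weights = [w for ws in kernels.values() for w in ws]
static_bits = choose_int_bits(all_weights)
for name, ws in kernels.items():
    dynamic_bits = choose_int_bits(ws)
    print(name,
          f"static {static_bits} int bits, error {mean_abs_error(ws, static_bits):.5f};",
          f"dynamic {dynamic_bits} int bits, error {mean_abs_error(ws, dynamic_bits):.5f}")
```

The overhead side of the trade-off is that every group needs its own stored split and a re-alignment shift when results are combined, and the number of groups grows as the granularity moves from 4D to 3D to 2D.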
25. Copyright © 2017 Cadence Design Systems 25
Cadence Quantization of Network Zoo
Network         | Top-5 Error [%], FLP | Top-5 Error [%], 8b FXP* | Top-1 Error [%], FLP | Top-1 Error [%], 8b FXP*
GoogLeNet       | 13.5                 | 13.5                     | 34.9                 | 35.0
VGG-VeryDeep-19 | 10.7                 | 10.7                     | 29.5                 | 29.7
Cactus-it6      | 8.2                  | 8.5                      | 25.3                 | 26.1
ResNet-50       | 8.3                  | 8.6                      | 25.1                 | 25.8
ResNet-101      | 7.6                  | 8.2                      | 24.0                 | 25.3
ResNet-152      | 7.2                  | 7.5                      | 23.5                 | 24.1
Cactus-it6: ~2x fewer MACs than ResNet-50, ~10x fewer MACs than VGG-19
*Estimates based on 8b fixed-point coefficients & data, 32b floating-point accumulators
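To make the cost of 8b quantization easier to read off, the short sketch below computes the error increase per network from the table values. Only the table's numbers are used; the dictionary layout and names are illustrative.

```python
# (top-5 FLP, top-5 8b FXP, top-1 FLP, top-1 8b FXP), all in percent error,
# copied from the table on this slide.
errors = {
    "GoogLeNet": (13.5, 13.5, 34.9, 35.0),
    "VGG-VeryDeep-19": (10.7, 10.7, 29.5, 29.7),
    "Cactus-it6": (8.2, 8.5, 25.3, 26.1),
    "ResNet-50": (8.3, 8.6, 25.1, 25.8),
    "ResNet-101": (7.6, 8.2, 24.0, 25.3),
    "ResNet-152": (7.2, 7.5, 23.5, 24.1),
}

# Error increase (percentage points) when moving from float to 8b fixed point.
top5_loss = {name: round(e[1] - e[0], 1) for name, e in errors.items()}
top1_loss = {name: round(e[3] - e[2], 1) for name, e in errors.items()}

for name in errors:
    print(f"{name}: top-5 +{top5_loss[name]}, top-1 +{top1_loss[name]}")
```

Every network in the table loses at most 1.3 points of top-1 accuracy and 0.6 points of top-5 accuracy.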
26. Copyright © 2017 Cadence Design Systems 26
CNN 4b Quantization
Method            | Data | Coefficients      | Computational Complexity Savings vs. 8b x 8b | AlexNet Top-1 Error
Conventional      | 8b   | 8b                | -                                            | 42.5%
Hybrid            | 8b   | 8b: 54%, 4b: 46%  | 33%                                          | 43.1%
Theoretical Limit | 8b   | 8b: 0%, 4b: 100%  | 50%                                          |
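One way to read the savings column is with a simple cost model, sketched below. The model assumes an 8b x 4b multiply costs half of an 8b x 8b multiply, which matches the slide's 50% theoretical limit but is otherwise an assumption; the function names are illustrative.

```python
def mac_savings(fraction_macs_4b, relative_cost_4b=0.5):
    """Compute savings vs. all-8b when a fraction of MACs use 4b coefficients.

    Assumes an 8b x 4b MAC costs `relative_cost_4b` of an 8b x 8b MAC.
    """
    return fraction_macs_4b * (1.0 - relative_cost_4b)


def implied_4b_mac_fraction(savings, relative_cost_4b=0.5):
    """Fraction of MACs that must use 4b coefficients to reach `savings`."""
    return savings / (1.0 - relative_cost_4b)


limit = mac_savings(1.0)                  # every MAC uses 4b coefficients
by_count = mac_savings(0.46)              # if 46% of the MACs used 4b coefficients
implied = implied_4b_mac_fraction(0.33)   # MAC share needed for the 33% saving

print(f"theoretical limit: {limit:.0%}")
print(f"savings if 46% of MACs were 4b: {by_count:.0%}")
print(f"MAC share implied by 33% savings: {implied:.0%}")
```

Under this model, 46% of MACs at 4b would save only 23%, so the reported 33% suggests the savings are weighted by MACs per coefficient rather than by coefficient count. That reading is an inference from the model, not something the slide states.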
28. Copyright © 2017 Cadence Design Systems 28
• CNNs will achieve their full potential with an O(10^3) to O(10^4) improvement in power efficiency.
• DSP techniques can decrease CNN compute and data movement by O(10^1) to O(10^2).
Take Away Points
29. Copyright © 2017 Cadence Design Systems 29
Selected References & Resources
Network Compression
[1] Han, Song, Huizi Mao, and William J. Dally. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." arXiv preprint arXiv:1510.00149 (2015).
[2] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 1MB model size." arXiv preprint arXiv:1602.07360 (2016).
Network Quantization
[3] Gysel, Philipp. "Ristretto: Hardware-oriented approximation of convolutional neural networks." arXiv preprint arXiv:1605.06402 (2016).
[4] Warden, P. "How to quantize neural networks with tensorflow." https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
Segmentation
[5] Badrinarayanan, Vijay, Ankur Handa, and Roberto Cipolla. "Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling." arXiv preprint arXiv:1505.07293 (2015).
[6] Oliveira, Gabriel L., Wolfram Burgard, and Thomas Brox. "Efficient deep models for monocular road segmentation." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016.
MatConvNet
[7] Vedaldi, Andrea, and Karel Lenc. "Matconvnet: Convolutional neural networks for matlab." Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015.
Benchmarks
[8] Russakovsky, Olga, et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115.3 (2015): 211-252.
[9] Stallkamp, Johannes, et al. "Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition." Neural networks 32 (2012): 323-332.
30. Copyright © 2017 Cadence Design Systems 30
© 2017 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo and the other Cadence marks found at www.cadence.com/go/trademarks are trademarks or registered trademarks of Cadence Design Systems, Inc. All other trademarks are the property of their respective holders.