"Techniques to Reduce Power Consumption in Embedded DNN Implementations," a Presentation from Cadence

Copyright © 2017 Cadence Design Systems
Samer Hijazi
May 2017

Motivation
Cadence’s mission
• Enable better, faster, cooler silicon systems sooner
Imaging/video recognition
• Strong driver for creating advanced SoCs
Neural networks are a crucial innovation
• But they need a breakthrough in efficiency

Typical Computer Vision Problems
• Classification (e.g., ImageNet and GTSRB)
• Detection
  • Draw a bounding box around each ROI
• Pixel-by-pixel segmentation
  • For a 1080p image, we have 2M pixels in and out!

Why Haven’t DNNs Gone Embedded Yet?

CNN Evolution
• Today’s deep learning industry motto is “Deeper is Better”
Network                   Application                    Conv Layers
LeNet-5 for MNIST (1998)  Handwritten Digit Recognition  7
AlexNet (2012)            ImageNet                       8
Deepface (2014)           Face recognition               7
VGG-19 (2014)             ImageNet                       19
GoogLeNet (2015)          ImageNet                       22
ResNet (2015)             ImageNet                       152
Inception-ResNet (2016)   ImageNet                       246

The DNN Power Question
• Today’s state-of-the-art hardware consumes ~40 W/TMAC
• ~4 TMACs are needed for many DNN real-time applications
  • e.g., glass surround analysis, gesture HMI
• This means ~160 W!
• Even an O(10^2) improvement in efficiency is not enough
[Image courtesy of Dr. Stephen Hicks, Nuffield Department of Clinical Neurosciences, University of Oxford]
Embedded device power budgets and form factors cannot accommodate the current trend of DNNs!
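The arithmetic behind the ~160 W figure can be checked with a minimal sketch. The two constants come from the slide; reading "TMAC" as tera-MACs per second and the helper name `power_watts` are assumptions made for illustration.

```python
# Figures quoted on the slide.
WATTS_PER_TMAC = 40.0   # ~40 W per TMAC (assumed to mean tera-MACs per second)
REQUIRED_TMACS = 4.0    # ~4 TMACs for many real-time DNN applications


def power_watts(tmacs, watts_per_tmac, efficiency_gain=1.0):
    """Power draw when the per-TMAC cost is divided by efficiency_gain."""
    return tmacs * watts_per_tmac / efficiency_gain


baseline = power_watts(REQUIRED_TMACS, WATTS_PER_TMAC)
improved = power_watts(REQUIRED_TMACS, WATTS_PER_TMAC, efficiency_gain=100.0)

print(f"baseline: {baseline:.0f} W")
print(f"after a 100x efficiency gain: {improved:.1f} W")
```

A 100x gain still leaves about 1.6 W. The slide does not state a target budget, so whether that fits depends on the device class in question.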

How to Save Power?
CNNs use an excessive number of multiplies and data moves per pixel!
• To solve this problem, we can do four things:
1. Optimize network architecture
2. Optimize the problem definition
3. Minimize the number of bits per computation
4. Optimize CNN hardware (not covered in this talk)

Complexity vs. Performance
[Figure: accuracy plotted against complexity, marking the target recognition rate, the embedded device budget, the cloud budget, and the starter network.]

Automatic Optimizations of Network Structure
• The ingredients
• A superset network architecture with many knobs to dial
• CactusNet
• Measure redundancy vs. accuracy
• Gradually trim CactusNet
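One possible reading of "measure redundancy vs. accuracy, then gradually trim" is a greedy loop that shrinks one knob at a time while accuracy stays above a floor. The sketch below is not the Cadence procedure: the knob names, the `evaluate` and `complexity` callbacks, and the toy model are all hypothetical, and a real `evaluate` would involve retraining or validating a network.

```python
def trim_network(knobs, evaluate, complexity, min_accuracy):
    """Greedily reduce integer knobs while evaluate(knobs) >= min_accuracy.

    Each round tries lowering every knob by one step and keeps the step
    that saves the most complexity per unit of accuracy lost.
    """
    knobs = dict(knobs)
    while True:
        base_acc = evaluate(knobs)
        base_cost = complexity(knobs)
        best_score, best_trial = None, None
        for name, value in knobs.items():
            if value <= 1:
                continue                      # knob cannot shrink further
            trial = dict(knobs)
            trial[name] = value - 1
            acc = evaluate(trial)
            if acc < min_accuracy:
                continue                      # this step breaks the target
            saved = base_cost - complexity(trial)
            lost = max(base_acc - acc, 1e-9)  # avoid division by zero
            score = saved / lost
            if best_score is None or score > best_score:
                best_score, best_trial = score, trial
        if best_trial is None:
            return knobs                      # no admissible step is left
        knobs = best_trial


# Toy stand-ins (hypothetical): accuracy in percent saturates once "a" >= 4
# and "b" >= 6, so anything above those values is redundant.
def toy_accuracy(k):
    return 90 + min(k["a"], 4) + min(k["b"], 6)


def toy_complexity(k):
    return k["a"] * k["b"]


trimmed = trim_network({"a": 8, "b": 8}, toy_accuracy, toy_complexity,
                       min_accuracy=100)
print(trimmed)
```

On the toy model the loop stops at the smallest knob values that still meet the 100% floor.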

CactusNet
• A general CNN reference architecture with lots of control knobs
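[Figure: CactusNet drawn as a chain "C M M M M M C M M M M C ...", with legend: Pooling Layer, C = Concatenate, M = Cactus Module. Inset "Cactus Module": parallel branches labelled "1x1 BN ReLU", "7/3 BN ReLU", "5/3 BN ReLU", "3 BN ReLU", and "1 BN ReLU", joined by Concat stages that are each followed by "1x1 BN ReLU".]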

Cactus Network Compression Procedure
[Figure: a four-step loop (steps numbered 1 to 4) connecting these blocks: Labelled Dataset, Convolutional Neural Network Training, Learned Weights, Sensitivity Analysis, Accuracy vs. Complexity Model, Optimization, Compressed Network Architecture, Transfer Learning, Compressed Network Initialization.]

“Replicants”
• Propagate learned knowledge between networks with different architectures. Why?
  • Accelerate the creation of a family of networks
[Figure: performance vs. complexity, showing the starter network followed by a series of replicants.]

German Traffic Sign Recognition Benchmark (GTSRB)
• 51,840 images of German road signs in 43 classes
• CactusNet outperforms every other known network on GTSRB
[Figure: Performance vs. Complexity. Y axis: % correct classification rate (98.5 to 100.1). X axis: million MACs (log scale, 1 to 10,000). Labels: CNN Optimization, Committee of CNNs, Hinge Loss CNN, Baseline (Multi-Scale Replica), CactusNet. Annotations: 37x, 30x.]
Multi-Scale: Pierre Sermanet and Yann LeCun, “Traffic Sign Recognition with Multi-Scale Convolutional Networks”, IEEE IJCNN, 2011
Committee of CNNs: Ciresan, D.; Meier, U.; Schmidhuber, J., "Multi-column deep neural networks for image classification," IEEE CVPR, 2012
Hinge Loss CNN: Jin, Junqi, Kun Fu, and Changshui Zhang. "Traffic sign recognition with hinge loss trained convolutional neural networks." IEEE Transactions on Intelligent Transportation Systems 15.5 (2014): 1991-2000

Results ImageNet (2012)
• ResNet-50 has the best accuracy/complexity ratio on ImageNet
• CactusNet outperforms ResNet-50 on accuracy and complexity
Set         Num of images  Max size   Min size
train       1281167        3456x2304  60x60
validation  50000          3657x2357  80x60
test        100000         3464x2880  63x84
[Figure: Cactus Optimization]
*CNN accuracies for pre-trained networks from http://www.vlfeat.org/matconvnet/pretrained/

How to Reduce the Problem Size for Pixel Segmentation?

KITTI Road Segmentation Dataset
• 289 training and 290 test images of size 375x1242 pixels
• To segment an image, solve 466K classification problems
• Exploiting correlations, we made the problem 22x smaller
Network   Precision  Recall  MaxF   Road Accuracy  Overall Accuracy  GMACs¹
Cadence   95.3%      94.4%   94.8%  96.3%          98.9%             10.6
FCN²      94.0%      93.7%   93.8%  x              x                 105.3
SegNet³   x          x       x      97.4%          89.7%             112.5
RPP⁴      95.9%      96.9%   96.4%                                   404.5
1. GMACs to process input of size 256x1280
2. http://lmb.informatik.uni-freiburg.de/Publications/2016/OB16b/oliveira16iros.pdf
3. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labeling; Vijay Badrinarayanan, Ankur Handa, Roberto Cipolla; Machine Intelligence Lab, Department of Engineering, University of Cambridge, UK.
4. Anonymous submission on KITTI server, currently at number 3: http://www.cvlibs.net/datasets/kitti/eval_road_detail.php?result=581d48d10c399e61fe19182cd3483e628c46e893
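The problem-size and compute figures on this slide can be reproduced with a short calculation. The numbers are taken from the slide; the variable names are illustrative, and dividing the pixel count by 22 is only a back-of-the-envelope reading of "22x smaller".

```python
# Image size and reduction factor quoted on the slide.
HEIGHT, WIDTH = 375, 1242
REDUCTION = 22

pixels = HEIGHT * WIDTH                 # one classification problem per pixel
reduced_problems = pixels / REDUCTION   # rough size after the 22x reduction

# GMACs column of the comparison table (input of size 256x1280).
gmacs = {"Cadence": 10.6, "FCN": 105.3, "SegNet": 112.5, "RPP": 404.5}
ratio_vs_cadence = {name: value / gmacs["Cadence"] for name, value in gmacs.items()}

print(f"{pixels} per-pixel problems, about {reduced_problems:.0f} after reduction")
for name, ratio in ratio_vs_cadence.items():
    print(f"{name}: {ratio:.1f}x the Cadence GMACs")
```

375 x 1242 is 465,750, which the slide rounds to 466K. The ratios show the other listed networks needing roughly 10x to 38x the compute of the Cadence network.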

Minimizing Number Formats in DNN
• Two quantization approaches:
1. Post training quantization
2. During training quantization
• This requires changing the training infrastructure and process
AlexNet Top-1 Error [%]
System              Methodology                                      32b Float  8b Fixed
UC Davis Ristretto  Dynamic FXP                                      43.1       43.8
                    Minifloat (16b floating-point)
                    Multiplier-free (shifts only)
                    Fine tuning (during training quantization)
Google TensorFlow   Static FXP                                       42.2       49.4
Cadence             Post training quantization based on Dynamic FXP  42.2       42.5
                    Fully fixed-point C-modeling, 32b accumulators

Fine-Tuning Quantization Benefit
Average 0.73% accuracy improvement after fine tuning
[Figure: "With Fine Tuning" compared against "Post Training Only" on MNIST, CIFAR, and ImageNet.]
From: Gysel, Philipp. "Ristretto: Hardware-oriented approximation of convolutional neural networks." arXiv preprint arXiv:1605.06402 (2016).

“Static” Fixed-Point
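4D Quantization
8b split: 1 sign bit, 3 int bits, 4 frac bits
[Figure: input feature maps x_1, x_2, ..., x_j, ..., x_M and output feature maps y_1, y_2, ..., y_i, ..., y_N, connected by 2D filters F_11 ... F_1M through F_i1 ... F_iM that together form the 4D convolutional kernel. A single 8b split is shown for the whole kernel, with the bit pattern s + b_1 b_0 . b_-1 b_-2 b_-3 b_-4 b_-5.]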
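A minimal sketch of static fixed-point quantization with the slide's 8b split (1 sign bit, 3 int bits, 4 frac bits) follows. The two's-complement code range, the saturation behaviour, and the example weights are assumptions made for illustration; the slide does not specify rounding or overflow handling.

```python
def quantize(value, int_bits, frac_bits):
    """Round to signed fixed point (1 sign bit + int_bits + frac_bits)."""
    scale = 1 << frac_bits                  # 2**frac_bits steps per unit
    magnitude_bits = int_bits + frac_bits   # sign bit is not counted here
    lowest = -(1 << magnitude_bits)         # two's-complement minimum code
    highest = (1 << magnitude_bits) - 1     # two's-complement maximum code
    code = round(value * scale)             # Python rounds halves to even
    code = max(lowest, min(highest, code))  # saturate instead of wrapping
    return code / scale


# Static fixed point: the same 3 int / 4 frac split for every coefficient.
weights = [0.3, -1.26, 9.0, -9.0]           # made-up example values
quantized = [quantize(w, int_bits=3, frac_bits=4) for w in weights]
print(quantized)
```

With this split the step size is 1/16 and the representable range is -8.0 to 7.9375, so 0.3 becomes 0.3125 while 9.0 saturates at 7.9375.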

“Dynamic” Fixed-Point
Optimize definition of int/frac bits over finer subsets of processing chain
3D Quantization
[Figure: each 3D convolutional kernel (one row of 2D filters) gets its own 8b split.]
3D Conv Kernel #1 (F_11, F_12, ..., F_1j, ..., F_1M): 1 sign, 4 int, 3 frac bits (s + b_3 b_2 b_1 b_0 . b_-1 b_-2 b_-3)
3D Conv Kernel #i (F_i1, F_i2, ..., F_ij, ..., F_iM): 1 sign, 2 int, 5 frac bits (s + b_1 b_0 . b_-1 b_-2 b_-3 b_-4 b_-5)
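One way to pick a per-kernel split is to give each 3D kernel just enough integer bits to cover its largest coefficient and spend the rest on fraction bits. This is a sketch under that assumption, not necessarily how the presented tool chooses its splits; `choose_split`, the kernel names, and the coefficient values are hypothetical.

```python
def quantize(value, int_bits, frac_bits):
    """Round to signed fixed point (1 sign bit + int_bits + frac_bits)."""
    scale = 1 << frac_bits
    magnitude_bits = int_bits + frac_bits
    lowest, highest = -(1 << magnitude_bits), (1 << magnitude_bits) - 1
    code = max(lowest, min(highest, round(value * scale)))
    return code / scale


def choose_split(values, magnitude_bits=7):
    """Smallest int_bits whose range covers max |value|; the rest are frac bits."""
    peak = max(abs(v) for v in values)
    int_bits = 0
    while int_bits < magnitude_bits and (1 << int_bits) <= peak:
        int_bits += 1
    return int_bits, magnitude_bits - int_bits


# Two made-up 3D kernels with very different coefficient ranges.
kernels = {
    "kernel_1": [12.5, -7.25, 3.0],
    "kernel_i": [3.2, -0.07, 0.41],
}

splits = {name: choose_split(coeffs) for name, coeffs in kernels.items()}
for name, coeffs in kernels.items():
    int_bits, frac_bits = splits[name]
    values = [quantize(c, int_bits, frac_bits) for c in coeffs]
    print(f"{name}: {int_bits} int / {frac_bits} frac -> {values}")
```

For these values the wide-range kernel lands on a 4 int / 3 frac split and the narrow-range kernel on 2 int / 5 frac, the same two splits shown on the slide. The narrow kernel therefore keeps finer resolution than a single shared split would allow.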

“Dynamic” Fixed-Point
There is a trade-off between performance and overhead.
2D Quantization
[Figure: each 2D filter F_ij (F_11 ... F_1M through F_i1 ... F_iM) gets its own int / frac split, for example 4 / 3, 3 / 4, 2 / 5, or 1 / 6.]
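The overhead side of the trade-off can be illustrated by counting how many int/frac splits a layer has to store at each granularity. The layer dimensions below are hypothetical, and the function only counts format descriptors; it says nothing about the hardware cost of switching formats.

```python
def split_count(granularity, num_output_maps, num_input_maps):
    """Number of int/frac splits one convolutional layer must store."""
    if granularity == "4D":
        return 1                                 # one split for the whole kernel
    if granularity == "3D":
        return num_output_maps                   # one split per 3D kernel
    if granularity == "2D":
        return num_output_maps * num_input_maps  # one split per 2D filter
    raise ValueError(f"unknown granularity: {granularity}")


N, M = 256, 128   # hypothetical layer: N output maps, M input maps
counts = {g: split_count(g, N, M) for g in ("4D", "3D", "2D")}
print(counts)
```

Finer granularity lets each split track its own coefficient range more closely, but the bookkeeping grows from 1 to N to N x M entries per layer.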

Cadence Quantization of Network Zoo
Network          Top-5 Error [%]   Top-1 Error [%]
                 FLP    8b FXP*    FLP    8b FXP*
GoogLeNet        13.5   13.5       34.9   35.0
VGG-VeryDeep-19  10.7   10.7       29.5   29.7
Cactus-it6       8.2    8.5        25.3   26.1
ResNet-50        8.3    8.6        25.1   25.8
ResNet-101       7.6    8.2        24.0   25.3
ResNet-152       7.2    7.5        23.5   24.1
Cactus-it6: ~2x fewer MACs than ResNet-50, ~10x fewer MACs than VGG-19
*Estimates based on 8b fixed-point coefficients & data, 32b floating-point accumulators

CNN 4b Quantization
Method             Data  Coefficients      Computational Complexity Savings vs. 8b x 8b  AlexNet Top-1 Error
Conventional       8b    8b                -                                             42.5%
Hybrid             8b    8b: 54%, 4b: 46%  33%                                           43.1%
Theoretical Limit  8b    8b: 0%, 4b: 100%  50%

In Summary

Take Away Points
• CNNs will achieve their full potential with an O(10^3) to O(10^4) improvement in power efficiency.
• DSP techniques can decrease CNN compute and data movement by O(10^1) to O(10^2).

Selected References & Resources
Network Compression
[1] Han, Song, Huizi Mao, and William J. Dally. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." arXiv preprint arXiv:1510.00149 (2015).
[2] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size." arXiv preprint arXiv:1602.07360 (2016).
Network Quantization
[3] Gysel, Philipp. "Ristretto: Hardware-oriented approximation of convolutional neural networks." arXiv preprint arXiv:1605.06402 (2016).
[4] Warden, P. "How to quantize neural networks with tensorflow." https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
Segmentation
[5] Badrinarayanan, Vijay, Ankur Handa, and Roberto Cipolla. "Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling." arXiv preprint arXiv:1505.07293 (2015).
[6] Oliveira, Gabriel L., Wolfram Burgard, and Thomas Brox. "Efficient deep models for monocular road segmentation." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016.
MatConvNet
[7] Vedaldi, Andrea, and Karel Lenc. "Matconvnet: Convolutional neural networks for matlab." Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015.
Benchmarks
[8] Russakovsky, Olga, et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115.3 (2015): 211-252.
[9] Stallkamp, Johannes, et al. "Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition." Neural networks 32 (2012): 323-332.

© 2017 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo and the other Cadence marks found at www.cadence.com/go/trademarks are trademarks or registered trademarks of Cadence Design Systems, Inc. All other trademarks are the property of their respective holders.
