"Designing CNN Algorithms for Real-time Applications," a Presentation from Almond AI

Copyright © 2017 Almond Research Pte Ltd.
Matthew Chiu
May 2nd 2017
Designing CNN Algorithms for Real-time Applications

• Consider 3 stages: Dataset – Model – Deployment
• Understand relationship between size (depth) and
performance of Deep Neural Nets.
• Share weights to reduce number of parameters and
operations.
• Reduce total number of computations during model inference
by factoring the convolution layer
Optimization Pipeline

• Feature Sharing
• “Distilling the Knowledge in a Neural Network”
• Very deep networks: Inception, ResNet, Highway networks, etc.
• Densely Connected Networks (2016)
• ResNet, Spatially Adaptive Computation Time (2017)
• Pruning, Weight Sharing, Compression
• “Deep Compression: Compressing Deep Neural Networks with
Pruning, Trained Quantization and Huffman Coding”
Relevant Research

• 2D Convolutional Layer (multiply-add)
• Not counting activation, biases, batch norm
• Cost of Fully Connected Layers is a matrix multiply + bias
• Well studied: Pooling, Factorization
Review Computational Cost
Cost (multiply-adds) = $[\text{height} \times \text{width}] \times (k / \text{stride})^2 \times (\#\text{inmaps} \times \#\text{outmaps})$
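The cost formula above can be written as a small helper. This is a minimal sketch: the function name and the example layer (3x3 kernel, 96x96 input, 1 to 32 feature maps) are illustrative assumptions, not values taken from the slides.

```python
def conv2d_multiply_adds(height, width, k, stride, in_maps, out_maps):
    """Approximate multiply-adds for one 2D convolutional layer.

    Follows the slide's formula: activations, biases, and batch norm are
    not counted, and border/padding effects are ignored.
    """
    positions = (height * width) / (stride ** 2)  # output locations
    return positions * (k ** 2) * in_maps * out_maps


# Hypothetical layer: 3x3 kernel, stride 1, 96x96 input, 1 -> 32 feature maps.
ops = conv2d_multiply_adds(96, 96, k=3, stride=1, in_maps=1, out_maps=32)
print(f"{ops:,.0f} multiply-adds")
```

Doubling the stride divides the count by four, which is why pooling and strided convolutions reduce cost so quickly.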

Size vs. Performance (ImageNet)
Model          Top-5 Error Rate (%)   Depth   Parameters   Runtime (ms)³
AlexNet¹       19.9                   8       62M          5.32
VGG²           9.33                   16      144M         46.16
Inception²     9.15                   22      6.8M         11.94
ResNet-101²    6.05                   110     1.7M         53.28
¹ ImageNet 2012
² ImageNet 2014
³ Pascal Titan X, Torch, cuDNN 5, forward pass only
https://github.com/jcjohnson/cnn-benchmarks

GoogLeNet Inception
Reduce the number of input feature maps (using 1x1 convolutions) before applying the 3x3 and 5x5 convolutions
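To see why reducing the depth first pays off, the sketch below compares a direct 5x5 convolution with a 1x1 reduction followed by the same 5x5 convolution. The layer sizes (28x28 maps, 192 inputs reduced to 16, 32 outputs) are illustrative assumptions, and the helper name is hypothetical.

```python
def conv_ops(height, width, k, in_maps, out_maps):
    """Multiply-adds for a stride-1 convolution, ignoring biases."""
    return height * width * k * k * in_maps * out_maps


# Assumed example sizes: 28x28 feature maps, 192 input maps, 32 output maps.
direct = conv_ops(28, 28, 5, 192, 32)

# Same output, but a 1x1 convolution first reduces 192 maps to 16.
reduced = conv_ops(28, 28, 1, 192, 16) + conv_ops(28, 28, 5, 16, 32)

print(direct, reduced, round(direct / reduced, 1))
```

With these assumed sizes the reduced path needs roughly a tenth of the multiply-adds of the direct 5x5 convolution.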

Densely Connected Networks (Huang 2016)
• Factorize a convolutional layer into multiple (12) small blocks of filters.
• The input for each block is a concatenation of the feature maps produced by all previous blocks.
• The output of each block is concatenated with the input.
• Recombine filter maps using a 1x1 convolution at the end of each stage.
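The concatenation rule above determines how deep the input to each block becomes. A minimal sketch, with a hypothetical helper name; the example sizes (48 input maps, 12 filters per block, 4 blocks) match one stage of the parameter table that appears later in the deck.

```python
def dense_stage_channels(in_maps, filters_per_block, num_blocks):
    """Return the input depth seen by each block and the final depth.

    Each block's output is concatenated onto its input, so the depth
    grows by `filters_per_block` after every block.
    """
    block_inputs = []
    depth = in_maps
    for _ in range(num_blocks):
        block_inputs.append(depth)
        depth += filters_per_block
    return block_inputs, depth


block_inputs, out_maps = dense_stage_channels(48, filters_per_block=12, num_blocks=4)
print(block_inputs, out_maps)
```

The 1x1 convolution at the end of the stage then maps the concatenated depth back to whatever width the next stage expects.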

Densely Connected: Lowest Error on CIFAR-10/100
[Figure not recovered; one axis was labeled "(millions)"]

Example of a Real-time CNN

• Kaggle Competition 2016 : Estimating Facial Keypoints
Real-time CNN : Finding Facial Keypoints
• Training set of 7049 faces, 96x96 grayscale
• Problem is to learn the (x, y) position of 15 facial keypoints:
Eyes (6), Eyebrows (4), Nose (1), Lips (4)

• Goal is to use all of the data, including incomplete samples, to train a bigger network.
• Incomplete labels mean a separate network has to be trained for each subset of
facial keypoints. Use data augmentation (horizontal and vertical shifting).
• Solution is to share weights between the networks.
Dataset: Incomplete Data
Facial Keypoints    Train (Shifting)   Validation
Eyes (Corners)      1795 (6.6%)        449
Eyebrows            1753 (10%)         437
Mouth (Corners)     1801 (6.6%)        445
Smiley ☺            5605 (10%)         1395
Complete            1717 (15%)         423
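Shifting augmentation has to move the keypoint labels together with the pixels. The sketch below shows one way to do that on a plain list-of-rows image; the function name, the zero fill value, and the tiny 4x4 example are assumptions for illustration, and the slides do not say how the shift amounts were sampled.

```python
def shift_sample(image, keypoints, dx, dy, fill=0):
    """Shift a 2D image (list of rows) by (dx, dy) pixels and move keypoints with it.

    Pixels shifted in from outside the frame are set to `fill`.
    """
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_x, src_y = x - dx, y - dy
            if 0 <= src_x < width and 0 <= src_y < height:
                shifted[y][x] = image[src_y][src_x]
    moved = [(kx + dx, ky + dy) for kx, ky in keypoints]
    return shifted, moved


image = [[0] * 4 for _ in range(4)]
image[1][1] = 255  # single bright pixel at x=1, y=1
shifted, moved = shift_sample(image, [(1, 1)], dx=2, dy=1)
print(moved)
```

In practice the shift would be drawn at random for each sample, and keypoints pushed outside the frame would need to be clipped or the sample discarded.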

• Train on 2100 faces, only the samples with all 15 keypoints.
• Baseline: 6 layers (3 convolutional, 3 fully connected), trained on the 1700
complete cases only.
• Network size is 32 – 64 – 128 – 500
Model I: Baseline Network

Model II : Siamese Network
Network size: 32 – 64 – 128 – 192
Concatenate the outputs of the last FC layers of each network.
* The network is trained sequentially on each task.
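The weight-sharing idea can be sketched structurally: the convolutional trunk runs once per image and every task-specific head reuses its features. Everything below is a stand-in, not the author's network: the trunk and heads are placeholder functions, and splitting the heads by keypoint group (eyes, eyebrows, nose, lips) is an assumption made for illustration.

```python
calls = {"trunk": 0}


def shared_trunk(image):
    """Stand-in for the shared convolutional layers; returns a feature vector."""
    calls["trunk"] += 1
    return [sum(row) / len(row) for row in image]  # placeholder features


def make_head(num_keypoints):
    """Stand-in for a task-specific fully connected head."""
    def head(features):
        mean = sum(features) / len(features)
        return [mean] * (2 * num_keypoints)  # one (x, y) pair per keypoint
    return head


# Assumed split of the 15 keypoints into heads.
heads = {"eyes": make_head(6), "eyebrows": make_head(4),
         "nose": make_head(1), "lips": make_head(4)}


def predict(image):
    features = shared_trunk(image)  # convolutional work is done once
    output = []
    for name in heads:              # every head reuses the same features
        output.extend(heads[name](features))
    return output


prediction = predict([[0.0] * 96 for _ in range(96)])
print(len(prediction), calls["trunk"])
```

The point of the structure is that adding a task adds only a head, while the expensive convolutional operations are not repeated.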

Model III: Shared Densely Connected Network
Parameters: N=12 (filters per block), K=3 (blocks per stage)
Each block learns 12 filters at a time. The output is concatenated to the input
to increase the depth of the network. N = 24 in the final stage of shared network.

Densely Connected Model Parameters
Convolution  Input  Output  N   K   Image Size  Ops  Params   Notes
2x2          32     56      8   x3  94x94       92M  10,476   -
1x1          56     48      -   -   -           -    -        1/2 (L2 Weight Regularization)
2x2          48     96      12  x4  47x47       82M  37,152   -
1x1          96     96      -   -   -           -    -        1/2
2x2          96     132     12  x3  23x23       28M  53,232   -
1x1          132    128     -   -   -           -    -        -
2x2          128    200     24  x3  23x23       70M  131,520  -
1x1          200    192     -   -   -           -    -        1/2
-            192    -       -   -   11x11       -    -        -
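Two quick consistency checks on the table: each dense stage's output depth should equal its input depth plus N filters for each of the K blocks, and the per-stage parameter counts should add up to the convolutional parameter total reported for the Densely Connected model in the results. The tuple layout is an assumption made for this sketch; the numbers are copied from the table.

```python
# (input maps, filters per block N, blocks K, output maps, params) per dense stage.
stages = [
    (32, 8, 3, 56, 10_476),
    (48, 12, 4, 96, 37_152),
    (96, 12, 3, 132, 53_232),
    (128, 24, 3, 200, 131_520),
]

for in_maps, n, k, out_maps, _ in stages:
    # Concatenation adds N maps per block, so output = input + N * K.
    assert in_maps + n * k == out_maps

total_params = sum(params for *_, params in stages)
print(total_params)
```

The per-stage Ops and Params values themselves are taken as given here; the sketch does not attempt to rederive them.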

Results on Facial Keypoint Task
Model                   RMS Error (Training)   RMS Error (Testing)   Depth   Operations   Conv. Params
Baseline                1.68                   2.5                   6       75M          61,760
Siamese 32-64-128-192   1.80                   2.07                  9       192M         160,064
Siamese 48-96-192-192   1.70                   2.18                  9       297M         286,176
Densely Connected       2.13                   2.29                  23      366M         232,380
RMS = Root Mean Square
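For reference, RMS error over predicted and ground-truth coordinates can be computed as below. This is a generic sketch, assuming errors are measured in pixels over flattened (x, y) values; the function name and sample numbers are illustrative.

```python
import math


def rms_error(predicted, target):
    """Root mean square error over paired coordinate values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values: two coordinates are off by 3 and 4 pixels, two are exact.
print(rms_error([13.0, 24.0, 30.0, 40.0], [10.0, 20.0, 30.0, 40.0]))
```

Because the errors are squared before averaging, a few badly placed keypoints raise the RMS error more than many small misses do.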

• When the training set is small (1717 photos), we must test whether the network
can actually perform the task in general. For feature tracking, measure
translational invariance: use a validation set with randomized
horizontal and vertical shifting.
• The Siamese model performs best, with a test RMS error of 2.07. In
comparison, the baseline was the worst at generalization. Increasing the
number of filters or adding more depth (Densely Connected) only
increased overfitting because there is not enough training data.
• The mean absolute error for the Siamese model was already 0.75 pixels.
Targeting sub-pixel accuracy can also increase the risk of overfitting or
“memorization”. Improving generalization is a more suitable goal.
Results Analysis
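One simple way to read the results is the gap between testing and training error for each model. The sketch below uses the RMS errors from the results table; treating the gap as a rough generalization indicator is an assumption of this example, not a metric defined in the slides.

```python
# (training RMS error, testing RMS error) from the results table.
results = {
    "Baseline": (1.68, 2.5),
    "Siamese 32-64-128-192": (1.80, 2.07),
    "Siamese 48-96-192-192": (1.70, 2.18),
    "Densely Connected": (2.13, 2.29),
}

# Gap between testing and training error, rounded to two decimals.
gaps = {name: round(test - train, 2) for name, (train, test) in results.items()}
best_test = min(results, key=lambda name: results[name][1])
largest_gap = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name:24s} gap = {gap:.2f}")
print("lowest test error:", best_test)
print("largest train/test gap:", largest_gap)
```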

Results (Siamese 32-64-128-192)
[Figure panels: Predicted, Ground Truth, Predicted, Ground Truth]

Deployment Benchmarks: Frame Rates
Platform           Baseline   Siamese   Dense
TX1 (Caffe)        476.02     105.10    36.18
Titan X (Theano)   470.28     287.33    71.03
Chart captions, in order: Batch size 32, Batch size 64
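For a real-time budget it is often easier to think in milliseconds per frame than in frame rates. The sketch below converts the benchmark numbers, assuming they are frames per second averaged over a batch; that reading of the charts is an assumption, and per-image latency at batch size 1 may differ.

```python
# Frame rates from the deployment benchmark charts.
frame_rates = {
    "TX1 (Caffe)": {"Baseline": 476.02, "Siamese": 105.10, "Dense": 36.18},
    "Titan X (Theano)": {"Baseline": 470.28, "Siamese": 287.33, "Dense": 71.03},
}


def ms_per_frame(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


for platform, models in frame_rates.items():
    for model, fps in models.items():
        print(f"{platform:18s} {model:9s} {ms_per_frame(fps):6.2f} ms/frame")
```

A 30 frames-per-second target leaves about 33 ms per frame, so under this reading every configuration listed stays within that budget.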

• Ideally, we can train a very deep network with lots of training
data. In reality, we must be careful about overfitting.
• Siamese model is a way to increase the training set size
while sharing weights (convolutional filters) to reduce the
convolutional operations.
• For a Densely Connected model, depth can be increased
with a small increase in the total number of operations.
• Actual performance on the runtime platform may vary
depending on your choice of DNN library.
Conclusions

• https://www.kaggle.com/c/facial-keypoints-detection
• http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
• Spatially Adaptive Computation Time for Residual Networks. Figurnov, Collins,
Zhu, Zhang, Huang, Vetrov, Salakhutdinov
• Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding
• Distilling the Knowledge in a Neural Network. Geoffrey Hinton, Oriol Vinyals, Jeff
Dean
• Going Deeper with Convolutions. Szegedy et al.
• Densely Connected Convolutional Networks. Gao Huang, Zhuang Liu, Kilian Q.
Weinberger
Resources

• Conceptually complicated, but made of repeating blocks. Each of the n blocks in a
convolutional layer learns only a constant number of filters k:
$\text{Input} \cdot k + (\text{Input} + k) \cdot k + \dots + (\text{Input} + (n-1)k) \cdot k$
• $O(\text{runtime}) = \left(\text{Input} \cdot nk + \frac{n^2 k^2}{2}\right) \cdot (\text{image size} \cdot \text{kernel size})$
• For fixed $nk < \text{Input} / 2$, the cost stays below that of a
• Convolutional layer with output = input layers:
• $O(\text{Input}^2 \cdot (\text{image size} \cdot \text{kernel size}))$
Appendix: Analysis of Densely Connected Networks
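The analysis above can be checked numerically. The sketch compares the exact block-by-block sum, the closed-form approximation, and a conventional layer with as many output maps as input maps; the common factor (image size × kernel size) is left out because it multiplies all three. Function names and the example values (Input = 96, n = 3, k = 12) are assumptions for illustration.

```python
def dense_stage_cost(in_maps, n, k):
    """Exact channel-product sum for n blocks that each add k filters."""
    return sum((in_maps + i * k) * k for i in range(n))


def dense_stage_cost_approx(in_maps, n, k):
    """Closed-form approximation: Input*(n*k) + (n*k)^2 / 2."""
    return in_maps * n * k + (n * k) ** 2 / 2


def plain_conv_cost(in_maps):
    """Conventional layer with as many output maps as input maps."""
    return in_maps * in_maps


in_maps, n, k = 96, 3, 12  # n*k = 36 < Input / 2 = 48
exact = dense_stage_cost(in_maps, n, k)
approx = dense_stage_cost_approx(in_maps, n, k)
plain = plain_conv_cost(in_maps)
print(exact, approx, plain)
```

The approximation slightly overestimates the exact sum because it replaces n(n-1)/2 with n²/2, and both stay below the conventional layer when n·k is under half the input depth.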

• Reduce amount of memory required to store weights
(Pruning 10x, Weight Sharing 30x, Compression 50x)
• Also reduces Energy Consumption
• Speedup on Fully Connected Layers by Sparse Matrix
Multiply
Appendix: Weight Storage Compression
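To put the reduction factors in concrete terms, the sketch below applies them to the 62M-parameter AlexNet from the ImageNet comparison. Storing weights as 32-bit floats and applying each factor directly to the uncompressed size are assumptions of this example.

```python
BYTES_PER_WEIGHT = 4  # assumption: weights stored as 32-bit floats


def storage_mb(num_params, reduction=1.0):
    """Weight storage in megabytes after dividing by a reduction factor."""
    return num_params * BYTES_PER_WEIGHT / reduction / 1e6


params = 62_000_000  # AlexNet parameter count from the ImageNet comparison
factors = [("Uncompressed", 1), ("Pruning", 10),
           ("Weight Sharing", 30), ("Compression", 50)]
for label, factor in factors:
    print(f"{label:15s} {storage_mb(params, factor):7.1f} MB")
```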