Pruning Convolutional Neural Networks for Resource Efficient Inference
Presented by: Kaushalya Madhawa
27th January 2017
Molchanov, Pavlo, et al. "Pruning Convolutional Neural Networks for Resource Efficient
Transfer Learning." arXiv preprint arXiv:1611.06440 (2016).
The paper
2
● Will be presented
at ICLR 2017 -
24-26th April
● Anonymous
reviewer ratings
○ 9
○ 6
○ 6
https://openreview.net/forum?id=SJGCiw5gl
Optimizing neural networks
Goal: running trained neural networks on mobile devices
1. Designing optimized networks from scratch
2. Optimizing pre-trained networks
Deep Compression (Han et al.)
3
Optimizing pre-trained neural networks
Reasons for pruning pre-trained networks
Transfer learning: fine-tuning an existing deep neural network
previously trained on a larger related dataset results in
higher accuracies
Objectives of pruning:
Improving the speed of inference
Reducing the size of the trained model
Better generalization
4
Which parameters should be pruned?
Saliency: a measure of importance
Parameters with the least saliency will be deleted
“Magnitude equals saliency”
Parameters with smaller magnitudes have low saliency
Criteria for pruning (both magnitude criteria are sketched below)
Magnitude of weight
a convolutional kernel with a low l2 norm detects less important
features than one with a high norm
Magnitude of activation
if an activation value is small, then this feature detector is not
important for the prediction task
Pruning the parameters which have the least effect on the trained network
5
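A minimal NumPy sketch of the two magnitude-based criteria above; the layer shapes and variable names are illustrative assumptions, not taken from the paper's implementation:

import numpy as np

# Hypothetical conv layer: 64 output feature maps, 3 input channels,
# 3x3 kernels; plus its activations for a batch of 32 images.
weights = np.random.randn(64, 3, 3, 3)         # (out_maps, in_maps, kH, kW)
activations = np.random.randn(32, 64, 28, 28)  # (batch, out_maps, H, W)

# "Magnitude of weight": l2 norm of each kernel; maps whose kernels have
# a small norm are assumed to detect less important features.
weight_saliency = np.sqrt((weights ** 2).sum(axis=(1, 2, 3)))

# "Magnitude of activation": mean absolute activation of each feature map
# over the batch and all spatial positions.
activation_saliency = np.abs(activations).mean(axis=(0, 2, 3))

# Feature maps with the lowest saliency become pruning candidates.
prune_candidates = np.argsort(weight_saliency)[:5]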
Contributions of this paper
New saliency measure based on the
first-order Taylor expansion
Significant reduction in floating point
operations (FLOPs)
without a significant loss in accuracy
Oracle pruning as a general method to
compare network pruning criteria
7
Pruning as an optimization problem
Find a subset of parameters which preserves the accuracy of
the trained network
Impractical to solve this combinatorial optimization problem
for current networks
e.g., VGG-16 has 4,224 convolutional feature maps
8
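The paper avoids this combinatorial search with a greedy, iterative procedure: score feature maps with a saliency criterion, remove the least important one, fine-tune briefly, and repeat. A rough sketch of that loop, with every network-specific operation passed in as a placeholder callable (none of these names come from the authors' code):

import numpy as np

def greedy_prune(net, data, compute_saliency, remove_feature_map,
                 finetune, count_flops, target_flops, updates_per_step=30):
    """Greedy iterative pruning sketch: drop the least salient feature map,
    fine-tune for a few mini-batch SGD updates, and repeat until the
    network fits a FLOPs budget."""
    while count_flops(net) > target_flops:
        # One saliency score per remaining feature map (e.g. the Taylor criterion).
        saliency = compute_saliency(net, data)
        # Remove the single feature map with the smallest score.
        remove_feature_map(net, int(np.argmin(saliency)))
        # Brief recovery phase; the later slides use 30 updates for
        # VGG-16 on Birds-200 and 10 for AlexNet on Flowers-102.
        finetune(net, data, num_updates=updates_per_step)
    return net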
Taylor series approximation
A first-order Taylor expansion is used to approximate the change in the
loss function caused by removing a particular parameter (h_i)
Parameters are assumed to be independent
First-order Taylor polynomial:
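For reference, the criterion can be written out as follows (a reconstruction from the slide's description and the cited paper, where C is the cost function, D the data, and h_i the activation of the candidate feature map):

\[
\Theta_{TE}(h_i) \;=\; \bigl|\Delta C(h_i)\bigr|
\;=\; \bigl|C(\mathcal{D}, h_i = 0) - C(\mathcal{D}, h_i)\bigr|
\;\approx\; \left|\frac{\partial C}{\partial h_i}\, h_i\right|
\]

For a multi-dimensional feature map, the gradient-activation product is averaged over the map's spatial positions before taking the absolute value.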
9
Optimal Brain Damage (Le Cun et al., 1990)
The change of the loss function is approximated by a second-order Taylor
polynomial
10
The effects of parameters are assumed to be independent
Parameter pruning is performed once training has converged
OBD is 30 times slower than the proposed Taylor method for
saliency calculation
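For comparison, OBD's saliency of a parameter u_k keeps only the diagonal Hessian terms; since the gradient terms vanish once training has converged, the change in loss and the resulting saliency are approximately (following Le Cun et al., 1990):

\[
\delta C \;\approx\; \tfrac{1}{2}\sum_{k} h_{kk}\,\delta u_k^2,
\qquad
s_k \;=\; \tfrac{1}{2}\, h_{kk}\, u_k^2,
\qquad
h_{kk} = \frac{\partial^2 C}{\partial u_k^2}
\]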
Experiments
Data sets
Flowers-102
Birds-200
ImageNet
Implemented using Theano
Layerwise l2-normalization
FLOPs regularization
Feature maps in different layers require different amounts of
computation, depending on the number of input feature maps and the kernel size
11
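A minimal sketch of these two adjustments, assuming saliency maps each layer to a 1-D array of raw per-feature-map scores and flops_per_map gives the approximate FLOPs contributed by one feature map of that layer; the names and the regularization weight lam are illustrative, not taken from the paper's code:

import numpy as np

def adjust_saliency(saliency, flops_per_map, lam=1e-3):
    """Layer-wise l2 rescaling so scores are comparable across layers,
    followed by a FLOPs penalty that favours pruning expensive maps."""
    adjusted = {}
    for layer, scores in saliency.items():
        # l2-normalize within the layer.
        scores = scores / (np.linalg.norm(scores) + 1e-8)
        # Subtract a penalty proportional to this layer's per-map FLOPs cost.
        adjusted[layer] = scores - lam * flops_per_map[layer]
    return adjusted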
Experiments...
Compared against
Oracle pruning: the effect of removing each parameter is computed,
and the one with the least effect on the cost function is
pruned at each iteration
Optimal Brain Damage (OBD)
Minimum weight
Magnitude of activation
Mean
Standard deviation
Average Percentage of Zeros (APoZ): neurons with a low average
percentage of positive activations are pruned (Hu et al., 2016)
12
Feature maps at the first few layers have similar APoZ regardless of the
network’s target
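The APoZ criterion of Hu et al. (2016), as summarized above, can be sketched as follows for post-ReLU activations collected over a held-out set (the array shape is an assumption for illustration):

import numpy as np

def apoz(activations):
    """Average Percentage of Zeros per feature map.
    activations: post-ReLU values of shape (num_examples, num_maps, H, W).
    Maps with a high APoZ (few positive activations) are pruning candidates."""
    return (activations == 0).mean(axis=(0, 2, 3))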
Results
Spearman rank correlation against the oracle ranking, calculated for each
criterion
13
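The agreement with the oracle can be measured with SciPy's spearmanr; a toy example with made-up saliency scores:

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
oracle_scores = rng.random(100)                                # hypothetical oracle scores
criterion_scores = oracle_scores + 0.1 * rng.normal(size=100)  # a noisy criterion

rho, _ = spearmanr(oracle_scores, criterion_scores)
print(f"Spearman rank correlation with the oracle: {rho:.3f}")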
Layerwise contribution to the loss
Oracle pruning on VGG-16 trained on Birds-200 dataset
Layers with max-pooling tend to be more important than those
without (layers 2, 4, 7, 10, and 13)
14
Importance of normalization across layers
15
Pruning VGG-16 (Simonyan & Zisserman, 2015)
16
(Figure: accuracy plotted against parameters and FLOPs; curves for OBD and for a network with 50% of the original parameters trained from scratch)
● Pruning of feature maps in VGG-16 trained on the Birds-200
dataset (30 mini-batch SGD updates after pruning a feature
map)
Pruning AlexNet (Krizhevsky et al., 2012)
● Pruning of feature maps in AlexNet trained on the Flowers-102
dataset (10 mini-batch SGD updates after pruning a feature
map)
17
Speedup of networks pruned by Taylor criterion
18
● All experiments performed in Theano with cuDNN v5.1.0
Conclusion
An efficient saliency measure to decide which
parameters can be pruned without a significant loss
of accuracy
Provides a thorough evaluation of many aspects of
network pruning
A theoretical explanation of how the gradient
carries information about the magnitude of the
activations is needed
19
References
[1] Molchanov, Pavlo, et al. "Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning." arXiv preprint arXiv:1611.06440, 2016.
[2] Hu, Hengyuan, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. "Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures." arXiv preprint arXiv:1607.03250, 2016.
[3] Han, Song, Huizi Mao, and William J. Dally. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." In ICLR, 2016.
[4] Simonyan, Karen, and Andrew Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." In ICLR, 2015.
[5] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, 2012.
20
Editor's Notes

  • #17: 6,000 training images, 5,700 test images, 200 species. Learning rate 0.0001, 60 epochs.