
Fractional step discriminant pruning


This is the presentation for the paper "Fractional Step Discriminant Pruning: A Filter Pruning Framework for Deep Convolutional Neural Networks", delivered by N. Gkalelis and V. Mezaris at the 7th IEEE Int. Workshop on Mobile Multimedia Computing (MMC2020) that was held as part of the IEEE Int. Conf. on Multimedia and Expo (ICME), in July 2020.



  1. Fractional step discriminant pruning: a filter pruning framework for deep convolutional neural networks
     N. Gkalelis, V. Mezaris
     CERTH-ITI, Thermi - Thessaloniki, Greece
     IEEE Int. Conf. on Multimedia & Expo Workshops, 7th MMC, London, United Kingdom, July 2020
     retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
  2. Outline
     • Problem statement
     • Related work
     • Filter importance measure
     • Fractional step pruning strategy
     • Experiments
     • Conclusions
  3. Problem statement
     • Deep convolutional neural networks (DCNNs) are seeing significant commercial deployment due to their breakthrough classification performance in many machine learning tasks:
       • Multimedia understanding
       • Self-driving cars
       • Edge computing
     [Images: application examples; credits: V2Gov and [1]]
     [1] Chen, J., Ran, X.: Deep Learning With Edge Computing: A Review, Proc. of the IEEE, vol. 107, no. 8, Aug. 2019
  4. Problem statement
     • The deployment of DCNNs in resource-limited or real-time applications is still challenging due to their high inference time and storage requirements
     • DCNNs are highly overparameterized, and methods that reduce their capacity may even be beneficial for their performance [2]
     → How can we reduce the size of DCNNs while retaining their generalization performance?
     [2] Arora, S., Ge, R., Neyshabur, B., Zhang, Y.: Stronger generalization bounds for deep nets via a compression approach, ICML, 2018
  5. Related work
     • DCNN compression and acceleration methods can be categorized into: a) pruning, b) low-rank factorization, c) compact convolutional filters, d) knowledge distillation [3, 4]
     • Filter pruning is receiving increasing attention because: a) it achieves high compression rates with small performance degradation, b) it is complementary to the methods of the other three categories
     • A filter pruning framework consists of: a) a filter importance estimation criterion, usually smaller-norm-less-important, b) a pruning strategy, usually an iterative one (training, pruning, retraining, …)
     [3] K. Ota, M.S. Dao, V. Mezaris, F.G.B. De Natale: Deep Learning for Mobile Multimedia: A Survey, ACM Trans. Multimedia Computing Communications & Applications (TOMM), vol. 13, no. 3s, June 2017
     [4] Y. Cheng, D. Wang, P. Zhou, T. Zhang: Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126-136, Jan. 2018
  6. Related work
     • In [5], it is shown that pruning filters with a small l2-norm may have a negative impact on the network's performance
     • FPGM is proposed, utilizing a Geometric Median (GM) based measure
     • FPGM selects a fraction of the filters using the l2-norm (usually 10%) and the rest using the GM-based measure
     → An iterative strategy is used (training, pruning, retraining, …), where all filters corresponding to the target pruning rate are pruned at each iteration
     [5] Y. He, P. Liu, Z. Wang, Z. Hu, Y. Yang: Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration, CVPR, 2019
  7. Related work
     • In [6], it is shown that the iterative pruning strategy, where all selected filters are set to zero from the first iteration, may lead to unrecoverable information loss
     • Asymptotic pruning strategy: an iterative strategy, but the number of selected filters at each iteration varies asymptotically toward the target pruning rate
     → The l2-norm measure is used to select the filters at each iteration
     [6] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan, Y. Yang: Asymptotic Soft Filter Pruning for Deep Convolutional Neural Networks, IEEE Trans. on Cybernetics, pp. 1-11, Aug. 2019
  8. Overview of proposed method
     • Motivated by limitations in recent works [5, 6] and related research findings in shallow learning [7, 8, 9], we extend [6] by:
     • Replacing the l2-norm-based criterion with: a) a Class-Separability (CS) based measure exploiting the labelling information of annotated training datasets [7, 8, 9], b) the GM-based measure [5]
     • Applying a fractional step pruning strategy: not only the number of selected filters but also their weights vary asymptotically toward their target value
     [7] N. Gkalelis, V. Mezaris, I. Kompatsiaris, T. Stathaki: Mixture Subclass Discriminant Analysis Link to Restricted Gaussian Model and Other Generalizations, IEEE Trans. Neural Networks and Learning Systems, vol. 24, no. 1, pp. 8-21, Jan. 2013
     [8] R. Lotlikar, R. Kothari: Fractional-step dimensionality reduction, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp. 623-627, June 2000
     [9] K. Fukunaga: Introduction to statistical pattern recognition (2nd ed.), Academic Press Professional, Inc., San Diego, CA, USA, 1990
  9. Importance measure 1: CS-based
     • Suppose an annotated training dataset of n observations and m classes
     • Let X_k^(i,j) be the feature map of the k-th observation at the j-th filter of the i-th layer
     • The feature maps are vectorized and stacked to form the data matrix X^(i,j) for filter (i,j):
       X^(i,j) = [x_1^(i,j), …, x_n^(i,j)],  x_k^(i,j) = vec(X_k^(i,j))
     • A filter discriminant score is then computed as
       η^(i,j) = tr(S^(i,j)),
       S^(i,j) = Σ_{p=1}^{m-1} Σ_{q=p+1}^{m} (μ_p^(i,j) − μ_q^(i,j)) (μ_p^(i,j) − μ_q^(i,j))^T,
     where S^(i,j) is the between-class scatter matrix for filter (i,j) (it can be computed efficiently; see paper for details) and μ_p^(i,j) is the mean vector of class p (class labels are used to compute the means)
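The CS-based score above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's optimized one; it uses the identity tr((μ_p − μ_q)(μ_p − μ_q)^T) = ||μ_p − μ_q||², so the trace of the between-class scatter reduces to a sum of squared distances between class means:

```python
import numpy as np

def cs_score(feature_maps, labels):
    """Class-separability score of one filter: the trace of the
    between-class scatter matrix of its vectorized feature maps.
    feature_maps: (n, H, W) responses of a single filter for n
    observations; labels: (n,) integer class labels."""
    X = feature_maps.reshape(len(feature_maps), -1)  # vectorize each map
    classes = np.unique(labels)
    # per-class mean vectors mu_p
    means = np.stack([X[labels == c].mean(axis=0) for c in classes])
    # tr(S) = sum over class pairs (p < q) of ||mu_p - mu_q||^2
    score = 0.0
    for p in range(len(classes) - 1):
        for q in range(p + 1, len(classes)):
            d = means[p] - means[q]
            score += float(d @ d)
    return score
```

Filters whose scores are smallest extract little class-discriminant information and would be the pruning candidates.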
 10. Importance measure 1: CS-based
     • tr(S^(i,j)) quantifies the distance among class distributions using the features produced by the corresponding filter [7, 8, 9]
     • A large value indicates that the filter extracts discriminant features for separating the classes
     • In contrast, filters that extract noise or features irrelevant to the classification task attain very small CS values and can be discarded safely
     [Figure: feature distributions of two filters (i,1) and (i,2) with class means μ_1, μ_2; tr(S^(i,1)) is large, while tr(S^(i,2)) is very small: despite a possibly large l2-norm ||v^(i,2)|| > ||v^(i,1)||, filter (i,2) can be safely discarded]
 11. Importance measure 2: GM-based
     • For large pruning rates, the CS-based criterion may eliminate filters that extract features with small but still important discriminant information
     • The GM-based measure identifies the most replaceable filters in a layer [5]:
       η^(i,j) = Σ_{o=1}^{c_i} ||v^(i,j) − v^(i,o)||,
     where v^(i,j) is the vectorized weight of the j-th filter in the i-th layer and c_i is the number of filters in that layer
     • Combined selection strategy: select a fraction of the filters using the CS-based measure and another fraction using the GM-based one
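The GM-based measure is a one-liner in NumPy (a minimal sketch, assuming the layer's filters are already given as vectorized weight rows):

```python
import numpy as np

def gm_score(filters, j):
    """GM-based replaceability score of filter j within a layer:
    the sum of l2 distances from filter j to all filters in the
    layer [5]. A small score means filter j lies close to the
    others (near their geometric median) and is most replaceable.
    filters: (c_i, d) array of vectorized filter weights."""
    return float(np.linalg.norm(filters - filters[j], axis=1).sum())
```

Note the contrast with the CS measure: this criterion uses only the filter weights, not labelled feature maps, which is why the two can be combined.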
 12. Fractional step pruning strategy
     • Let ε and θ be the total number of epochs and the target pruning rate
     • The pruning rate θ_ι and scaling factor ζ_ι at epoch ι are computed as:
       θ_ι = α e^(−βι) + γ,  ζ_ι = 1 − θ_ι / θ
     • The parameters α, β, γ are estimated using 3 known points, similarly to [6]
     • The individual pruning rates for the CS- and GM-based criteria are:
       θ_ι^CS = min(θ_ι, θ_f),  θ_ι^GM = θ_ι − θ_ι^CS
     • θ_f is the final pruning rate associated with the CS measure (e.g. 10%)
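The schedule can be sketched as below. The paper estimates α, β, γ by fitting three known points as in [6]; that fit is not reproduced here, so the constants chosen (rate 0 at epoch 0, asymptote θ, about 99% of θ reached at the final epoch) are illustrative assumptions, not the authors' exact values:

```python
import math

def pruning_rates(epoch, total_epochs, theta, theta_f):
    """Fractional step schedule (illustrative constants, not the
    paper's exact 3-point fit): theta_e = alpha*exp(-beta*epoch) + gamma.
    Returns the epoch's overall pruning rate, the weight scaling
    factor, and the split between the CS- and GM-based criteria."""
    gamma = theta                            # asymptotic target pruning rate
    alpha = -theta                           # so the rate is 0 at epoch 0
    beta = math.log(100.0) / total_epochs    # ~99% of theta at the last epoch
    theta_e = alpha * math.exp(-beta * epoch) + gamma
    zeta_e = 1.0 - theta_e / theta           # scaling factor for pruned weights
    theta_cs = min(theta_e, theta_f)         # fraction selected by the CS measure
    theta_gm = theta_e - theta_cs            # remainder selected by the GM measure
    return theta_e, zeta_e, theta_cs, theta_gm
```

The point of ζ_ι is that selected filters are scaled down rather than zeroed outright, so their weights fade toward zero asymptotically and information loss in early epochs remains recoverable.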
 13. Experiments
     • CIFAR10 [10]: 10 classes, 32 x 32 color images, 50000 training and 10000 testing observations
     • ImageNet32 [11]: ILSVRC-2012 with images resized to 32 x 32; 1000 classes, 1281167 training and 50000 testing observations
     • GSC (ver. 0.01) [12]: 12 classes, speech utterances, 51094 training, 6798 validation and 6835 testing observations
     • Comparison with MIL [13], PFEC [14], CP [15], SFP [16], FPGM [5], ASFP [6]
     [10] Krizhevsky, A.: Learning multiple layers of features from tiny images, Tech. rep., 2009
     [11] P. Chrabaszcz, I. Loshchilov, F. Hutter: A downsampled variant of ImageNet as an alternative to the CIFAR datasets, CoRR, vol. abs/1707.08819, 2017
     [12] P. Warden: Speech commands: A dataset for limited-vocabulary speech recognition, CoRR, vol. abs/1804.03209, 2018
     [13] X. Dong et al.: More is less: A more complicated network with less inference complexity, CVPR, Honolulu, HI, USA, July 2017
     [14] H. Li et al.: Pruning filters for efficient convnets, ICLR, Toulon, France, Apr. 2017
     [15] Y. He, X. Zhang, J. Sun: Channel pruning for accelerating very deep neural networks, ICCV, Venice, Italy, Oct. 2017
     [16] Y. He et al.: Soft filter pruning for accelerating deep convolutional neural networks, IJCAI, Stockholm, Sweden, July 2018
 14. Experiments
     • Experimental setup for CIFAR10 and ImageNet32: same as in FPGM [5] and ASFP [6]
     • Images are normalized to zero mean and unit variance; data augmentation is applied (cropping, mirroring, flipping, etc.)
     • ResNet, CE loss, minibatch SGD, Nesterov momentum 0.9, batch size 128, weight decay 0.0005, ε = 200
     • Initial learning rate 0.01, divided by 5 at epochs 60, 120, 160 for CIFAR10, and by 10 every 10 epochs for ImageNet32
 15. Experiments
     • Experimental setup for GSC as in [17]
     • Log mel-spectrograms (LMSs) are used to represent the speech commands, deriving a 32 x 32 LMS for each recording: 16 kHz sampling rate, STFT with Hamming window of size 1024, hop length 512, 32 mel filterbanks, etc.
     • Augmentation: pitch shifting, mixing with background noise, etc.
     • ResNet, CE loss, minibatch SGD, Nesterov momentum 0.9, batch size 96, weight decay 0.0005, ε = 70; initial learning rate 0.01, divided by 10 at epoch 50
     [17] J. Salamon, J. P. Bello: Deep convolutional neural networks and data augmentation for environmental sound classification, IEEE Signal Process. Lett., vol. 24, no. 3, pp. 279-283, Mar. 2017
 16. Experiments
     • Correct classification rates (CCRs) on CIFAR10 with pruning rates θ = 40%, 50%

     θ = 40%     No pr.   MIL [13]  PFEC [14]  CP [15]  SFP [16]  ASFP [6]  FPGM [5]  FSDP(θ_f=10%)  FSDP(θ_f=40%)
     ResNet20    92.20%   91.43%    ---        ---      90.83%    ---       91.99%    92.02%         92.09%
     ResNet56    93.59%   ---       91.31%     90.90%   92.26%    92.44%    92.89%    93.13%         93.10%
     ResNet110   93.68%   93.44%    92.44%     ---      93.38%    93.20%    93.85%    93.91%         93.93%

     θ = 50%     FPGM [5]  FSDP(θ_f=10%)
     ResNet20    89.73%    90.16%
     ResNet56    91.79%    92.64%
     ResNet110   92.51%    93.72%

     • FSDP outperforms all other methods
     • E.g., > 1% CCR improvement over FPGM (second best method) for ResNet110, θ = 50%
 17. Experiments
     • CCRs on ImageNet32 and GSC with ResNet56 and pruning rates θ = 20%, 50%
     • Evaluation of SFP, FPGM, FSDP (selected based on their performance on CIFAR10)

     θ = 20%      No pruning  SFP [16]  FPGM [5]  FSDP(θ_f=10%)
     ImageNet32   40.79%      29.92%    37.23%    38.30%
     GSC          97.47%      94.57%    95.64%    96.22%

     θ = 50%      FPGM [5]  FSDP(θ_f=10%)
     ImageNet32   32.32%    33.23%
     GSC          92.89%    94.66%

     • FSDP outperforms both SFP and FPGM
     • On the challenging ImageNet32 dataset the performance drop of SFP is quite high; this is attributed to its l2-norm based criterion, under which a fraction of the selected filters still carry significant discriminant information
 18. Experiments
     • Visualization of FSDP (θ_f = 20%) while training a ResNet20 on CIFAR10 with θ = 20%
     • Illustration of CS measure scores for each filter at epochs 10, 40, 200 (figures from left to right)
     • Filters closer to the input seem to attain high discriminant scores (especially in the initial epochs)
     • Surviving filters of the 2nd conv layer in residual blocks (e.g., layers 11, 13, 15, 17) accumulate quite high discriminant power as training proceeds
     • After a certain number of epochs, the surviving filters in the last conv layer attain high discriminant power
 19. Summary and next steps
     • A new filter pruning approach was presented, exploiting a class-separability-based measure to estimate the importance of the filters and a fractional step strategy to prune them asymptotically
     • The proposed approach was evaluated successfully on three popular datasets (CIFAR-10, ImageNet32, GSC) for image and speech classification tasks
     • As future work, we plan to investigate the use of variable pruning rates utilizing the discriminant scores at layer level, similarly to the globally-comparing criteria in [14, 18]
     [18] P. Molchanov et al.: Pruning convolutional neural networks for resource efficient inference, ICLR, Toulon, France, Apr. 2017
 20. Thank you for your attention! Questions?
     Nikolaos Gkalelis, gkalelis@iti.gr
     Vasileios Mezaris, bmezaris@iti.gr
     Code publicly available at: https://github.com/bmezaris/fractional_step_discriminant_pruning_dcnn
     This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV
