- 1. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
  Fractional step discriminant pruning: a filter pruning framework for deep convolutional neural networks
  N. Gkalelis, V. Mezaris, CERTH-ITI, Thermi - Thessaloniki, Greece
  IEEE Int. Conf. on Multimedia & Expo Workshops, 7th MMC, London, United Kingdom, July 2020
- 2. Outline
  • Problem statement
  • Related work
  • Filter importance measure
  • Fractional step pruning strategy
  • Experiments
  • Conclusions
- 3. Problem statement
  • Deep convolutional neural networks (DCNNs) are seeing significant commercial deployment due to their breakthrough classification performance in many machine learning tasks
  • Multimedia understanding
  • Self-driving cars
  • Edge computing
  Image credits: V2Gov, [1]
  [1] Chen, J., Ran, X.: Deep Learning With Edge Computing: A Review, Proc. of the IEEE, vol. 107, no. 8, Aug. 2019
- 4. Problem statement
  • The deployment of DCNNs in resource-limited or real-time applications is still challenging due to their high inference time and storage requirements
  • DCNNs are highly overparametrized, and methods that reduce their capacity may even benefit their performance [2]
  How can the size of DCNNs be reduced while retaining their generalization performance?
  [2] Arora, S., Ge, R., Neyshabur, B., Zhang, Y.: Stronger generalization bounds for deep nets via a compression approach, ICML, 2018
- 5. Related work
  • DCNN compression and acceleration methods can be categorized into: a) pruning, b) low-rank factorization, c) compact conv filters, d) knowledge distillation [3, 4]
  • Filter pruning is getting increasing attention because it: a) achieves high compression rates with small performance degradation, b) is complementary to the methods of the other 3 categories
  • It consists of: a) a filter importance estimation criterion, usually "smaller-norm-less-important", b) a pruning strategy, usually an iterative one: training, pruning, retraining, …
  [3] K. Ota, M.S. Dao, V. Mezaris, F.G.B. De Natale: Deep Learning for Mobile Multimedia: A Survey, ACM Trans. Multimedia Computing Communications & Applications (TOMM), vol. 13, no. 3s, June 2017
  [4] Y. Cheng, D. Wang, P. Zhou and T. Zhang: Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126-136, Jan. 2018
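The "smaller-norm-less-important" criterion mentioned above can be sketched as follows. This is a minimal illustration, not the code of any cited method: the function names, the toy `layer`, and the truncating rule for the number of selected filters are assumptions made for the example.

```python
import math

def l2_norm(filter_weights):
    """Euclidean norm of one flattened filter."""
    return math.sqrt(sum(w * w for w in filter_weights))

def smallest_norm_filters(layer, pruning_rate):
    """Indices of the filters with the smallest l2-norm.

    `layer` is a list of flattened filters; the number of selected
    filters is truncated to an integer (an assumption of this sketch).
    """
    count = int(len(layer) * pruning_rate)
    order = sorted(range(len(layer)), key=lambda j: l2_norm(layer[j]))
    return order[:count]

# Hypothetical layer with four 2-weight filters.
layer = [[0.5, -0.5], [3.0, 4.0], [0.1, 0.0], [1.0, 1.0]]
selected = smallest_norm_filters(layer, 0.5)
print(selected)
```

The criterion looks only at the filter weights, so it cannot tell whether a small-norm filter still produces features that separate the classes.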
- 6. Related work
  • In [5], it is shown that pruning filters with a small l2-norm may have a negative impact on the network's performance
  • FPGM is proposed, which uses a Geometric Median (GM) based measure
  • FPGM selects a fraction of the filters using the l2-norm (usually 10%), and the rest using the GM-based measure
  An iterative strategy is used (training, pruning, retraining, …) where all filters corresponding to the target pruning rate are pruned at each iteration
  [5] Y. He, P. Liu, Z. Wang, Z. Hu and Y. Yang: Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration, CVPR, 2019
- 7. Related work
  • In [6], it is shown that the iterative pruning strategy, where all selected filters are set to zero from the first iteration, may lead to unrecoverable information loss
  • Asymptotic pruning strategy: an iterative strategy in which the number of selected filters at each iteration approaches the target pruning rate asymptotically
  The l2-norm measure is used to select the filters at each iteration
  [6] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan and Y. Yang: Asymptotic Soft Filter Pruning for Deep Convolutional Neural Networks, IEEE Trans. on Cybernetics, pp. 1-11, Aug. 2019
- 8. Overview of proposed method
  • Motivated by limitations in recent works [5, 6] and related research findings in shallow learning [7, 8, 9], we extend [6] by:
  • Replacing the l2-norm-based criterion with: a) a Class-Separability (CS) based criterion that exploits the labelling information in annotated training datasets [7, 8, 9], b) a GM-based criterion [5]
  • Applying a fractional step pruning strategy: not only the number of selected filters but also their weights vary asymptotically to their target value
  [7] N. Gkalelis, V. Mezaris, I. Kompatsiaris and T. Stathaki: Mixture Subclass Discriminant Analysis Link to Restricted Gaussian Model and Other Generalizations, IEEE Trans. Neural Networks and Learning Systems, vol. 24, no. 1, pp. 8-21, Jan. 2013
  [8] R. Lotlikar and R. Kothari: Fractional-step dimensionality reduction, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp. 623-627, June 2000
  [9] K. Fukunaga: Introduction to statistical pattern recognition (2nd ed.), Academic Press Professional, Inc., San Diego, CA, USA, 1990
- 9. Importance measure 1: CS-based
  • Suppose an annotated training dataset of n observations and m classes
  • Let $X_k^{(i,j)}$ be the feature map of the k-th observation at the j-th filter of the i-th layer
  • The feature maps are vectorized and stacked to form the data matrix $\mathbf{X}^{(i,j)}$ for filter (i,j):
    $\mathbf{X}^{(i,j)} = [\mathbf{x}_1^{(i,j)}, \dots, \mathbf{x}_n^{(i,j)}], \quad \mathbf{x}_k^{(i,j)} = \mathrm{vec}(X_k^{(i,j)})$
  • A filter discriminant score is then computed using
    $\eta^{(i,j)} = \mathrm{tr}(\mathbf{S}^{(i,j)}), \quad \mathbf{S}^{(i,j)} = \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} (\boldsymbol{\mu}_p^{(i,j)} - \boldsymbol{\mu}_q^{(i,j)}) (\boldsymbol{\mu}_p^{(i,j)} - \boldsymbol{\mu}_q^{(i,j)})^T$
    where $\mathbf{S}^{(i,j)}$ is the between-class scatter matrix for filter (i,j) (it can be computed efficiently; see the paper for details) and $\boldsymbol{\mu}_p^{(i,j)}$ is the mean vector of class p (class labels are used to compute the means)
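One way to see why the score is cheap to evaluate, under the definition of the scatter matrix as written on the slide (the paper's own efficient computation may differ): the trace is linear, and the trace of an outer product $\mathbf{a}\mathbf{a}^T$ equals $\mathbf{a}^T\mathbf{a}$, so the matrix itself never has to be formed.

```latex
\begin{aligned}
\eta^{(i,j)} = \mathrm{tr}\big(\mathbf{S}^{(i,j)}\big)
  &= \sum_{p=1}^{m-1} \sum_{q=p+1}^{m}
     \mathrm{tr}\Big( \big(\boldsymbol{\mu}_p^{(i,j)} - \boldsymbol{\mu}_q^{(i,j)}\big)
                      \big(\boldsymbol{\mu}_p^{(i,j)} - \boldsymbol{\mu}_q^{(i,j)}\big)^T \Big) \\
  &= \sum_{p=1}^{m-1} \sum_{q=p+1}^{m}
     \big\| \boldsymbol{\mu}_p^{(i,j)} - \boldsymbol{\mu}_q^{(i,j)} \big\|_2^2
\end{aligned}
```

The score is therefore the sum of squared Euclidean distances between all pairs of class means of the filter's vectorized feature maps.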
- 10. Importance measure 1: CS-based
  • $\mathrm{tr}(\mathbf{S}^{(i,j)})$ quantifies the distance among class distributions using the features produced by the corresponding filter [7, 8, 9]
  • A large value indicates that the filter extracts discriminant features for separating the classes
  • In contrast, filters that extract noise or features irrelevant to the classification task attain very small CS values and can be discarded safely
  [Figure: two filters with weight vectors $\mathbf{v}^{(i,1)}$ and $\mathbf{v}^{(i,2)}$, where $\|\mathbf{v}^{(i,2)}\| > \|\mathbf{v}^{(i,1)}\|$, and the class means $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ of their features. $\mathrm{tr}(\mathbf{S}^{(i,1)})$ is large; $\mathrm{tr}(\mathbf{S}^{(i,2)})$ is very small, so despite a possibly large l2-norm, filter (i,2) can be safely discarded]
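The behaviour described above can be reproduced on toy data. The sketch below assumes the score is the sum of squared distances between all pairs of class means (which equals the trace of a sum of outer products of mean differences); the function names and the two toy feature sets are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations

def class_means(features, labels):
    """Mean vectorized feature map per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def cs_score(features, labels):
    """Sum over class pairs of the squared distance between class means."""
    means = class_means(features, labels)
    score = 0.0
    for p, q in combinations(sorted(means), 2):
        score += sum((a - b) ** 2 for a, b in zip(means[p], means[q]))
    return score

# Vectorized feature maps of 4 observations (2 per class) for two filters.
labels = [0, 0, 1, 1]
separating = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
uninformative = [[1.0, 5.0], [3.0, 7.0], [3.0, 5.0], [1.0, 7.0]]

print(cs_score(separating, labels))     # class means differ: high score
print(cs_score(uninformative, labels))  # class means coincide: zero score
```

The second filter produces features with larger values, yet its class means coincide, so its score is zero and it would be the first candidate for pruning.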
- 11. Importance measure 2: GM-based
  • For large pruning rates, the CS-based criterion may eliminate filters that extract features with small but still important discriminant information
  • The GM-based measure identifies the most replaceable filters in a layer [5]:
    $\eta^{(i,j)} = \sum_{o=1}^{c_i} \|\mathbf{v}^{(i,j)} - \mathbf{v}^{(i,o)}\|$
  • Combined selection strategy: select a fraction of filters using the CS-based measure and another fraction using the GM-based one
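A minimal sketch of the GM-based measure follows, assuming the score of a filter is the sum of its Euclidean distances to all filters in the same layer and that the filters with the smallest score are treated as the most replaceable. The function names and the toy `layer` are hypothetical.

```python
import math

def gm_scores(filters):
    """Sum of Euclidean distances from each filter to every filter in the layer."""
    return [sum(math.dist(v, o) for o in filters) for v in filters]

def most_replaceable(filters, count):
    """Indices of the `count` filters with the smallest GM-based score."""
    scores = gm_scores(filters)
    return sorted(range(len(filters)), key=lambda j: scores[j])[:count]

# Hypothetical layer: three filters close to each other and one outlier.
layer = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
scores = gm_scores(layer)
print(most_replaceable(layer, 2))
```

Filter 0 has the smallest l2-norm, but filters 1 and 2 sit closer to the rest of the layer and are selected first. The outlier gets the largest score and is kept.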
- 12. Fractional step pruning strategy
  • Let ε, θ be the total number of epochs and the target pruning rate
  • The pruning rate $\theta_\iota$ and scaling factor $\zeta_\iota$ at epoch ι are computed as:
    $\theta_\iota = \alpha e^{-\beta\iota} + \gamma, \quad \zeta_\iota = 1 - \theta_\iota / \theta$
  • The parameters α, β, γ are estimated using 3 known points, similarly to [6]
  • The individual pruning rates for the CS- and GM-based criteria are
    $\theta_\iota^{CS} = \min(\theta_\iota, \theta_f), \quad \theta_\iota^{GM} = \theta_\iota - \theta_\iota^{CS}$
  • $\theta_f$ is the final pruning rate associated with the CS measure (e.g. 10%)
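The schedule can be sketched as below. The slide only says that α, β, γ are fitted to three known points; the particular points used here, (0, 0), (ε/8, 3θ/4) and (ε, θ), are an assumption chosen for illustration, as are the function names. β is found by bisection, and the CS share of the rate is capped at θ_f with the remainder assigned to the GM criterion.

```python
import math

def fit_schedule(total_epochs, target_rate, mid_epoch, mid_rate):
    """Fit theta(i) = alpha*exp(-beta*i) + gamma through (0, 0),
    (mid_epoch, mid_rate) and (total_epochs, target_rate)."""
    ratio = mid_rate / target_rate

    def gap(beta):
        # (1 - e^{-beta*mid_epoch}) / (1 - e^{-beta*total_epochs}) - ratio
        return math.expm1(-beta * mid_epoch) / math.expm1(-beta * total_epochs) - ratio

    lo, hi = 1e-9, 1.0           # gap(lo) < 0 < gap(hi) for the points used below
    for _ in range(100):         # bisection on beta
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    alpha = target_rate / math.expm1(-beta * total_epochs)
    gamma = -alpha               # follows from theta(0) = 0
    return alpha, beta, gamma

def rates_at(epoch, params, target_rate, cs_final_rate):
    """Return (theta, zeta, CS share, GM share) at the given epoch."""
    alpha, beta, gamma = params
    theta = alpha * math.exp(-beta * epoch) + gamma
    zeta = 1.0 - theta / target_rate
    theta_cs = min(theta, cs_final_rate)
    theta_gm = theta - theta_cs
    return theta, zeta, theta_cs, theta_gm

# epsilon = 200 epochs, theta = 40%, theta_f = 10%; assumed middle point (25, 30%).
params = fit_schedule(200, 0.4, 25, 0.3)
for epoch in (0, 25, 100, 200):
    print(epoch, rates_at(epoch, params, 0.4, 0.1))
```

With these points the pruning rate rises quickly in the early epochs and flattens towards θ, while the scaling factor decays from 1 towards 0, so selected filters are shrunk gradually instead of being zeroed at once.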
- 13. Experiments
  • CIFAR10 [10]: 10 classes, 32 x 32 color images, 50000 training and 10000 testing observations
  • ImageNet32 [11]: ILSVRC-2012 with images resized to 32 x 32; 1000 classes, 32 x 32 color images, 1281167 training and 50000 testing observations
  • GSC (ver. 0.01) [12]: 12 classes, speech utterances, 51094 training, 6798 validation and 6835 testing observations
  • Comparison with MIL [13], PFEC [14], CP [15], SFP [16], FPGM [5], ASFP [6]
  [10] A. Krizhevsky: Learning multiple layers of features from tiny images, Tech. rep., 2009
  [11] P. Chrabaszcz, I. Loshchilov and F. Hutter: A downsampled variant of ImageNet as an alternative to the CIFAR datasets, CoRR, vol. abs/1707.08819, 2017
  [12] P. Warden: Speech commands: A dataset for limited-vocabulary speech recognition, CoRR, vol. abs/1804.03209, 2018
  [13] X. Dong et al.: More is less: A more complicated network with less inference complexity, CVPR, Honolulu, HI, USA, July 2017
  [14] H. Li et al.: Pruning filters for efficient convnets, ICLR, Toulon, France, Apr. 2017
  [15] Y. He, X. Zhang and J. Sun: Channel pruning for accelerating very deep neural networks, ICCV, Venice, Italy, Oct. 2017
  [16] Y. He et al.: Soft filter pruning for accelerating deep convolutional neural networks, IJCAI, Stockholm, Sweden, July 2018
- 14. Experiments
  • The experimental setup for CIFAR10 and ImageNet32 is the same as in FPGM [5] and ASFP [6]
  • Images are normalized to zero mean and unit variance, and data augmentation is applied (cropping, mirroring, flipping, etc.)
  • ResNet, CE loss, minibatch SGD, Nesterov momentum 0.9, batch size 128, weight decay 0.0005, ε = 200
  • The initial learning rate is 0.01, divided by 5 at epochs 60, 120, 160 for CIFAR10, and by 10 every 10 epochs for ImageNet32
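The two learning rate schedules can be written as plain functions of the epoch. This is a sketch under two assumptions not stated on the slide: epochs are counted from 0, and a division takes effect from the milestone epoch onward. The function names are hypothetical.

```python
def lr_cifar10(epoch, base_lr=0.01):
    """Divide the base rate by 5 at each milestone that has been reached."""
    milestones = (60, 120, 160)
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (5 ** drops)

def lr_imagenet32(epoch, base_lr=0.01):
    """Divide the base rate by 10 every 10 epochs."""
    return base_lr / (10 ** (epoch // 10))

for epoch in (0, 59, 60, 120, 160):
    print(epoch, lr_cifar10(epoch))
```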
- 15. Experiments
  • The experimental setup for GSC is as in [17]
  • Log mel-spectrograms (LMSs) are used to represent the speech commands, yielding a 32 x 32 LMS for each recording: 16 kHz sampling rate, STFT with a Hamming window of size 1024, hop length 512, 32 mel filterbanks, etc.
  • Augmentation: pitch shifting, mixing with background noise, etc.
  • ResNet, CE loss, minibatch SGD, Nesterov momentum 0.9, batch size 96, weight decay 0.0005, ε = 70; the initial learning rate is 0.01 and is divided by 10 at epoch 50
  [17] J. Salamon and J. P. Bello: Deep convolutional neural networks and data augmentation for environmental sound classification, IEEE Signal Process. Lett., vol. 24, no. 3, pp. 279-283, Mar. 2017
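A quick check of how the stated STFT settings can lead to a 32 x 32 representation. Two assumptions are made here that the slide does not state: each recording is one second long, and the STFT is centered (the signal is padded by half a window on each side, so the frame count is 1 + samples // hop). The helper name is hypothetical.

```python
def stft_frames(num_samples, hop_length, win_length=1024, centered=True):
    """Number of STFT frames for a signal of `num_samples` samples."""
    if centered:
        # Padding by win_length // 2 on both sides gives one frame per hop plus one.
        return 1 + num_samples // hop_length
    return 1 + (num_samples - win_length) // hop_length

sample_rate = 16_000   # 16 kHz
clip_seconds = 1       # assumed clip length
frames = stft_frames(sample_rate * clip_seconds, hop_length=512)
lms_shape = (32, frames)  # (mel bands, time frames)
print(lms_shape)
```

Without centering, the same settings would give 30 frames, so the 32 time steps depend on the padding convention or on some other step covered by the "etc." on the slide.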
- 16. Experiments
  • Correct classification rates (CCRs) on CIFAR10 with pruning rates θ = 40%, 50%

  | θ = 40% | No pr. | MIL [13] | PFEC [14] | CP [15] | SFP [16] | ASFP [6] | FPGM [5] | FSDP ($\theta_f$ = 10%) | FSDP ($\theta_f$ = 40%) |
  |---|---|---|---|---|---|---|---|---|---|
  | ResNet20 | 92.2% | 91.43% | --- | --- | 90.83% | --- | 91.99% | 92.02% | 92.09% |
  | ResNet56 | 93.59% | --- | 91.31% | 90.90% | 92.26% | 92.44% | 92.89% | 93.13% | 93.1% |
  | ResNet110 | 93.68% | 93.44% | 92.44% | --- | 93.38% | 93.2% | 93.85% | 93.91% | 93.93% |

  | θ = 50% | FPGM | FSDP ($\theta_f$ = 10%) |
  |---|---|---|
  | ResNet20 | 89.73% | 90.16% |
  | ResNet56 | 91.79% | 92.64% |
  | ResNet110 | 92.51% | 93.72% |

  • FSDP outperforms all other methods
  • E.g., > 1% CCR improvement over FPGM (second best method) for ResNet110, θ = 50%
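The size of the improvement at θ = 50% can be read off the slide's numbers with a few lines of arithmetic. The values below are copied from the FPGM and FSDP columns; the dictionary layout is only for illustration.

```python
# CCRs (%) on CIFAR10 at theta = 50%, copied from the slide.
ccr = {
    "ResNet20": {"FPGM": 89.73, "FSDP": 90.16},
    "ResNet56": {"FPGM": 91.79, "FSDP": 92.64},
    "ResNet110": {"FPGM": 92.51, "FSDP": 93.72},
}

# Absolute gain of FSDP over FPGM in percentage points, rounded to 2 decimals.
gains = {net: round(v["FSDP"] - v["FPGM"], 2) for net, v in ccr.items()}
for net, gain in gains.items():
    print(net, gain)
```

The gain grows with network depth in this table, and only ResNet110 crosses one percentage point.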
- 17. Experiments
  • CCRs on ImageNet32 and GSC with ResNet56 and pruning rates θ = 20%, 50%
  • Evaluation of SFP, FPGM, FSDP (selected based on the performance results on CIFAR10)

  | θ = 20% | No pruning | SFP [16] | FPGM [5] | FSDP ($\theta_f$ = 10%) |
  |---|---|---|---|---|
  | ImageNet32 | 40.79% | 29.92% | 37.23% | 38.3% |
  | GSC | 97.47% | 94.57% | 95.64% | 96.22% |

  | θ = 50% | FPGM | FSDP ($\theta_f$ = 10%) |
  |---|---|---|
  | ImageNet32 | 32.32% | 33.23% |
  | GSC | 92.89% | 94.66% |

  • FSDP outperforms both SFP and FPGM
  • On the challenging ImageNet32 dataset the performance drop of SFP is quite high; this is attributed to the l2-norm based criterion, under which a fraction of the selected filters still carry significant discriminant information
- 18. Experiments
  • Visualization of FSDP ($\theta_f$ = 20%) while training a ResNet20 on CIFAR10, with θ = 20%
  • Illustration of CS measure scores for each filter at epochs 10, 40, 200 (figures from left to right)
  • Filters closer to the input seem to attain high discriminant scores (especially in the initial epochs)
  • Surviving filters of the 2nd conv layer in residual blocks (e.g., 11, 13, 15, 17) accumulate quite high discriminant power as training proceeds
  • After a certain number of epochs, the surviving filters in the last conv layer attain high discriminant power
- 19. Summary and next steps
  • A new filter pruning approach was presented, which uses a class-separability-based measure to estimate the importance of the filters and a fractional step strategy to prune them asymptotically
  • The proposed approach was evaluated successfully on three popular datasets (CIFAR10, ImageNet32, GSC) for image and speech classification tasks
  • As future work, we plan to investigate the use of variable pruning rates based on the discriminant scores at layer level, similarly to the globally-comparing criteria in [14, 18]
  [18] P. Molchanov et al.: Pruning convolutional neural networks for resource efficient inference, ICLR, Toulon, France, Apr. 2017
- 20. Thank you for your attention! Questions?
  Nikolaos Gkalelis, gkalelis@iti.gr
  Vasileios Mezaris, bmezaris@iti.gr
  Code publicly available at: https://github.com/bmezaris/fractional_step_discriminant_pruning_dcnn
  This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV