LSTM Structured Pruning

Structured pruning of LSTMs via Eigenanalysis and Geometric
Median for Mobile Multimedia and Deep Learning Applications
N. Gkalelis, V. Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE Int. Symposium on Multimedia,
Naples, Italy (Virtual), Dec. 2020
Outline
• Problem statement
• Related work
• Layer’s pruning rate computation
• LSTM unit importance estimation
• Experiments
• Conclusions
Problem statement
• Deep learning (DL) is becoming a game changer in most industries, owing to its breakthrough classification performance in many machine learning tasks
• Example application areas: Mobile Multimedia [1], Self-driving cars [2], Edge computing [3]
[1] V-Soft Consulting: https://blog.vsoftconsulting.com/; [2] V2Gov: https://www.facebook.com/V2Gov/
[3] J. Chen, X. Ran, Deep Learning With Edge Computing: A Review, Proc. of the IEEE, Aug. 2019
• Recurrent neural networks (RNNs) have shown excellent performance in processing sequential data
• Deploying top-performing RNNs in resource-limited applications, such as mobile multimedia devices, is still difficult due to their high inference time and storage requirements
→ How can we reduce the size of RNNs while retaining their generalization performance?
Related work
• Pruning is attracting increasing attention because pruning methods achieve high compression rates while maintaining stable model performance [4, 5]
• Two main pruning categories: a) unstructured: prune individual network weights; b) structured: prune well-defined network components, e.g., DCNN filters or LSTM units
→ Models derived using structured pruning can be deployed on conventional hardware (e.g., GPUs); no special-purpose accelerators are required
[4] K. Ota, M.S. Dao, V. Mezaris, F.G.B. De Natale: Deep Learning for Mobile Multimedia: A Survey, ACM Trans. Multimedia Computing
Communications & Applications (TOMM), vol. 13, no. 3s, June 2017
[5] Y. Cheng, D. Wang, P. Zhou and T. Zhang: Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and
Challenges, IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126-136, Jan. 2018
• Structured pruning of DCNNs has been extensively studied in the literature;
structured RNN pruning is a much less investigated topic:
• In [6], Intrinsic Sparse Structures (ISS) of LSTMs are defined and a Group Lasso-
based approach is used for sparsifying the network
• In [7], LSTM parameters are constrained using an L0 norm penalty and ISSs close
to zero are pruned
→ Both [6] and [7] utilize sparsity-inducing regularizers that modify the loss function, which may lead to numerical instabilities and suboptimal solutions [8]
[6] W. Wen et al., Learning intrinsic sparse structures within long short-term memory, ICLR, 2018
[7] L. Wen et al., Structured pruning of recurrent neural networks through neuron selection, Neural Networks, Mar. 2020.
[8] H. Xu et al., Sparse algorithms are not stable: A no-free-lunch theorem, IEEE Trans. Pattern Anal. Mach. Intell., Jan. 2012.
Overview of proposed method
• Inspired by recent advances in DCNN filter pruning [9, 10], we extend [6]:
• The covariance matrix formed by the layer's responses is eigenanalyzed; its eigenvalues quantify the layer's redundancy and determine its pruning rate (as in [9] for DCNN layers)
• A Geometric Median-based (GM-based) criterion is used to identify the most redundant LSTM units (as in [10] for DCNN filters)
→ The GM-based criterion has shown superior performance over sparsity-inducing ones in the DCNN domain
[9] X. Suau, U. Zappella, and N. Apostoloff, Filter distillation for network compression, IEEE WACV, CO, USA, Mar. 2020
[10] Y. He et al., Filter pruning via Geometric median for deep convolutional neural networks acceleration, IEEE CVPR, CA, USA, Jun. 2019
Computation of layer’s pruning rate
• Suppose an annotated training set of N sequences and C classes
• The training set at the LSTM layer's output can be represented as
$\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}, \quad \mathbf{z}_k \in \mathbb{R}^H$
• zk is the hidden state vector of the k-th sequence at the last time step; it has high representational power and is often used to represent the overall input sequence; H is the number of units in the layer
• The sample covariance matrix S of the responses can be computed as
$\mathbf{S} = \sum_{k=1}^{N} (\mathbf{z}_k - \mathbf{m})(\mathbf{z}_k - \mathbf{m})^T, \quad \mathbf{m} = \frac{1}{N} \sum_{k=1}^{N} \mathbf{z}_k$
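
A minimal NumPy sketch of this step (illustrative only, not the authors' released code; Z is assumed to be an N×H array collecting the last-time-step hidden states):

    import numpy as np

    def response_covariance(Z):
        # Z: (N, H) array; row k is the last-time-step hidden state z_k
        m = Z.mean(axis=0)       # mean response m
        D = Z - m                # centered responses z_k - m
        return D.T @ D           # sum_k (z_k - m)(z_k - m)^T, shape (H, H)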
• The eigenvalues of S are computed, sorted in descending order, and normalized to sum to one:
$\lambda_1, \ldots, \lambda_H, \quad \lambda_1 \geq \ldots \geq \lambda_H \geq 0, \quad \sum_{i=1}^{H} \lambda_i = 1$
• They give insight into the redundancy of the LSTM layer: if only a small fraction of them is nonzero, we conclude that many redundant units exist in the layer
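
Continuing the sketch (again illustrative): since S is symmetric, np.linalg.eigvalsh applies; it returns eigenvalues in ascending order, so we reverse and normalize:

    def normalized_eigenvalues(S):
        lam = np.linalg.eigvalsh(S)    # ascending order, real for symmetric S
        lam = lam[::-1]                # descending: lambda_1 >= ... >= lambda_H
        lam = np.clip(lam, 0.0, None)  # guard against tiny negative round-off
        return lam / lam.sum()         # normalize to sum to one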
• We further define the cumulative energies ζj and indicator variables δi:
$\zeta_j = \sum_{i=1}^{j} \lambda_i, \quad \delta_i = \begin{cases} 1, & \text{if } \zeta_i \leq \alpha \\ 0, & \text{otherwise} \end{cases}$
• α: tuning parameter for deriving the required pruning level
• The pruning rate θ of the LSTM layer is then computed from the δ's:
$\theta = 1 - \frac{\sum_{i=1}^{H} \delta_i}{H}$
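
The rate computation itself is only a few lines; a sketch (the small tolerance on the α comparison is an implementation detail not on the slide, added to guard against floating-point round-off in the cumulative sum):

    def layer_pruning_rate(lam, alpha=0.95):
        zeta = np.cumsum(lam)              # cumulative energies zeta_j
        delta = zeta <= alpha + 1e-12      # delta_i = 1 iff zeta_i <= alpha
        return 1.0 - delta.sum() / lam.size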
• Toy example: 2 LSTM layers with 6 units each
• We compute the λi's, ζi's and δi's using α = 0.95 (the overall energy level to retain), as shown in the table below:
• 1st LSTM layer: energy is spread among many eigenvalues; the layer exhibits little redundancy; a low pruning rate is computed (θ[1] = 1 − 4/6 = 33%)
• 2nd LSTM layer: energy is concentrated in only a few eigenvalues; the layer exhibits high redundancy; a high pruning rate is computed (θ[2] = 1 − 1/6 = 83%)
• The total pruning rate is (33% + 83%)/2 = 58%; alternatively, α can be adjusted through grid search to achieve a given target pruning rate
      1st LSTM layer                      2nd LSTM layer
λi:   0.5, 0.3, 0.1, 0.05, 0.03, 0.02    0.93, 0.04, 0.02, 0.01, 0, 0
ζi:   0.5, 0.8, 0.9, 0.95, 0.98, 1       0.93, 0.97, 0.99, 1, 1, 1
δi:   1, 1, 1, 1, 0, 0                   1, 0, 0, 0, 0, 0
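
Running the pruning-rate sketch above on the toy eigenvalues reproduces these rates:

    lam1 = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])
    lam2 = np.array([0.93, 0.04, 0.02, 0.01, 0.0, 0.0])
    print(layer_pruning_rate(lam1))   # 0.333... = 1 - 4/6
    print(layer_pruning_rate(lam2))   # 0.833... = 1 - 1/6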
LSTM unit importance estimation
• Stack all LSTM layer weight matrices to form an overall weight matrix W:
$\mathbf{W} = [\mathbf{W}_{ix}, \mathbf{W}_{fx}, \mathbf{W}_{ux}, \mathbf{W}_{ox}, \mathbf{W}_{ih}, \mathbf{W}_{fh}, \mathbf{W}_{uh}, \mathbf{W}_{oh}] \in \mathbb{R}^{H \times Q}$
• H: hidden state dimensionality (number of layer units); Q = 4(H + F); F: layer's input vector dimensionality
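
A sketch of the stacking (illustrative; it assumes the eight gate matrices are available individually, four input-to-hidden of shape (H, F) and four hidden-to-hidden of shape (H, H)):

    def stack_lstm_weights(W_x, W_h):
        # W_x: list of the 4 input weight matrices, each (H, F)
        # W_h: list of the 4 recurrent weight matrices, each (H, H)
        # Result: (H, Q) with Q = 4F + 4H = 4(H + F)
        return np.concatenate(W_x + W_h, axis=1)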
• Each row of W is associated with one unit of the layer; rewrite W as:
$\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_H]^T, \quad \mathbf{w}_k \in \mathbb{R}^Q$
• A GM-based dissimilarity value [10] is derived for each unit of the LSTM layer:
$\eta_j = \sum_{k=1}^{H} \|\mathbf{w}_j - \mathbf{w}_k\|$
• A small ηj denotes that unit j is highly correlated with the other units in the layer (i.e., redundant); the units with the smallest ηj are discarded
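
A sketch of the selection step (illustrative names; for large H the pairwise-difference tensor below can be replaced by a row-by-row loop to save memory):

    def gm_dissimilarity(W):
        # eta_j = sum_k ||w_j - w_k|| over the rows of W, shape (H, Q)
        diff = W[:, None, :] - W[None, :, :]       # (H, H, Q) pairwise differences
        return np.linalg.norm(diff, axis=2).sum(axis=1)

    def units_to_prune(W, theta):
        # indices of the round(theta * H) units with the smallest eta_j
        eta = gm_dissimilarity(W)
        n_prune = int(round(theta * W.shape[0]))
        return np.argsort(eta)[:n_prune]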
Experiments
• Penn Treebank (PTB) [11]: word-level prediction, 1086k tokens, 10k classes
(unique tokens), 930k training, 74k validation and 82k testing tokens
• YouTube-8M (YT8M) [12]: multilabel concept detection, 3862 classes (semantic
concepts), more than 6 million videos, 1024- and 128-dimensional visual and
audio feature vector sequences are provided for each video
• The proposed ISS-GM is compared with ISS-GL [6] and ISS-L0 [7]
[11] M. P. Marcus, M. Marcinkiewicz, B. Santorini, Building a large annotated corpus of English: The Penn Treebank, Comput. Linguist, Jun. 1993
[12] J. Lee et al., The 2nd YouTube-8M large-scale video understanding challenge, ECCV Workshops, Munich, Germany, Sep. 2018
Experimental Setup
• PTB: as in [13], a 2-layer stacked LSTM with 1500 units per layer, output layer of size 10000, dropout keep rate 0.5; sequence length 35; 55 epochs, minibatch averaged SGD, batch size 20, initial learning rate 1, etc.
• YT8M: 1st BLSTM layer with 512 units per forward/backward direction, 2nd LSTM layer with 1024 units, output layer of size 3862; sequence length 300 frames; 10 epochs, minibatch SGD, batch size 256, initial learning rate 0.0002, etc.
• The performance is measured using the per-word perplexity (PPL) and global
average precision at 20 (GAP@20) for PTB and YT8M, respectively
[13] W. Zaremba, I. Sutskever, and O. Vinyals, Recurrent neural network regularization, CoRR, vol. abs/1409.2329, 2014
Experiments
• Evaluation results on PTB (first table) and YT8M (second table)
• Lower PPL values are better; higher GAP@20 values are better; training time (Ttr) is in hours
• ISS-GM outperforms all other methods
• It exhibits a high degree of robustness against large pruning rates (e.g., only a 1.23% GAP@20 drop for θ = 70% on YT8M)
• It is approximately 2 times slower to train than ISS-GL, due to the eigenanalysis of the covariance matrix; since training is performed offline, this limitation is considered insignificant

PTB:
Method           ISS # in (1st, 2nd)   PPL (valid., test)
baseline [13]    (1500, 1500)          (82.57, 78.57)
ISS-GL [6]       (373, 315)            (82.59, 78.65)
ISS-L0 [7]       (296, 247)            (81.62, 78.08)
ISS-GM (prop.)   (236, 297)            (81.49, 77.97)

YT8M:
Method                    GAP@20    Ttr (h)
no pruning                84.33%    6.73
ISS-GL [6] (θ=30%)        83.20%    7.82
ISS-GM (prop.) (θ=30%)    84.12%    15.4
ISS-GL [6] (θ=70%)        82.20%    7.43
ISS-GM (prop.) (θ=70%)    83.10%    14.5
Summary and next steps
• A new LSTM structured pruning approach was presented: it utilizes the sample covariance matrix of the layer's responses and a GM-based criterion to automatically derive pruning rates and discard the most redundant units
• The proposed approach was evaluated successfully on two popular datasets (PTB, YT8M), covering word-level text prediction and multilabel video classification tasks
• As future work, we plan to investigate the use of the proposed approach for pruning deeper RNN architectures, e.g., Recurrent Highway Networks [14, 15]
[14] J. G. Zilly, R. K. Srivastava, J. Koutník, J. Schmidhuber, Recurrent Highway Networks, Proc. ICML, 2017
[15] G. Pundak, T. Sainath, Highway-LSTM and Recurrent Highway Networks for Speech Recognition, Proc. Interspeech, 2017
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code will be publicly available by end of December 2020 at:
https://github.com/bmezaris/lstm_structured_pruning_geometric_median
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV