Structured pruning of LSTMs via Eigenanalysis and Geometric
Median for Mobile Multimedia and Deep Learning Applications
N. Gkalelis, V. Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE Int. Symposium on Multimedia,
Naples, Italy (Virtual), Dec. 2020
Outline
• Problem statement
• Related work
• Layer’s pruning rate computation
• LSTM unit importance estimation
• Experiments
• Conclusions
Problem statement

• Deep learning (DL) is becoming a game changer in most industries, owing to its breakthrough classification performance in many machine learning tasks
• Example application areas: mobile multimedia, self-driving cars, edge computing (illustrated with images; credits [1], [2], [3])

[1] V-Soft Consulting: https://blog.vsoftconsulting.com/
[2] V2Gov: https://www.facebook.com/V2Gov/
[3] J. Chen, X. Ran, Deep Learning With Edge Computing: A Review, Proc. of the IEEE, Aug. 2019
Problem statement

• Recurrent neural networks (RNNs) have shown excellent performance in processing sequential data
• Deploying top-performing RNNs in resource-limited applications such as mobile multimedia devices is still difficult, due to their high inference time and storage requirements
→ How can we reduce the size of RNNs while retaining their generalization performance?
Related work
• Pruning is receiving increasing attention because pruning methods achieve high compression rates while maintaining stable model performance [4, 5]
• Two main pruning categories: a) unstructured: prune individual network weights; b) structured: prune well-defined network components, e.g., DCNN filters or LSTM units
→ Models derived using structured pruning can be deployed on conventional hardware (e.g., GPUs); no special-purpose accelerators are required
[4] K. Ota, M.S. Dao, V. Mezaris, F.G.B. De Natale: Deep Learning for Mobile Multimedia: A Survey, ACM Trans. Multimedia Computing
Communications & Applications (TOMM), vol. 13, no. 3s, June 2017
[5] Y. Cheng, D. Wang, P. Zhou and T. Zhang: Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and
Challenges, IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126-136, Jan. 2018
Related work
• Structured pruning of DCNNs has been extensively studied in the literature;
structured RNN pruning is a much less investigated topic:
• In [6], Intrinsic Sparse Structures (ISS) of LSTMs are defined and a Group Lasso-
based approach is used for sparsifying the network
• In [7], LSTM parameters are constrained using an L0 norm penalty and ISSs close
to zero are pruned
→ Both [6] and [7] utilize sparsity-inducing regularizers that modify the loss function, which may lead to numerical instabilities and suboptimal solutions [8]
[6] W. Wen et al., Learning intrinsic sparse structures within long short-term memory, ICLR, 2018
[7] L. Wen et al., Structured pruning of recurrent neural networks through neuron selection, Neural Networks, Mar. 2020.
[8] H. Xu et al., Sparse algorithms are not stable: A no-free-lunch theorem, IEEE Trans. Pattern Anal. Mach. Intell., Jan. 2012.
Overview of proposed method
• Inspired by recent advances in DCNN filter pruning [9, 10], we extend [6]:
• The covariance matrix formed by the layer's responses is used to compute the respective eigenvalues and to quantify the layer's redundancy and pruning rate (as in [9] for DCNN layers)
• A Geometric Median-based (GM-based) criterion is used to identify the most redundant LSTM units (as in [10] for DCNN filters)
→ The GM-based criterion has shown superior performance over sparsity-inducing ones in the DCNN domain
[9] X. Suau, U. Zappella, and N. Apostoloff, Filter distillation for network compression, IEEE WACV, CO, USA, Mar. 2020
[10] Y. He et al., Filter pruning via Geometric median for deep convolutional neural networks acceleration, IEEE CVPR, CA, USA, Jun. 2019
Computation of layer’s pruning rate
• Suppose an annotated training set of N sequences and C classes
• The training set at the LSTM layer's output can be represented as

$$\mathbf{Z} = [\mathbf{z}_1, \dots, \mathbf{z}_N], \quad \mathbf{z}_k \in \mathbb{R}^H$$

• $\mathbf{z}_k$ is the hidden state vector of the k-th sequence at the last time step; it has high representational power and is often used to represent the overall input sequence; H is the number of units in the layer
• The sample covariance matrix $\mathbf{S}$ of the responses can be computed as

$$\mathbf{S} = \sum_{k=1}^{N} (\mathbf{z}_k - \mathbf{m})(\mathbf{z}_k - \mathbf{m})^T, \quad \mathbf{m} = \frac{1}{N} \sum_{k=1}^{N} \mathbf{z}_k$$
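A minimal NumPy sketch of this step (our illustration, not the authors' released code): `Z` is a hypothetical (N, H) array holding one last-time-step hidden state per training sequence, assumed to have been collected with a forward pass over the training set.

```python
# Minimal sketch (not the authors' code): scatter matrix of the layer's
# last-time-step hidden states. Z is a hypothetical (N, H) NumPy array with
# one response vector z_k per training sequence.
import numpy as np

def response_scatter(Z: np.ndarray) -> np.ndarray:
    """Return S = sum_k (z_k - m)(z_k - m)^T, with m the mean response."""
    m = Z.mean(axis=0)      # m = (1/N) sum_k z_k
    D = Z - m               # centered responses, one row per sequence
    return D.T @ D          # H x H matrix; the slide's S
```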
Computation of layer’s pruning rate
• The eigenvalues of $\mathbf{S}$ are computed, sorted into descending order, and normalized to sum to one:

$$\lambda_1, \dots, \lambda_H, \quad \lambda_1 \ge \dots \ge \lambda_H \ge 0, \quad \sum_{i=1}^{H} \lambda_i = 1$$
• The eigenvalues give insight into the redundancy of the LSTM layer: if only a small fraction of them is nonzero, we conclude that many redundant units exist in the layer
Computation of layer’s pruning rate
• We further define the cumulative sums ζi and indicators δi as:

$$\zeta_j = \sum_{i=1}^{j} \lambda_i, \quad \delta_i = \begin{cases} 1, & \text{if } \zeta_i \le \alpha \\ 0, & \text{otherwise} \end{cases}$$
• α: tuning parameter for deriving the required pruning level
• The pruning rate θ of the LSTM layer is then computed using the δ's:

$$\theta = 1 - \frac{\sum_{i=1}^{H} \delta_i}{H}$$
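Continuing the sketch above, the pruning rate follows from an eigendecomposition of S; `layer_pruning_rate` is a hypothetical helper name, and `np.linalg.eigh` is used since S is symmetric.

```python
# Minimal sketch (not the authors' code): pruning rate of one LSTM layer from
# the eigenvalues of the scatter matrix S computed in the previous sketch.
import numpy as np

def layer_pruning_rate(S: np.ndarray, alpha: float = 0.95) -> float:
    """Return theta = 1 - (sum_i delta_i) / H for a layer with H units."""
    lam = np.linalg.eigh(S)[0][::-1]        # eigenvalues, descending order
    lam = lam / lam.sum()                   # normalize: sum_i lambda_i = 1
    zeta = np.cumsum(lam)                   # zeta_j = sum_{i<=j} lambda_i
    delta = zeta <= alpha                   # delta_i = 1 iff zeta_i <= alpha
    return 1.0 - delta.sum() / len(lam)     # pruning rate theta
```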
Computation of layer’s pruning rate
• Toy example: 2 LSTM layers with 6 units in each layer
• We compute the λi's, ζi's and δi's using α = 0.95 (the overall energy level to retain); see the table below:
• 1st LSTM layer (left column): the energy is spread among many eigenvalues; the layer exhibits low redundancy, so a low pruning rate is computed (θ[1] = 1 − 4/6 ≈ 33%)
• 2nd LSTM layer (right column): the energy is concentrated in only a few eigenvalues; the layer exhibits high redundancy, so a high pruning rate is computed (θ[2] = 1 − 1/6 ≈ 83%)
• The total pruning rate is (33% + 83%)/2 ≈ 58%; alternatively, α can be adjusted through grid search to achieve a given target pruning rate
       1st LSTM layer                      2nd LSTM layer
λi:    0.5, 0.3, 0.1, 0.05, 0.03, 0.02    0.93, 0.04, 0.02, 0.01, 0, 0
ζi:    0.5, 0.8, 0.9, 0.95, 0.98, 1       0.93, 0.97, 0.99, 1, 1, 1
δi:    1, 1, 1, 1, 0, 0                   1, 0, 0, 0, 0, 0
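The toy example can be checked directly from the slide's eigenvalues, e.g. with the following snippet:

```python
# Reproducing the toy example's pruning rates (alpha = 0.95).
import numpy as np

for lam in ([0.5, 0.3, 0.1, 0.05, 0.03, 0.02],    # 1st layer: low redundancy
            [0.93, 0.04, 0.02, 0.01, 0.0, 0.0]):  # 2nd layer: high redundancy
    delta = np.cumsum(lam) <= 0.95                # delta_i = 1 iff zeta_i <= alpha
    print(f"theta = {1.0 - delta.sum() / len(lam):.0%}")  # -> 33% and 83%
```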
LSTM unit importance estimation
• Stack all the LSTM layer's weight matrices to form an overall weight matrix $\mathbf{W}$:

$$\mathbf{W} = [\mathbf{W}_{ix}, \mathbf{W}_{fx}, \mathbf{W}_{ux}, \mathbf{W}_{ox}, \mathbf{W}_{ih}, \mathbf{W}_{fh}, \mathbf{W}_{uh}, \mathbf{W}_{oh}] \in \mathbb{R}^{H \times Q}$$

• H: hidden state dimensionality (number of layer units); Q = 4(H + F); F: dimensionality of the layer's input vector
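As an illustration only (the method is framework-agnostic), here is a minimal sketch of this stacking for a PyTorch nn.LSTM layer, whose `weight_ih_l{k}` (4H × F) and `weight_hh_l{k}` (4H × H) store the input, forget, cell and output gate weights as four H-row blocks, matching W_ix…W_ox and W_ih…W_oh above.

```python
# Minimal sketch assuming a PyTorch nn.LSTM (an illustrative choice, not the
# authors' code): build W of shape (H, Q), Q = 4(H + F), so that row h
# collects every weight attached to hidden unit h.
import torch

def stack_unit_weights(lstm: torch.nn.LSTM, layer: int = 0) -> torch.Tensor:
    W_ih = getattr(lstm, f"weight_ih_l{layer}")  # (4H, F): 4 gate blocks, row-wise
    W_hh = getattr(lstm, f"weight_hh_l{layer}")  # (4H, H)
    H = lstm.hidden_size
    blocks = torch.split(W_ih, H) + torch.split(W_hh, H)  # 8 blocks of H rows each
    return torch.cat(blocks, dim=1)              # (H, 4F + 4H) = (H, Q)
```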
LSTM unit importance estimation
• Each row of $\mathbf{W}$ is associated with one of the layer's units; rewrite it as

$$\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_H]^T, \quad \mathbf{w}_k \in \mathbb{R}^Q$$
• Derive a GM-based dissimilarity value [10] for each of the LSTM layer's units:

$$\eta_j = \sum_{k=1}^{H} \| \mathbf{w}_j - \mathbf{w}_k \|$$
• A small $\eta_j$ denotes that unit j is highly correlated with the other units in the layer (i.e., redundant); the units with the smallest $\eta_j$ are discarded
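A minimal sketch of the GM-based scoring on the stacked matrix W from the previous sketch; rounding θ·H to obtain the number of discarded units is an illustrative choice, not necessarily the authors' exact procedure.

```python
# Minimal sketch (not the authors' code) of the GM-based criterion.
import torch

def units_to_prune(W: torch.Tensor, theta: float) -> torch.Tensor:
    """W: (H, Q) per-unit weights; return indices of the most redundant units."""
    eta = torch.cdist(W, W).sum(dim=1)   # eta_j = sum_k ||w_j - w_k||_2
    n_prune = round(theta * W.shape[0])  # number of units to discard
    return torch.argsort(eta)[:n_prune]  # smallest eta -> most redundant
```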
Experiments
• Penn Treebank (PTB) [11]: word-level prediction, 1086k tokens, 10k classes
(unique tokens), 930k training, 74k validation and 82k testing tokens
• YouTube-8M (YT8M) [12]: multilabel concept detection, 3862 classes (semantic
concepts), more than 6 million videos, 1024- and 128-dimensional visual and
audio feature vector sequences are provided for each video
• The proposed ISS-GM is compared with ISS-GL [6] and ISS-L0 [7]
[11] M. P. Marcus, M. Marcinkiewicz, B. Santorini, Building a large annotated corpus of English: The Penn Treebank, Comput. Linguist, Jun. 1993
[12] J. Lee et al., The 2nd YouTube-8M large-scale video understanding challenge, ECCV Workshops, Munich, Germany, Sep. 2018
Experimental Setup
• PTB: as in [13], 2-layer stacked LSTM, 1500 units each, output layer of size 10000,
dropout keep rate 0.5; sequence length 35; 55 epochs, minibatch averaged SGD,
batch size 20, initial learning rate 1, etc.
• YT8M: 1st BLSTM layer with 512 units per forward/backward layer, 2nd LSTM
layer with 1024 units, output layer of size 3862 units; sequence length 300
frames; 10 epochs, minibatch SGD, batch size 256, initial learning rate 0.0002,
etc.
• The performance is measured using the per-word perplexity (PPL) and global
average precision at 20 (GAP@20) for PTB and YT8M, respectively
[13] W. Zaremba, I. Sutskever, and O. Vinyals, Recurrent neural network regularization, CoRR, vol. abs/1409.2329, 2014
Experiments
• Evaluation results on PTB (top table) and YT8M (bottom table); lower PPL values are better, higher GAP@20 values are better; training time (Ttr) is in hours

Method            ISS # in (1st, 2nd)   PPL (valid., test)
baseline [13]     (1500, 1500)          (82.57, 78.57)
ISS-GL [6]        (373, 315)            (82.59, 78.65)
ISS-L0 [7]        (296, 247)            (81.62, 78.08)
ISS-GM (prop.)    (236, 297)            (81.49, 77.97)

Method                     GAP@20    Ttr (h)
no pruning                 84.33%    6.73
ISS-GL [6] (θ=30%)         83.20%    7.82
ISS-GM (prop.) (θ=30%)     84.12%    15.4
ISS-GL [6] (θ=70%)         82.20%    7.43
ISS-GM (prop.) (θ=70%)     83.10%    14.5

• ISS-GM outperforms all other methods
• It exhibits a high degree of robustness against large pruning rates (e.g., only a 1.23% GAP@20 drop for θ = 70%)
• It is approx. 2 times slower to train than ISS-GL, due to the eigenanalysis of the covariance matrix; since training is performed off-line, this limitation is considered insignificant
Summary and next steps
• A new LSTM structured pruning approach was presented: it utilizes the sample covariance matrix of the layer's responses and a GM-based criterion to automatically derive pruning rates and discard the most redundant units
• The proposed approach was evaluated successfully on two popular datasets (PTB, YT8M), for word-level prediction in text and multilabel video classification tasks
• As future work, we plan to investigate the use of the proposed approach for pruning deeper RNN architectures, e.g., Recurrent Highway Networks [14, 15]
[14] J. G. Zilly, R. K. Srivastava, J. Koutník, J. Schmidhuber, Recurrent Highway Networks, Proc. ICML, 2017
[15] G. Pundak, T. Sainath, Highway-LSTM and Recurrent Highway Networks for Speech Recognition, Proc. Interspeech, 2017
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code will be publicly available by end of December 2020 at:
https://github.com/bmezaris/lstm_structured_pruning_geometric_median
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV