Slides for the paper titled "Structured pruning of LSTMs via Eigenanalysis and Geometric Median for Mobile Multimedia and Deep Learning Applications", by N. Gkalelis and V. Mezaris, presented at the 22nd IEEE Int. Symposium on Multimedia (ISM), Dec. 2020.
LSTM Structured Pruning
1. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Structured pruning of LSTMs via Eigenanalysis and Geometric
Median for Mobile Multimedia and Deep Learning Applications
N. Gkalelis, V. Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE Int. Symposium on Multimedia,
Naples, Italy (Virtual), Dec. 2020
Outline
• Problem statement
• Related work
• Layer’s pruning rate computation
• LSTM unit importance estimation
• Experiments
• Conclusions
Problem statement
• Deep learning (DL) is becoming a game changer in many industries due to its breakthrough classification performance in numerous machine learning tasks
• Example application areas: mobile multimedia, self-driving cars, edge computing (image credits: [1], [2], [3])
[1] V-Soft Consulting: https://blog.vsoftconsulting.com/; [2] V2Gov: https://www.facebook.com/V2Gov/
[3] J. Chen, X. Ran, Deep Learning With Edge Computing: A Review, Proc. of the IEEE, Aug. 2019
Problem statement
• Recurrent neural networks (RNNs) have shown excellent performance in processing sequential data
• Deploying top-performing RNNs in resource-limited applications, such as mobile multimedia devices, is still difficult due to their high inference time and storage requirements
How can we reduce the size of RNNs while retaining their generalization performance?
Related work
• Pruning is attracting increasing attention because pruning methods achieve high compression rates while maintaining stable model performance [4,5]
• Two main pruning categories: a) unstructured: prune individual network weights; b) structured: prune well-defined network components, e.g., DCNN filters or LSTM units
Models derived using structured pruning can be deployed on conventional hardware (e.g., GPUs); no special-purpose accelerators are required
[4] K. Ota, M.S. Dao, V. Mezaris, F.G.B. De Natale: Deep Learning for Mobile Multimedia: A Survey, ACM Trans. Multimedia Computing
Communications & Applications (TOMM), vol. 13, no. 3s, June 2017
[5] Y. Cheng, D. Wang, P. Zhou and T. Zhang: Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and
Challenges, IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126-136, Jan. 2018
Related work
• Structured pruning of DCNNs has been extensively studied in the literature;
structured RNN pruning is a much less investigated topic:
• In [6], Intrinsic Sparse Structures (ISS) of LSTMs are defined and a Group Lasso-
based approach is used for sparsifying the network
• In [7], LSTM parameters are constrained using an L0 norm penalty and ISSs close
to zero are pruned
Both [6] and [7] utilize sparsity-inducing regularizers that modify the loss function, which may lead to numerical instabilities and suboptimal solutions [8]
[6] W. Wen et al., Learning intrinsic sparse structures within long short-term memory, ICLR, 2018
[7] L. Wen et al., Structured pruning of recurrent neural networks through neuron selection, Neural Networks, Mar. 2020.
[8] H. Xu et al., Sparse algorithms are not stable: A no-free-lunch theorem, IEEE Trans. Pattern Anal. Mach. Intell., Jan. 2012.
Overview of proposed method
• Inspired by recent advances in DCNN filter pruning [9, 10], we extend [6]:
• The covariance matrix of the layer’s responses is used to compute the respective eigenvalues and quantify the layer’s redundancy and pruning rate (as in [9] for DCNN layers)
• A Geometric Median-based (GM-based) criterion is used to identify the most redundant LSTM units (as in [10] for DCNN filters)
The GM-based criterion has shown superior performance over sparsity-inducing
ones in the DCNN domain
[9] X. Suau, U. Zappella, and N. Apostoloff, Filter distillation for network compression, IEEE WACV, CO, USA, Mar. 2020
[10] Y. He et al., Filter pruning via Geometric median for deep convolutional neural networks acceleration, IEEE CVPR, CA, USA, Jun. 2019
Computation of layer’s pruning rate
• Suppose an annotated training set of N sequences and C classes
• The training set at the LSTM layer’s output can be represented as
  Z = [z_1, …, z_N], z_k ∈ ℝ^H
• z_k is the hidden state vector of the k-th sequence at the last time step; it has high representational power and is often used to represent the overall input sequence; H is the number of units in the layer
• The sample covariance matrix S of the responses can be computed as
  S = Σ_{k=1}^{N} (z_k − m)(z_k − m)^T,  where m = (1/N) Σ_{k=1}^{N} z_k
Computation of layer’s pruning rate
• The eigenvalues of S are computed, sorted in descending order, and normalized to sum to one:
  λ_1 ≥ λ_2 ≥ … ≥ λ_H ≥ 0,  Σ_{i=1}^{H} λ_i = 1
• They give insight into the redundancy of the LSTM layer: if only a small fraction of them is nonzero, we conclude that many redundant units exist in the layer
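The two steps above (sample covariance of the last-time-step hidden states, followed by eigenanalysis) can be sketched in a few lines of numpy. The function name layer_eigenvalues and the (N, H) layout of the response matrix are our own assumptions for illustration, not part of the paper's released code:

```python
import numpy as np

def layer_eigenvalues(Z):
    """Normalized eigenvalues of the covariance of LSTM layer responses.

    Z: (N, H) array whose k-th row is the hidden state z_k of the
    k-th training sequence at the last time step.
    Returns the eigenvalues sorted in descending order and normalized
    to sum to one.
    """
    m = Z.mean(axis=0)                     # sample mean m
    Zc = Z - m                             # centered responses z_k - m
    S = Zc.T @ Zc                          # sample covariance matrix S (up to scale)
    lam = np.linalg.eigvalsh(S)[::-1]      # eigenvalues of S, descending
    lam = np.clip(lam, 0.0, None)          # guard against tiny negative round-off
    return lam / lam.sum()
```

Since the eigenvalues are normalized to sum to one, the overall scale of S is immaterial to the redundancy analysis.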
Computation of layer’s pruning rate
• We further define ζ_j and δ_i as:
  ζ_j = Σ_{i=1}^{j} λ_i,  δ_i = 1 if ζ_i ≤ α, 0 otherwise
• α: tuning parameter for deriving the required pruning level
• The pruning rate θ of the LSTM layer is then computed using the δ’s:
  θ = 1 − (Σ_{i=1}^{H} δ_i) / H
Computation of layer’s pruning rate
• Toy example: 2 LSTM layers with 6 units each
• We compute the λ_i’s, ζ_i’s and δ_i’s using α = 0.95 (overall energy level to retain):

  1st LSTM layer: λ_i = 0.5, 0.3, 0.1, 0.05, 0.03, 0.02; ζ_i = 0.5, 0.8, 0.9, 0.95, 0.98, 1; δ_i = 1, 1, 1, 1, 0, 0
  2nd LSTM layer: λ_i = 0.93, 0.04, 0.02, 0.01, 0, 0; ζ_i = 0.93, 0.97, 0.99, 1, 1, 1; δ_i = 1, 0, 0, 0, 0, 0

• 1st LSTM layer: energy is spread among many eigenvalues; the layer exhibits small redundancy; a low pruning rate is computed (θ[1] = 1 − 4/6 ≈ 33%)
• 2nd LSTM layer: energy is concentrated in only a few eigenvalues; the layer exhibits high redundancy; a high pruning rate is computed (θ[2] = 1 − 1/6 ≈ 83%)
• The total pruning rate is (33% + 83%)/2 = 58%; alternatively, α can be adjusted through grid search to achieve a given target pruning rate
LSTM unit importance estimation
• Stack all LSTM layer weight matrices to form an overall weight matrix W:
  W = [W_ix, W_fx, W_ux, W_ox, W_ih, W_fh, W_uh, W_oh] ∈ ℝ^{H×Q}
• H: hidden state dimensionality (number of layer units); Q = 4(H + F); F: layer’s input vector dimensionality
LSTM unit importance estimation
• Each row of W is associated with a unit of the layer; rewrite W as:
  W = [w_1, …, w_H]^T,  w_k ∈ ℝ^Q
• Derive a GM-based dissimilarity value [10] for each LSTM layer unit:
  η_j = Σ_{k=1}^{H} ‖w_j − w_k‖
• A small η_j denotes that unit j is highly correlated with the other units in the layer (i.e., it is redundant); units with the smallest η_j are discarded
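A minimal numpy sketch of this GM-based selection, assuming W is the H×Q stacked weight matrix defined above (the function names and the pruning-count convention are ours, not the paper's):

```python
import numpy as np

def gm_dissimilarity(W):
    """eta_j = sum_k ||w_j - w_k|| over the rows w_j of the stacked matrix W (H, Q)."""
    diff = W[:, None, :] - W[None, :, :]   # all pairwise row differences, (H, H, Q)
    dist = np.linalg.norm(diff, axis=2)    # pairwise Euclidean distances, (H, H)
    return dist.sum(axis=1)                # eta_j for each of the H units

def units_to_prune(W, theta):
    """Indices of the ceil(theta * H) most redundant units (smallest eta_j)."""
    eta = gm_dissimilarity(W)
    n_prune = int(np.ceil(theta * W.shape[0]))
    return np.argsort(eta)[:n_prune]
```

Units whose weight rows lie closest to the geometric median of all rows receive the smallest η_j and are therefore pruned first.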
Experiments
• Penn Treebank (PTB) [11]: word-level prediction, 1086k tokens, 10k classes
(unique tokens), 930k training, 74k validation and 82k testing tokens
• YouTube-8M (YT8M) [12]: multilabel concept detection, 3862 classes (semantic
concepts), more than 6 million videos, 1024- and 128-dimensional visual and
audio feature vector sequences are provided for each video
• The proposed ISS-GM is compared with ISS-GL [6] and ISS-L0 [7]
[11] M. P. Marcus, M. Marcinkiewicz, B. Santorini, Building a large annotated corpus of English: The Penn Treebank, Comput. Linguist, Jun. 1993
[12] J. Lee et al., The 2nd YouTube-8M large-scale video understanding challenge, ECCV Workshops, Munich, Germany, Sep. 2018
Experimental Setup
• PTB: as in [13], 2 layer stacked LSTM, 1500 units each, output layer of size 10000,
dropout keep rate 0.5; sequence length 35; 55 epochs, minibatch averaged SGD,
batch size 20, initial learning rate 1, etc.
• YT8M: 1st BLSTM layer with 512 units per forward/backward layer, 2nd LSTM
layer with 1024 units, output layer of size 3862 units; sequence length 300
frames; 10 epochs, minibatch SGD, batch size 256, initial learning rate 0.0002,
etc.
• The performance is measured using the per-word perplexity (PPL) and global
average precision at 20 (GAP@20) for PTB and YT8M, respectively
[13] W. Zaremba, I. Sutskever, and O. Vinyals, Recurrent neural network regularization, CoRR, vol. abs/1409.2329, 2014
Experiments
• Evaluation results on PTB (top) and YT8M (bottom)
• Lower PPL values are better; higher GAP@20 values are better; training time (Ttr) is in hours

PTB:
Method | ISS # (1st, 2nd) | PPL (valid., test)
baseline [13] | (1500, 1500) | (82.57, 78.57)
ISS-GL [6] | (373, 315) | (82.59, 78.65)
ISS-L0 [7] | (296, 247) | (81.62, 78.08)
ISS-GM (prop.) | (236, 297) | (81.49, 77.97)

YT8M:
Method | GAP@20 | Ttr (h)
no pruning | 84.33% | 6.73
ISS-GL [6] (θ = 30%) | 83.20% | 7.82
ISS-GM (prop.) (θ = 30%) | 84.12% | 15.4
ISS-GL [6] (θ = 70%) | 82.20% | 7.43
ISS-GM (prop.) (θ = 70%) | 83.10% | 14.5

• ISS-GM outperforms all other methods
• It exhibits a high degree of robustness against large pruning rates (e.g., only a 1.23% GAP@20 drop for θ = 70%)
• It is approximately 2 times slower to train than ISS-GL due to the eigenanalysis of the covariance matrix; since training is performed off-line, this limitation is considered insignificant
Summary and next steps
• A new LSTM structured pruning approach was presented: it utilizes the sample covariance matrix of the layer’s responses and a GM-based criterion to automatically derive pruning rates and discard the most redundant units
• The proposed approach was evaluated successfully on two popular datasets (PTB, YT8M) for word-level text prediction and multilabel video classification tasks
• As future work, we plan to investigate the use of the proposed approach for pruning deeper RNN architectures, e.g., Recurrent Highway Networks [14, 15]
[14] J. G. Zilly, R. K. Srivastava, J. Koutnı́k, J. Schmidhuber, Recurrent Highway Networks, Proc.ICML, 2017
[15] G. Pundak, T. Sainath, Highway-LSTM and Recurrent Highway Networks for Speech Recognition, Proc. Interspeech, 2017
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code will be publicly available by end of December 2020 at:
https://github.com/bmezaris/lstm_structured_pruning_geometric_median
This work was supported by the EU’s Horizon 2020 research and innovation programme
under grant agreement H2020-780656 ReTV