Copyright©2019 NTT Corp. All Rights Reserved.
Network Implosion:
Effective Model Compression for ResNets via
Static Layer Pruning and Retraining
Yasutoshi	Ida,	Yasuhiro	Fujiwara	
NTT Software Innovation Center, Japan
2Copyright©2019 NTT Corp. All Rights Reserved.
Background:
Convolutional Neural Networks are used in many applications
• Image classification, object detection, segmentation…
• The inference is performed by forward propagation.
[Figure: forward propagation from input to output through the layers]
3Copyright©2019 NTT Corp. All Rights Reserved.
Background:
Convolutional Neural Networks have many layers
• Convolutional Neural Networks (CNNs) achieve high accuracy in many applications by stacking many layers.
• ResNet is a standard CNN-based model.
[Figure: "Revolution of Depth" — ImageNet classification top-5 error (%)]
ILSVRC'10: 28.2 (shallow), ILSVRC'11: 25.8 (shallow), ILSVRC'12 AlexNet: 16.4 (8 layers), ILSVRC'13: 11.7 (8 layers), ILSVRC'14 VGG: 7.3 (19 layers), ILSVRC'14 GoogleNet: 6.7 (22 layers), ILSVRC'15 ResNet: 3.57 (152 layers)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.
https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
4Copyright©2019 NTT Corp. All Rights Reserved.
Background:
Many layers incur a long processing time for inference
• Many layers incur high computation costs, such as time and memory consumption, for the inference (forward propagation).
•  Reducing the inference costs is important for service deployment.
[Figure: forward propagation through many layers needs a long time]
5Copyright©2019 NTT Corp. All Rights Reserved.
Challenge:
Erasing multiple layers without degrading accuracy
• Erasing layers to speed up forward propagation / reduce the model size.
• Most of the previous methods sacrifice:
•  accuracy [Huang et al., ECCV 2018][Yu et al., CVPR 2018]
•  memory consumption [Veit et al., ECCV 2018][Wu et al., CVPR 2018]
• Can we erase layers without sacrificing accuracy and memory consumption?
6Copyright©2019 NTT Corp. All Rights Reserved.
Preliminary:
Convolutional layer and Residual Unit
• Convolutional layer: filters slide over images/activation maps
• Residual Unit: convolutional layers with an identity map
[Figure: a 5x5x3 filter convolves (slides) over all spatial locations of a 32x32x3 image, producing a 28x28x1 activation map (cs231n, Lecture 5)]
[Figure: full pre-activation Residual Unit: BN, ReLU, and weight (convolution) layers with an identity shortcut]
The l-th Residual Unit is x_{l+1} = x_l + F(x_l), (1), where x_l is the input to the l-th Residual Unit and F(·) is the nonlinear map consisting of convolutional layers, batch normalizations, and ReLUs; the shortcut is the identity map.
http://cs231n.github.io/convolutional-networks/
https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
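As a concrete illustration of Equation (1), a full pre-activation Residual Unit can be written as follows. This is a minimal PyTorch sketch added for this slide, not the authors' implementation; the channel count and kernel size are arbitrary example choices.

import torch.nn as nn

class ResidualUnit(nn.Module):
    """Full pre-activation Residual Unit: x_{l+1} = x_l + F(x_l)."""
    def __init__(self, channels):
        super().__init__()
        # F(.): BN -> ReLU -> conv -> BN -> ReLU -> conv
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # identity map (shortcut) + nonlinear map F(x)
        return x + self.f(x)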
7Copyright©2019 NTT Corp. All Rights Reserved.
Preliminary:
Residual Network (ResNet)
• ResNet stacks Residual Units to build a deep structure.
• ResNet is used as a standard model for computer vision tasks.
[Figure: a ResNet stacks Residual Units (full pre-activation: BN, ReLU, and weight layers with identity shortcuts) between input and output; forward propagation flows through the stacked units]
https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
8Copyright©2019 NTT Corp. All Rights Reserved.
Problem Description:
Layer-level pruning for ResNet
• Our problem is layer-level pruning.
•  We consider the standard image classification task, but our method can be used in other tasks such as detection and segmentation.
[Figure: different pruning granularities for a convolutional layer with 3 convolutional filters of size 3x3x3; this work targets layer-level pruning]
Cheng et al., Recent Advances in Efficient Computation of Deep Convolutional Neural Networks, 2018.
9Copyright©2019 NTT Corp. All Rights Reserved.
Problem Description:
Layer-level pruning for ResNet
• The key points of the problem are as follows:
1) How to select the layers that will be erased.
2) How to keep the accuracy.
10Copyright©2019 NTT Corp. All Rights Reserved.
Proposed solution:
1) How to select layers
• Introducing a priority into the Residual Unit
• We can select unimportant Residual Units according to the values of |w_{l}|.
•  A small |w_{l}| scales down the signal of the nonlinear map.
•  We can erase the Residual Unit by erasing the nonlinear map.
Residual Unit: x_{l+1} = x_l + F(x_l)
Residual Unit with priority: x_{l+1} = x_l + w_l F(x_l)
The priority w_l is a scalar that can be trained by back propagation. If |w_l| is small, it scales down the output of F(·); in other words, F(·) has little effect on the result, so we can erase the F(·) whose w_l has a small absolute value.
[Figure: full pre-activation Residual Unit diagrams, without and with the priority w_l]
https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
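A minimal PyTorch sketch (not the authors' code) of the priority-weighted Residual Unit x_{l+1} = x_l + w_l F(x_l), together with a helper that ranks units by |w_l|; the class and function names are illustrative.

import torch
import torch.nn as nn

class WeightedResidualUnit(nn.Module):
    """Residual Unit with a learnable scalar priority: x_{l+1} = x_l + w_l * F(x_l)."""
    def __init__(self, f_module):
        super().__init__()
        self.f = f_module                     # nonlinear map F(.): convolutions, BNs, ReLUs
        self.w = nn.Parameter(torch.ones(1))  # priority w_l, trained by back propagation

    def forward(self, x):
        return x + self.w * self.f(x)

def least_important_unit(units):
    """Return the index of the Residual Unit whose priority |w_l| is smallest."""
    return min(range(len(units)), key=lambda i: units[i].w.abs().item())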
12Copyright©2019 NTT Corp. All Rights Reserved.
Proposed solution:
2) How to keep the accuracy
• Re-training after erasing a Residual Unit
•  Re-training is a traditional strategy for pruning methods [LeCun et al., NeurIPS 1989]
Repeat: train ResNet -> erase the Residual Unit with the smallest priority |w_{l}| -> re-train ResNet with a large learning rate (sketched below).
Key points:
1. use a large learning rate for re-training
2. erase one Residual Unit at a time
3. do not erase Residual Units right after downsampling or channel increasing
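The erase-and-retrain loop above could be sketched as follows. The train, evaluate, and is_protected hooks are hypothetical placeholders (standing for ordinary training with a large learning rate, validation-accuracy measurement, and the rule that units right after downsampling or channel increase are never erased); this is an illustrative sketch under those assumptions, not the authors' released code.

from typing import Callable, List

def network_implosion(units: List["WeightedResidualUnit"],
                      target_num_units: int,
                      train: Callable[[], None],      # retrains the current model with a large learning rate
                      evaluate: Callable[[], float],  # returns validation accuracy of the current model
                      is_protected: Callable[[object], bool] = lambda u: False):
    """Sketch of the train / erase / re-train loop (hypothetical interface)."""
    train()                                  # initial training of the full ResNet
    baseline_acc = evaluate()
    while len(units) > target_num_units:
        # select the erasable Residual Unit with the smallest priority |w_l|
        candidates = [i for i, u in enumerate(units) if not is_protected(u)]
        idx = min(candidates, key=lambda i: units[i].w.abs().item())
        erased = units.pop(idx)              # erase one Residual Unit at a time
        train()                              # re-train with a large learning rate
        acc = evaluate()
        if acc < baseline_acc:               # stop once accuracy drops; put the last unit back
            units.insert(idx, erased)
            break
    return units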
13Copyright©2019 NTT Corp. All Rights Reserved.
Proposed algorithm:
Network Implosion
[Figure: analogy between a building implosion and "ResNet implosion" — through repeated training, erasing a Residual Unit, and re-training, the ResNet gradually implodes (shrinks)]
https://en.wikipedia.org/wiki/Building_implosion
14Copyright©2019 NTT Corp. All Rights Reserved.
Evaluation:
Setting
• Task: image classification
• Datasets: CIFAR10/100, ILSVRC2012 ImageNet
• Model: ResNet56 for CIFAR10/100, ResNet50 for ImageNet
• Each Residual Unit has 3 convolutional layers.
• Metrics:
1) Tradeoff between accuracy and # of layers.
2) Computation costs such as processing time and model size.
• Compared with the standard ResNet and Knowledge Distillation (teacher-student training).
• Other hyperparameters are described in detail in the paper.
15Copyright©2019 NTT Corp. All Rights Reserved.
Evaluation:
We could erase layers even if we use a real-world dataset
• The number of layers is reduced to 58 ~ 76 % without degrading accuracy.
• The original ResNet degrades in accuracy when we reduce its layers.
• Teacher-student training can keep the accuracy on CIFAR10, but cannot on the other datasets.
56 layers -> 32 layers (Cifar-10), 56 layers -> 35 layers (Cifar-100), 50 layers -> 38 layers (ImageNet)
[Figure 1: accuracy (%) vs. the number of layers on Cifar-10, Cifar-100, and ImageNet for ResNet, teacher-student, and Network Implosion; the red dotted lines represent the accuracies of the initial models in our approach]
16Copyright©2019 NTT Corp. All Rights Reserved.
Evaluation:
We could reduce computation costs by erasing layers
• Computation costs are reduced to the following fractions of the originals without degrading accuracy:
• # of layers: 58 ~ 76 %
• # of multiply-accumulate operations (MACs): 61 ~ 79 %
• time of forward propagation: 61 ~ 77 %
• # of parameters: 70 ~ 94 %
Table 1. The computation costs for the test phase after erasing layers without accuracy loss. The 56- and 50-layer models are the original models for Cifar-10/100 and ImageNet, respectively.
dataset     # of layers   accuracy (%)   # of MACs      forward (msec)   backward (msec)   # of parameters
Cifar-10    56            92.88          8.19 x 10^7    6.584            12.93             585.9K
            32            93.05          4.99 x 10^7    3.970            7.721             409.1K
Cifar-100   56            71.83          8.65 x 10^7    6.203            13.36             613.6K
            35            71.99          5.44 x 10^7    4.350            8.075             555.3K
ImageNet    50            75.89          4.11 x 10^9    29.95            59.51             25.55M
            38            76.12          3.23 x 10^9    22.97            46.53             23.80M
The inference time is reduced to 61 ~ 77 % of the original without degrading accuracy or increasing the model size.
17Copyright©2019 NTT Corp. All Rights Reserved.
Summary:
• Reducing the # of layers reduces the inference time.
• The layer-erasure and re-training scheme is effective.
• The # of layers can be reduced without accuracy drops.
• The inference time can be reduced to 61 ~ 77 % in our experiments.
18Copyright©2019 NTT Corp. All Rights Reserved.
Appendix:
19Copyright©2019 NTT Corp. All Rights Reserved.
Theoretical analysis:
The generalization error bound can be tight
• Theorem 1 (informal): we can obtain a tight upper bound of the generalization error by erasing layers from a trained ResNet when the following condition holds:
• The condition holds probabilistically (because it utilizes a PAC bound).
• In other words, we can start re-training the ResNet from good initial parameters in terms of the generalization error bound.
Theorem 1. Let $\bar{E}_g(\cdot)$ be an upper bound of the generalization error, and $\rho$ be a fixed margin. Suppose that Lemma 6 holds and $E^\rho_e(f) < E^\rho_e(f')$, where $f \in \hat{\mathcal{F}}_L$ is a trained $L$-layered FC-ResNet classifier such that $\hat{W}_{l'} > 1$ for the multi-label classification problem, and $f' \in \hat{\mathcal{F}}_{[L]\setminus l'}$ is the classifier obtained by erasing the $l'$-th Residual Unit from $f$. For $\forall \delta > 0$ and $\forall f \in \mathcal{F}$, when the condition
$E^\rho_e(f') - E^\rho_e(f) < \frac{8M(2M-1)}{\rho}\,(\bar{R}_m(\hat{\mathcal{F}}_L) - \bar{R}_m(\hat{\mathcal{F}}_{[L]\setminus l'}))$
holds, we have the following bound with probability $(1-\delta)^2$:
$\bar{E}_g(f') < \bar{E}_g(f)$. (12)
Here $E^\rho_e(f)$ and $E^\rho_e(f')$ are the training errors before and after the erasure, and $\bar{R}_m(\hat{\mathcal{F}}_L)$ and $\bar{R}_m(\hat{\mathcal{F}}_{[L]\setminus l'})$ are the upper bounds of the Rademacher averages before and after the erasure.
20Copyright©2019 NTT Corp. All Rights Reserved.
Additional results:
ResNet1001
[Figure: the accuracies of 1001-layer models on Cifar-10 and Cifar-100. The red lines represent accuracies of the initial models in our approach (Network Implosion); the blue dashed lines represent accuracies of the original 1001-layer ResNets reported in (He et al. 2016). Our method uses fewer layers but achieves higher accuracies than the original ResNets.]
[Table: memory consumption (MB) with 1001-layer initial models — Cifar-10: ResNet 85.6, NI 42.6; Cifar-100: ResNet 85.9, NI 42.8 (NI = Network Implosion)]
[Figure: histogram of the weights w_l in the 1001-layer model after training]
[Table: average processing times (sec) on Tiny-ImageNet — ResNet: forward 0.04146, forward+backward 0.1635; NI: forward 0.02129, forward+backward 0.09480]
21Copyright©2019 NTT Corp. All Rights Reserved.
Additional results:
Tiny-ImageNet dataset
[Figure 3: the accuracies on Tiny-ImageNet. The red lines represent accuracies of the initial models in our approach (Network Implosion), which achieves higher accuracies than the original ResNets (blue dashed line).]
• 200 classes
• each class has 500 images
• image size: 64*64*3
22Copyright©2019 NTT Corp. All Rights Reserved.
Additional results:
# of layers for each stage
Table 7: The numbers of layers in each stage after erasing layers without accuracy loss. The 56- and 50-layer models are the original models for Cifar-10/100 and ImageNet, respectively. The layers in Stage 1, near the input, are aggressively erased.
dataset     total # of layers   accuracy (%)   Stage 1   Stage 2   Stage 3   Stage 4
Cifar-10    56                  92.88          18        18        18        -
            32                  93.05          3         15        12        -
Cifar-100   56                  71.83          18        18        18        -
            35                  71.99          3         12        18        -
ImageNet    50                  75.89          9         12        18        9
            38                  76.12          6         6         15        9
• Layers near the input are aggressively erased.
• Layers near the output remain.
More Related Content

What's hot

PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
Jinwon Lee
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
Jinwon Lee
 
20150703.journal club
20150703.journal club20150703.journal club
20150703.journal club
Hayaru SHOUNO
 
ShuffleNet - PR054
ShuffleNet - PR054ShuffleNet - PR054
ShuffleNet - PR054
Jinwon Lee
 
Parallel convolutional neural network
Parallel  convolutional neural networkParallel  convolutional neural network
Parallel convolutional neural network
Abdullah Khan Zehady
 
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
IJERA Editor
 
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral ImagesBand Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
IDES Editor
 
Motion estimation overview
Motion estimation overviewMotion estimation overview
Motion estimation overview
Yoss Cohen
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
SungminYou
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
suga93
 
Cnn method
Cnn methodCnn method
Cnn method
AmirSajedi1
 
Deeplab
DeeplabDeeplab
Deeplab
Cheng-You Lu
 
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
grssieee
 
Deep Belief nets
Deep Belief netsDeep Belief nets
Deep Belief nets
butest
 
Rotman Lens Performance Analysis
Rotman Lens Performance AnalysisRotman Lens Performance Analysis
Rotman Lens Performance Analysis
IDES Editor
 
Deep Belief Networks
Deep Belief NetworksDeep Belief Networks
Deep Belief Networks
Hasan H Topcu
 
Galgo f
Galgo fGalgo f
Dsc
DscDsc

What's hot (18)

PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
20150703.journal club
20150703.journal club20150703.journal club
20150703.journal club
 
ShuffleNet - PR054
ShuffleNet - PR054ShuffleNet - PR054
ShuffleNet - PR054
 
Parallel convolutional neural network
Parallel  convolutional neural networkParallel  convolutional neural network
Parallel convolutional neural network
 
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
 
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral ImagesBand Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
 
Motion estimation overview
Motion estimation overviewMotion estimation overview
Motion estimation overview
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
 
Cnn method
Cnn methodCnn method
Cnn method
 
Deeplab
DeeplabDeeplab
Deeplab
 
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
 
Deep Belief nets
Deep Belief netsDeep Belief nets
Deep Belief nets
 
Rotman Lens Performance Analysis
Rotman Lens Performance AnalysisRotman Lens Performance Analysis
Rotman Lens Performance Analysis
 
Deep Belief Networks
Deep Belief NetworksDeep Belief Networks
Deep Belief Networks
 
Galgo f
Galgo fGalgo f
Galgo f
 
Dsc
DscDsc
Dsc
 

Similar to Network Implosion: Effective Model Compression for ResNets via Static Layer Pruning and Retraining

2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
ali hassan
 
convolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningconvolutional_neural_networks in deep learning
convolutional_neural_networks in deep learning
ssusere5ddd6
 
1801.06434
1801.064341801.06434
1801.06434
emil_laurence
 
MobileNet V3
MobileNet V3MobileNet V3
MobileNet V3
Wonbeom Jang
 
1409.1556.pdf
1409.1556.pdf1409.1556.pdf
1409.1556.pdf
Zuhriddin1
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
Pierre de Lacaze
 
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
csandit
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
Mohamed Loey
 
Spine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationSpine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localization
Devansh16
 
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITIONVERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
Willy Marroquin (WillyDevNET)
 
ResNet.pptx
ResNet.pptxResNet.pptx
ResNet.pptx
ssuser2624f71
 
N ns 1
N ns 1N ns 1
N ns 1
Thy Selaroth
 
Lecture 5: Convolutional Neural Network Models
Lecture 5: Convolutional Neural Network ModelsLecture 5: Convolutional Neural Network Models
Lecture 5: Convolutional Neural Network Models
Mohamed Loey
 
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
taeseon ryu
 
ResNet.pptx
ResNet.pptxResNet.pptx
ResNet.pptx
ssuser2624f71
 
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_ReportSaptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Sitakanta Mishra
 
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
IJNSA Journal
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
CNN.pptx
CNN.pptxCNN.pptx
CNN.pptx
AbrarRana10
 
IRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET- Autonomous Quadrotor Control using Convolutional Neural NetworksIRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET Journal
 

Similar to Network Implosion: Effective Model Compression for ResNets via Static Layer Pruning and Retraining (20)

2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
 
convolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningconvolutional_neural_networks in deep learning
convolutional_neural_networks in deep learning
 
1801.06434
1801.064341801.06434
1801.06434
 
MobileNet V3
MobileNet V3MobileNet V3
MobileNet V3
 
1409.1556.pdf
1409.1556.pdf1409.1556.pdf
1409.1556.pdf
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
 
Spine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationSpine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localization
 
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITIONVERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
 
ResNet.pptx
ResNet.pptxResNet.pptx
ResNet.pptx
 
N ns 1
N ns 1N ns 1
N ns 1
 
Lecture 5: Convolutional Neural Network Models
Lecture 5: Convolutional Neural Network ModelsLecture 5: Convolutional Neural Network Models
Lecture 5: Convolutional Neural Network Models
 
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
 
ResNet.pptx
ResNet.pptxResNet.pptx
ResNet.pptx
 
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_ReportSaptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
 
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
 
CNN.pptx
CNN.pptxCNN.pptx
CNN.pptx
 
IRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET- Autonomous Quadrotor Control using Convolutional Neural NetworksIRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
 

More from NTT Software Innovation Center

A Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between BusinessesA Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between Businesses
NTT Software Innovation Center
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
NTT Software Innovation Center
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
NTT Software Innovation Center
 
不揮発WALバッファ
不揮発WALバッファ不揮発WALバッファ
不揮発WALバッファ
NTT Software Innovation Center
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
NTT Software Innovation Center
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
NTT Software Innovation Center
 
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
NTT Software Innovation Center
 
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
NTT Software Innovation Center
 
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
NTT Software Innovation Center
 
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
NTT Software Innovation Center
 
Why and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategyWhy and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategy
NTT Software Innovation Center
 
外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法
NTT Software Innovation Center
 
デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題
NTT Software Innovation Center
 
Building images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitBuilding images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKit
NTT Software Innovation Center
 
Real-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility servicesReal-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility services
NTT Software Innovation Center
 
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
NTT Software Innovation Center
 
統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組
NTT Software Innovation Center
 
MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針
NTT Software Innovation Center
 
OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向
NTT Software Innovation Center
 
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービスNTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTT Software Innovation Center
 

More from NTT Software Innovation Center (20)

A Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between BusinessesA Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between Businesses
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
 
不揮発WALバッファ
不揮発WALバッファ不揮発WALバッファ
不揮発WALバッファ
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
 
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
 
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
 
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
 
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
 
Why and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategyWhy and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategy
 
外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法
 
デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題
 
Building images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitBuilding images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKit
 
Real-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility servicesReal-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility services
 
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
 
統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組
 
MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針
 
OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向
 
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービスNTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
 

Recently uploaded

UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
amsjournal
 
Transformers design and coooling methods
Transformers design and coooling methodsTransformers design and coooling methods
Transformers design and coooling methods
Roger Rozario
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
VICTOR MAESTRE RAMIREZ
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
mamamaam477
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
MIGUELANGEL966976
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
Anant Corporation
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
KrishnaveniKrishnara1
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
JamalHussainArman
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 

Recently uploaded (20)

UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
 
Transformers design and coooling methods
Transformers design and coooling methodsTransformers design and coooling methods
Transformers design and coooling methods
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 

Network Implosion: Effective Model Compression for ResNets via Static Layer Pruning and Retraining

  • 1. Copyright©2019 NTT Corp. All Rights Reserved. Network Implosion: Effec4ve Model Compression for ResNets via Sta4c Layer Pruning and Retraining Yasutoshi Ida, Yasuhiro Fujiwara NTT So3ware Innova7on Center, Japan
  • 2. 2Copyright©2019 NTT Corp. All Rights Reserved. Background: Convolu(onal Neural Networks are used in many applica(ons • Image classifica(on, object detec(on, segmenta(on… • The inference is performed by forward propaga4on. input output ・・・・ ・・・・ layers forward propaga(on
  • 3. 3Copyright©2019 NTT Corp. All Rights Reserved. Background: Convolu(onal Neural Networks have many layers • Convolu(onal Neural Networks (CNNs) achieve high accuracy in many applica(ons by stacking many layers. • ResNet is a standard CNN-based model. Revolution of Depth 3.57 6.7 7.3 11.7 16.4 25.8 28.2 ILSVRC'15 ResNet ILSVRC'14 GoogleNet ILSVRC'14 VGG ILSVRC'13 ILSVRC'12 AlexNet ILSVRC'11 ILSVRC'10 ImageNet Classification top-5 error (%) shallow8 layers 19 layers22 layers 152 layers Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015. 8 layers shallow deep very deep ultra deep hCps://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
  • 4. 4Copyright©2019 NTT Corp. All Rights Reserved. Background: Many layers incur long processing (me of inference • Many layers incur high computa(on costs such as (me and memory consump(on for the inference (forward propaga(on). •  Reducing the inference costs is important for the service deployment. input output ・・・・ ・・・・ ・・・・ ・・・・ many layers forward propaga(on need long (me for forward propaga(on
  • 5. 5Copyright©2019 NTT Corp. All Rights Reserved. Challenge: Erasing mul(ple layers without degrading accuracy • Erasing layers to speed up forward propaga4on/reduce the model size. • The most of previous methods sacrifice: •  accuracy [Huang et al., ECCV 2018][Yu et al., CVPR 2018] •  memory consump(on [Veit et al., ECCV 2018][Wu et al., CVPR 2018] • Can we erase layers without sacrificing accuracy and memory consump4on?
  • 6. 6Copyright©2019 NTT Corp. All Rights Reserved. Preliminary: Convolu(onal layer and Residual Unit • Convolu4onal layer: filters slides on images/ac(va(on maps • Residual Unit: convolu(on layers with iden(ty map Fei-Fei Li & Justin Johnson & Serena Yeung April 18, 2017Lecture 5 -Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 5 - April 18, 201732 32 32 3 Convolution Layer 32x32x3 image 5x5x3 filter convolve (slide) over all spatial locations activation map 1 28 28 ase of Highway networks that can also stack many layers by introducing gating per, we focuse on ResNet because our idea is suitable for ResNet as we will roduce Residual Networks and some evidences that explain why we can erase orks ResNets) are CNN-based models that have blocks called Residual Units. ResNets Units and consist deep architectures. The l-th Residual Unit described in is xl+1 = xl + F(xl), (1) to the l-th Residual Unit. F(·) is a module that consists of convolutions, batch Rectified Linear Units (ReLUs). Therefore, Residual Unit consists of identity ingdi↵erentactivation ResNet-164 5.93 6.50 6.14 5.91 5.46 ReLU weight BN ReLU weight BN addition xl xl+1 ht U ht U ly on (e)fullpre-activation iden(ty map nonlinear map: Convolu(onal layers, Batch normaliza(ons, and ReLUs. hCp://cs231n.github.io/convolu(onal-networks/ hCps://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
  • 7. Preliminary: Residual Network (ResNet)
    • ResNet stacks Residual Units to build a deep structure; forward propagation runs from input to output through the stacked units (see the sketch below).
    • ResNet is used as a standard model for computer vision tasks.
    [Figure: pre-activation Residual Unit variants (BN, ReLU, weight layers, and the addition with the identity shortcut) and their CIFAR-10 errors for ResNet-110/164, from the ICML 2016 deep residual networks tutorial.]
    https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
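A minimal sketch (PyTorch assumed) of a pre-activation Residual Unit, x_{l+1} = x_l + F(x_l) as in Equation (1) quoted on slide 10, and of stacking several units into a deep trunk. The channel width (64) and the number of stacked units (18) are illustrative assumptions, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """x_{l+1} = x_l + F(x_l), where F = BN -> ReLU -> conv -> BN -> ReLU -> conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # identity map plus nonlinear map

# Stacking Residual Units builds the deep structure traversed by forward propagation.
trunk = nn.Sequential(*[PreActResidualUnit(64) for _ in range(18)])
out = trunk(torch.randn(1, 64, 32, 32))
```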
  • 8. Problem Description: Layer-level pruning for ResNet
    • Our problem is layer-level pruning.
    • We consider the standard image classification task, but our method can be used in other tasks such as detection and segmentation.
    [Figure: different pruning methods for a convolutional layer with three 3×3×3 filters; this work targets layer-level pruning. From Cheng et al., "Recent Advances in Efficient Computation of Deep Convolutional Neural Networks", 2018.]
  • 9. Problem Description: Layer-level pruning for ResNet
    • The key points of the problem are as follows: 1) how to select the layers to be erased, and 2) how to keep the accuracy.
  • 10. Proposed solution: 1) How to select layers
    • Introduce a priority into each Residual Unit: x_{l+1} = x_l + w_l F(x_l), where the priority w_l is a scalar that can be learned by back propagation (the plain Residual Unit is x_{l+1} = x_l + F(x_l), Equation (1)).
    • We can select unimportant Residual Units according to the values of |w_{l}|:
      • a small |w_{l}| scales down the signal of the nonlinear map F(·), so F(·) has little effect on the result;
      • we can therefore erase such a Residual Unit by erasing its nonlinear map, which leaves x_{l+1} = x_l.
    A sketch of a priority-weighted Residual Unit follows below.
    https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
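A minimal sketch (PyTorch assumed) of a Residual Unit with a trainable scalar priority, x_{l+1} = x_l + w_l F(x_l). Initializing w_l to 1 is an assumption for illustration, not necessarily the paper's choice.

```python
import torch
import torch.nn as nn

class WeightedResidualUnit(nn.Module):
    """Residual Unit with priority: x_{l+1} = x_l + w_l * F(x_l)."""
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f                               # nonlinear map F(.): convs, BNs, ReLUs
        self.w = nn.Parameter(torch.ones(1))     # scalar priority w_l, learned by backprop

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.w * self.f(x)

# Units whose trained |w_l| is small scale down F(.) and are candidates for erasure;
# erasing F(.) from such a unit leaves x_{l+1} = x_l (the identity map only).
```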
  • 12. Proposed solution: 2) How to keep the accuracy
    • Re-training after erasing a Residual Unit.
      • Re-training is a traditional strategy for pruning methods [LeCun et al., NeurIPS 1989].
    • The procedure repeats: train ResNet → erase the Residual Unit with the smallest priority |w_{l}| → re-train ResNet with a large learning rate → repeat.
    • Key points: 1. use a large learning rate for re-training; 2. erase one Residual Unit at a time; 3. do not erase Residual Units right after downsampling or channel increases.
    A code sketch of this loop follows below.
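A sketch of the train/erase/re-train loop described on this slide, reusing the WeightedResidualUnit from the previous sketch. The names train_fn, eval_fn, target_units, and acc_floor are hypothetical stand-ins, and the check that protects units right after downsampling or channel increases (key point 3) is omitted for brevity.

```python
import torch.nn as nn

def network_implosion(trunk: nn.ModuleList, train_fn, eval_fn,
                      target_units: int, acc_floor: float) -> nn.ModuleList:
    """Repeatedly erase the Residual Unit with the smallest |w_l| and re-train."""
    train_fn(trunk)                                        # initial training
    while True:
        erasable = [i for i, u in enumerate(trunk) if hasattr(u, "w")]
        if len(erasable) <= target_units:
            break
        # key point 2: erase exactly one unit at a time, the one with the smallest |w_l|
        idx = min(erasable, key=lambda i: trunk[i].w.detach().abs().item())
        trunk[idx] = nn.Identity()                         # x_{l+1} = x_l after erasure
        train_fn(trunk, large_lr=True)                     # key point 1: large learning rate
        if eval_fn(trunk) < acc_floor:                     # stop once accuracy degrades
            break
    return trunk
```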
  • 13. Proposed algorithm: Network Implosion
    [Figure: analogy with a building implosion; the ResNet "implodes" step by step through repeated training, implosion (layer erasure), and re-training.]
    https://en.wikipedia.org/wiki/Building_implosion
  • 14. Evaluation: Setting
    • Task: image classification.
    • Datasets: CIFAR-10/100, ILSVRC2012 ImageNet.
    • Models: ResNet-56 for CIFAR-10/100, ResNet-50 for ImageNet; each Residual Unit has 3 convolutional layers.
    • Metrics: 1) trade-off between accuracy and the number of layers; 2) computation costs such as processing time and model size.
    • Compared with the standard ResNet and Knowledge Distillation (teacher-student training).
    • Other hyperparameters are described in detail in the paper.
  • 15. Evaluation: We could erase layers even on a real-world dataset
    • The number of layers is reduced to 58–76% of the original without degrading accuracy (56 → 32 layers on CIFAR-10, 56 → 35 layers on CIFAR-100, 50 → 38 layers on ImageNet).
    • The original ResNet degrades in accuracy when we simply reduce layers.
    • Teacher-student training can keep accuracy on CIFAR-10, but not on the other datasets.
    [Figure 1: accuracy vs. number of layers on (a) Cifar-10, (b) Cifar-100, and (c) ImageNet for ResNet, teacher-student training, and Network Implosion; the red dotted lines represent the accuracies of the initial models in our approach.]
  • 16. Evaluation: We could reduce computation costs by erasing layers
    • Reduction of computation costs without degrading accuracy (relative to the original models):
      • # of layers: to 58–76%
      • # of multiply-accumulate operations (MACs): to 61–79%
      • forward-propagation time: to 61–77%
      • # of parameters: to 70–94%
    Table 1: computation costs for the test phase after erasing layers without accuracy loss (56- and 50-layer models are the originals for Cifar-10/100 and ImageNet, respectively).
      dataset   | # of layers | accuracy (%) | # of MACs   | forward (msec) | backward (msec) | # of parameters
      Cifar-10  | 56          | 92.88        | 8.19 × 10^7 | 6.584          | 12.93           | 585.9K
                | 32          | 93.05        | 4.99 × 10^7 | 3.970          | 7.721           | 409.1K
      Cifar-100 | 56          | 71.83        | 8.65 × 10^7 | 6.203          | 13.36           | 613.6K
                | 35          | 71.99        | 5.44 × 10^7 | 4.350          | 8.075           | 555.3K
      ImageNet  | 50          | 75.89        | 4.11 × 10^9 | 29.95          | 59.51           | 25.55M
                | 38          | 76.12        | 3.23 × 10^9 | 22.97          | 46.53           | 23.80M
    • The inference time is reduced to 61–77% without degrading accuracy or increasing model size.
  • 17. Summary:
    • Reducing the number of layers reduces the inference time.
    • The layer-erasure and re-training scheme is effective.
    • The number of layers can be reduced without accuracy drops.
    • The inference time can be reduced to 61–77% of the original in our experiments.
  • 18. Appendix:
  • 19. Theoretical analysis: The generalization error bound can be tight
    • Theorem 1 (informal): erasing the l'-th Residual Unit from a trained L-layer ResNet classifier f, giving f', tightens the upper bound of the generalization error whenever the increase in training (empirical margin) error caused by the erasure is smaller than the corresponding decrease in the upper bound of the Rademacher average; concretely, if
      E^ρ_e(f') − E^ρ_e(f) < (8M(2M−1)/ρ) · (R̄_m(F̂_L) − R̄_m(F̂_[L]\l')),
      then Ē_g(f') < Ē_g(f) with probability (1−δ)².
    • The condition holds probabilistically (it is derived from a PAC bound).
    • In other words, we can start re-training the ResNet from good initial parameters in terms of the generalization error bound.
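For readability, the condition and conclusion of Theorem 1 can be typeset as below (reconstructed from the slide's excerpt of the paper; ρ is the fixed margin, δ the PAC confidence parameter, M a constant from the theorem's margin bound, E^ρ_e the empirical margin error, R̄_m the upper bound of the Rademacher average, Ē_g the generalization-error bound, f the trained L-layer classifier, and f' the classifier with the l'-th Residual Unit erased).

```latex
% Theorem 1 (erasure condition), reconstructed from the slide excerpt.
\[
  E^{\rho}_{e}(f') - E^{\rho}_{e}(f)
  \;<\;
  \frac{8M(2M-1)}{\rho}
  \Bigl( \bar{R}_m\bigl(\hat{\mathcal{F}}_L\bigr)
       - \bar{R}_m\bigl(\hat{\mathcal{F}}_{[L]\setminus l'}\bigr) \Bigr)
  \;\;\Longrightarrow\;\;
  \bar{E}_g(f') < \bar{E}_g(f)
  \quad \text{with probability } (1-\delta)^2 .
\]
```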
  • 20. Additional results: ResNet-1001
    [Figure 1: accuracies of 1001-layer models on (a) Cifar-10 and (b) Cifar-100. The red lines represent accuracies of the initial models in our approach (Network Implosion); the blue dashed lines represent the accuracies of the original 1001-layer ResNets reported in (He et al. 2016). Our method uses fewer layers but achieves higher accuracies than the original ResNets.]
    [Table: memory consumption with 1001-layer initial models — Cifar-10: ResNet 85.6 MB vs. Network Implosion 42.6 MB; Cifar-100: ResNet 85.9 MB vs. Network Implosion 42.8 MB.]
    [Table: processing times of forward and backward propagation with 1001-layer initial models.]
    [Figure 2: histogram of the weights w_l in the 1001-layer model after training.]
  • 21. Additional results: Tiny-ImageNet dataset
    • 200 classes; each class has 500 images; image size: 64×64×3.
    [Figure 3: accuracy vs. number of layers on Tiny-ImageNet. The red lines represent accuracies of the initial models in our approach (Network Implosion), which achieves higher accuracies than the original ResNets (blue dashed line).]
    [Table: average processing times on Tiny-ImageNet — ResNet: 0.04146 s forward, 0.1635 s forward+backward; Network Implosion: 0.02129 s forward, 0.09480 s forward+backward.]
  • 22. Additional results: # of layers for each stage
    Table 7: the numbers of layers in each stage after erasing layers without accuracy loss (56- and 50-layer models are the originals for Cifar-10/100 and ImageNet, respectively).
      dataset   | total # of layers | accuracy (%) | Stage 1 | Stage 2 | Stage 3 | Stage 4
      Cifar-10  | 56                | 92.88        | 18      | 18      | 18      | -
                | 32                | 93.05        | 3       | 15      | 12      | -
      Cifar-100 | 56                | 71.83        | 18      | 18      | 18      | -
                | 35                | 71.99        | 3       | 12      | 18      | -
      ImageNet  | 50                | 75.89        | 9       | 12      | 18      | 9
                | 38                | 76.12        | 6       | 6       | 15      | 9
    • Layers near the input (Stage 1) are aggressively erased.
    • Layers near the output remain.