Copyright©2019 NTT Corp. All Rights Reserved.
Network Implosion:
Effective Model Compression for ResNets via
Static Layer Pruning and Retraining
Yasutoshi	Ida,	Yasuhiro	Fujiwara	
NTT Software Innovation Center, Japan
2Copyright©2019 NTT Corp. All Rights Reserved.
Background:
Convolutional Neural Networks are used in many applications
• Image classification, object detection, segmentation…
• The inference is performed by forward propagation.
[Figure: forward propagation from input to output through the layers]
3Copyright©2019 NTT Corp. All Rights Reserved.
Background:
Convolutional Neural Networks have many layers
• Convolutional Neural Networks (CNNs) achieve high accuracy in many applications by stacking many layers.
• ResNet is a standard CNN-based model.
[Figure: "Revolution of Depth" — ImageNet classification top-5 error (%)]
ILSVRC'10: 28.2 (shallow), ILSVRC'11: 25.8 (shallow), ILSVRC'12 AlexNet: 16.4 (8 layers), ILSVRC'13: 11.7 (8 layers), ILSVRC'14 VGG: 7.3 (19 layers), ILSVRC'14 GoogleNet: 6.7 (22 layers), ILSVRC'15 ResNet: 3.57 (152 layers)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.
https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
4Copyright©2019 NTT Corp. All Rights Reserved.
Background:
Many layers incur a long processing time for inference
• Many layers incur high computation costs, such as time and memory consumption, for the inference (forward propagation).
•  Reducing the inference costs is important for service deployment.
[Figure: forward propagation through many layers needs a long time]
5Copyright©2019 NTT Corp. All Rights Reserved.
Challenge:
Erasing multiple layers without degrading accuracy
• Erasing layers to speed up forward propagation / reduce the model size.
• Most of the previous methods sacrifice:
•  accuracy [Huang et al., ECCV 2018][Yu et al., CVPR 2018]
•  memory consumption [Veit et al., ECCV 2018][Wu et al., CVPR 2018]
• Can we erase layers without sacrificing accuracy and memory consumption?
6Copyright©2019 NTT Corp. All Rights Reserved.
Preliminary:
Convolutional layer and Residual Unit
• Convolutional layer: filters slide over images/activation maps
• Residual Unit: convolutional layers with an identity map
[Figure: a 5x5x3 filter convolves (slides) over all spatial locations of a 32x32x3 image, producing a 28x28x1 activation map (cs231n, Lecture 5)]
[Figure: full pre-activation Residual Unit: BN, ReLU, and weight (convolution) layers with an identity shortcut]
The l-th Residual Unit is x_{l+1} = x_l + F(x_l), (1), where x_l is the input to the l-th Residual Unit and F(·) is the nonlinear map consisting of convolutional layers, batch normalizations, and ReLUs; the shortcut is the identity map.
http://cs231n.github.io/convolutional-networks/
https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
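As a concrete illustration of Equation (1), a full pre-activation Residual Unit can be written as follows. This is a minimal PyTorch sketch added for this slide, not the authors' implementation; the channel count and kernel size are arbitrary example choices.

import torch.nn as nn

class ResidualUnit(nn.Module):
    """Full pre-activation Residual Unit: x_{l+1} = x_l + F(x_l)."""
    def __init__(self, channels):
        super().__init__()
        # F(.): BN -> ReLU -> conv -> BN -> ReLU -> conv
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # identity map (shortcut) + nonlinear map F(x)
        return x + self.f(x)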
7Copyright©2019 NTT Corp. All Rights Reserved.
Preliminary:
Residual Network (ResNet)
• ResNet stacks Residual Units to build a deep structure.
• ResNet is used as a standard model for computer vision tasks.
[Figure: a ResNet stacks Residual Units (full pre-activation: BN, ReLU, and weight layers with identity shortcuts) between input and output; forward propagation flows through the stacked units]
https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
8Copyright©2019 NTT Corp. All Rights Reserved.
Problem Description:
Layer-level pruning for ResNet
• Our problem is layer-level pruning.
•  We consider the standard image classification task, but our method can be used in other tasks such as detection and segmentation.
[Figure: different pruning granularities for a convolutional layer with 3 convolutional filters of size 3x3x3; this work targets layer-level pruning]
Cheng et al., Recent Advances in Efficient Computation of Deep Convolutional Neural Networks, 2018.
9Copyright©2019 NTT Corp. All Rights Reserved.
Problem Description:
Layer-level pruning for ResNet
• The key points of the problem are as follows:
1) How to select the layers that will be erased.
2) How to keep the accuracy.
10Copyright©2019 NTT Corp. All Rights Reserved.
Proposed solution:
1) How to select layers
• Introducing a priority into the Residual Unit
• We can select unimportant Residual Units according to the values of |w_{l}|.
•  A small |w_{l}| scales down the signal of the nonlinear map.
•  We can erase the Residual Unit by erasing the nonlinear map.
Residual Unit: x_{l+1} = x_l + F(x_l)
Residual Unit with priority: x_{l+1} = x_l + w_l F(x_l)
The priority w_l is a scalar that can be trained by back propagation. If |w_l| is small, it scales down the output of F(·); in other words, F(·) has little effect on the result, so we can erase the F(·) whose w_l has a small absolute value.
[Figure: full pre-activation Residual Unit diagrams, without and with the priority w_l]
https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
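A minimal PyTorch sketch (not the authors' code) of the priority-weighted Residual Unit x_{l+1} = x_l + w_l F(x_l), together with a helper that ranks units by |w_l|; the class and function names are illustrative.

import torch
import torch.nn as nn

class WeightedResidualUnit(nn.Module):
    """Residual Unit with a learnable scalar priority: x_{l+1} = x_l + w_l * F(x_l)."""
    def __init__(self, f_module):
        super().__init__()
        self.f = f_module                     # nonlinear map F(.): convolutions, BNs, ReLUs
        self.w = nn.Parameter(torch.ones(1))  # priority w_l, trained by back propagation

    def forward(self, x):
        return x + self.w * self.f(x)

def least_important_unit(units):
    """Return the index of the Residual Unit whose priority |w_l| is smallest."""
    return min(range(len(units)), key=lambda i: units[i].w.abs().item())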
12Copyright©2019 NTT Corp. All Rights Reserved.
Proposed solution:
2) How to keep the accuracy
• Re-training after erasing a Residual Unit
•  Re-training is a traditional strategy for pruning methods [LeCun et al., NeurIPS 1989]
Repeat: train ResNet -> erase the Residual Unit with the smallest priority |w_{l}| -> re-train ResNet with a large learning rate (sketched below).
Key points:
1. use a large learning rate for re-training
2. erase one Residual Unit at a time
3. do not erase Residual Units right after downsampling or channel increasing
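The erase-and-retrain loop above could be sketched as follows. The train, evaluate, and is_protected hooks are hypothetical placeholders (standing for ordinary training with a large learning rate, validation-accuracy measurement, and the rule that units right after downsampling or channel increase are never erased); this is an illustrative sketch under those assumptions, not the authors' released code.

from typing import Callable, List

def network_implosion(units: List["WeightedResidualUnit"],
                      target_num_units: int,
                      train: Callable[[], None],      # retrains the current model with a large learning rate
                      evaluate: Callable[[], float],  # returns validation accuracy of the current model
                      is_protected: Callable[[object], bool] = lambda u: False):
    """Sketch of the train / erase / re-train loop (hypothetical interface)."""
    train()                                  # initial training of the full ResNet
    baseline_acc = evaluate()
    while len(units) > target_num_units:
        # select the erasable Residual Unit with the smallest priority |w_l|
        candidates = [i for i, u in enumerate(units) if not is_protected(u)]
        idx = min(candidates, key=lambda i: units[i].w.abs().item())
        erased = units.pop(idx)              # erase one Residual Unit at a time
        train()                              # re-train with a large learning rate
        acc = evaluate()
        if acc < baseline_acc:               # stop once accuracy drops; put the last unit back
            units.insert(idx, erased)
            break
    return units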
13Copyright©2019 NTT Corp. All Rights Reserved.
Proposed algorithm:
Network Implosion
[Figure: analogy between a building implosion and "ResNet implosion" — through repeated training, erasing a Residual Unit, and re-training, the ResNet gradually implodes (shrinks)]
https://en.wikipedia.org/wiki/Building_implosion
14Copyright©2019 NTT Corp. All Rights Reserved.
Evaluation:
Setting
• Task: image classification
• Datasets: CIFAR10/100, ILSVRC2012 ImageNet
• Model: ResNet56 for CIFAR10/100, ResNet50 for ImageNet
• Each Residual Unit has 3 convolutional layers.
• Metrics:
1) Tradeoff between accuracy and # of layers.
2) Computation costs such as processing time and model size.
• Compared with the standard ResNet and Knowledge Distillation (teacher-student training).
• Other hyperparameters are described in detail in the paper.
15Copyright©2019 NTT Corp. All Rights Reserved.
Evaluation:
We could erase layers even if we use a real-world dataset
• The number of layers is reduced to 58 ~ 76 % without degrading accuracy.
• The original ResNet degrades in accuracy when we reduce its layers.
• Teacher-student training can keep the accuracy on CIFAR10, but cannot on the other datasets.
56 layers -> 32 layers (Cifar-10), 56 layers -> 35 layers (Cifar-100), 50 layers -> 38 layers (ImageNet)
[Figure 1: accuracy (%) vs. the number of layers on Cifar-10, Cifar-100, and ImageNet for ResNet, teacher-student, and Network Implosion; the red dotted lines represent the accuracies of the initial models in our approach]
16Copyright©2019 NTT Corp. All Rights Reserved.
Evaluation:
We could reduce computation costs by erasing layers
• Computation costs are reduced to the following fractions of the originals without degrading accuracy:
• # of layers: 58 ~ 76 %
• # of multiply-accumulate operations (MACs): 61 ~ 79 %
• time of forward propagation: 61 ~ 77 %
• # of parameters: 70 ~ 94 %
Table 1. The computation costs for the test phase after erasing layers without accuracy loss. The 56- and 50-layer models are the original models for Cifar-10/100 and ImageNet, respectively.
dataset     # of layers   accuracy (%)   # of MACs      forward (msec)   backward (msec)   # of parameters
Cifar-10    56            92.88          8.19 x 10^7    6.584            12.93             585.9K
            32            93.05          4.99 x 10^7    3.970            7.721             409.1K
Cifar-100   56            71.83          8.65 x 10^7    6.203            13.36             613.6K
            35            71.99          5.44 x 10^7    4.350            8.075             555.3K
ImageNet    50            75.89          4.11 x 10^9    29.95            59.51             25.55M
            38            76.12          3.23 x 10^9    22.97            46.53             23.80M
The inference time is reduced to 61 ~ 77 % of the original without degrading accuracy or increasing the model size.
17Copyright©2019 NTT Corp. All Rights Reserved.
Summary:
• Reducing the # of layers reduces the inference time.
• The layer-erasure and re-training scheme is effective.
• The # of layers can be reduced without accuracy drops.
• The inference time can be reduced to 61 ~ 77 % in our experiments.
18Copyright©2019 NTT Corp. All Rights Reserved.
Appendix:
19Copyright©2019 NTT Corp. All Rights Reserved.
Theoretical analysis:
The generalization error bound can be tight
• Theorem 1 (informal): we can obtain a tight upper bound of the generalization error by erasing layers from a trained ResNet when the following condition holds:
• The condition holds probabilistically (because it utilizes a PAC bound).
• In other words, we can start re-training the ResNet from good initial parameters in terms of the generalization error bound.
Theorem 1. Let $\bar{E}_g(\cdot)$ be an upper bound of the generalization error, and $\rho$ be a fixed margin. Suppose that Lemma 6 holds and $E^\rho_e(f) < E^\rho_e(f')$, where $f \in \hat{\mathcal{F}}_L$ is a trained $L$-layered FC-ResNet classifier such that $\hat{W}_{l'} > 1$ for the multi-label classification problem, and $f' \in \hat{\mathcal{F}}_{[L]\setminus l'}$ is the classifier obtained by erasing the $l'$-th Residual Unit from $f$. For $\forall \delta > 0$ and $\forall f \in \mathcal{F}$, when the condition
$E^\rho_e(f') - E^\rho_e(f) < \frac{8M(2M-1)}{\rho}\,(\bar{R}_m(\hat{\mathcal{F}}_L) - \bar{R}_m(\hat{\mathcal{F}}_{[L]\setminus l'}))$
holds, we have the following bound with probability $(1-\delta)^2$:
$\bar{E}_g(f') < \bar{E}_g(f)$. (12)
Here $E^\rho_e(f)$ and $E^\rho_e(f')$ are the training errors before and after the erasure, and $\bar{R}_m(\hat{\mathcal{F}}_L)$ and $\bar{R}_m(\hat{\mathcal{F}}_{[L]\setminus l'})$ are the upper bounds of the Rademacher averages before and after the erasure.
20Copyright©2019 NTT Corp. All Rights Reserved.
Additional results:
ResNet1001
[Figure: the accuracies of 1001-layer models on Cifar-10 and Cifar-100. The red lines represent accuracies of the initial models in our approach (Network Implosion); the blue dashed lines represent accuracies of the original 1001-layer ResNets reported in (He et al. 2016). Our method uses fewer layers but achieves higher accuracies than the original ResNets.]
[Table: memory consumption (MB) with 1001-layer initial models — Cifar-10: ResNet 85.6, NI 42.6; Cifar-100: ResNet 85.9, NI 42.8 (NI = Network Implosion)]
[Figure: histogram of the weights w_l in the 1001-layer model after training]
[Table: average processing times (sec) on Tiny-ImageNet — ResNet: forward 0.04146, forward+backward 0.1635; NI: forward 0.02129, forward+backward 0.09480]
21Copyright©2019 NTT Corp. All Rights Reserved.
Additional results:
Tiny-ImageNet dataset
[Figure 3: the accuracies on Tiny-ImageNet. The red lines represent accuracies of the initial models in our approach (Network Implosion), which achieves higher accuracies than the original ResNets (blue dashed line).]
• 200 classes
• each class has 500 images
• image size: 64*64*3
22Copyright©2019 NTT Corp. All Rights Reserved.
Additional results:
# of layers for each stage
Table 7: The numbers of layers in each stage after erasing layers without accuracy loss. The 56- and 50-layer models are the original models for Cifar-10/100 and ImageNet, respectively. The layers in Stage 1, near the input, are aggressively erased.
dataset     total # of layers   accuracy (%)   Stage 1   Stage 2   Stage 3   Stage 4
Cifar-10    56                  92.88          18        18        18        -
            32                  93.05          3         15        12        -
Cifar-100   56                  71.83          18        18        18        -
            35                  71.99          3         12        18        -
ImageNet    50                  75.89          9         12        18        9
            38                  76.12          6         6         15        9
• Layers near the input are aggressively erased.
• Layers near the output remain.
More Related Content

What's hot

PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
Jinwon Lee
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
Jinwon Lee
 
20150703.journal club
20150703.journal club20150703.journal club
20150703.journal club
Hayaru SHOUNO
 
ShuffleNet - PR054
ShuffleNet - PR054ShuffleNet - PR054
ShuffleNet - PR054
Jinwon Lee
 
Parallel convolutional neural network
Parallel  convolutional neural networkParallel  convolutional neural network
Parallel convolutional neural network
Abdullah Khan Zehady
 
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
IJERA Editor
 
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral ImagesBand Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
IDES Editor
 
Motion estimation overview
Motion estimation overviewMotion estimation overview
Motion estimation overview
Yoss Cohen
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
SungminYou
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
suga93
 
Cnn method
Cnn methodCnn method
Cnn method
AmirSajedi1
 
Deeplab
DeeplabDeeplab
Deeplab
Cheng-You Lu
 
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
grssieee
 
Deep Belief nets
Deep Belief netsDeep Belief nets
Deep Belief nets
butest
 
Rotman Lens Performance Analysis
Rotman Lens Performance AnalysisRotman Lens Performance Analysis
Rotman Lens Performance Analysis
IDES Editor
 
Deep Belief Networks
Deep Belief NetworksDeep Belief Networks
Deep Belief Networks
Hasan H Topcu
 
Galgo f
Galgo fGalgo f
Dsc
DscDsc

What's hot (18)

PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
20150703.journal club
20150703.journal club20150703.journal club
20150703.journal club
 
ShuffleNet - PR054
ShuffleNet - PR054ShuffleNet - PR054
ShuffleNet - PR054
 
Parallel convolutional neural network
Parallel  convolutional neural networkParallel  convolutional neural network
Parallel convolutional neural network
 
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
Image Restoration and Denoising By Using Nonlocally Centralized Sparse Repres...
 
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral ImagesBand Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
 
Motion estimation overview
Motion estimation overviewMotion estimation overview
Motion estimation overview
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
 
Cnn method
Cnn methodCnn method
Cnn method
 
Deeplab
DeeplabDeeplab
Deeplab
 
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
 
Deep Belief nets
Deep Belief netsDeep Belief nets
Deep Belief nets
 
Rotman Lens Performance Analysis
Rotman Lens Performance AnalysisRotman Lens Performance Analysis
Rotman Lens Performance Analysis
 
Deep Belief Networks
Deep Belief NetworksDeep Belief Networks
Deep Belief Networks
 
Galgo f
Galgo fGalgo f
Galgo f
 
Dsc
DscDsc
Dsc
 

Similar to Network Implosion: Effective Model Compression for ResNets via Static Layer Pruning and Retraining

2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
ali hassan
 
convolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningconvolutional_neural_networks in deep learning
convolutional_neural_networks in deep learning
ssusere5ddd6
 
1801.06434
1801.064341801.06434
1801.06434
emil_laurence
 
MobileNet V3
MobileNet V3MobileNet V3
MobileNet V3
Wonbeom Jang
 
1409.1556.pdf
1409.1556.pdf1409.1556.pdf
1409.1556.pdf
Zuhriddin1
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
Pierre de Lacaze
 
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
csandit
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
Mohamed Loey
 
Spine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationSpine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localization
Devansh16
 
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITIONVERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
Willy Marroquin (WillyDevNET)
 
ResNet.pptx
ResNet.pptxResNet.pptx
ResNet.pptx
ssuser2624f71
 
N ns 1
N ns 1N ns 1
N ns 1
Thy Selaroth
 
Lecture 5: Convolutional Neural Network Models
Lecture 5: Convolutional Neural Network ModelsLecture 5: Convolutional Neural Network Models
Lecture 5: Convolutional Neural Network Models
Mohamed Loey
 
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
taeseon ryu
 
ResNet.pptx
ResNet.pptxResNet.pptx
ResNet.pptx
ssuser2624f71
 
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_ReportSaptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Sitakanta Mishra
 
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
IJNSA Journal
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
CNN.pptx
CNN.pptxCNN.pptx
CNN.pptx
AbrarRana10
 
IRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET- Autonomous Quadrotor Control using Convolutional Neural NetworksIRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET Journal
 

Similar to Network Implosion: Effective Model Compression for ResNets via Static Layer Pruning and Retraining (20)

2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
 
convolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningconvolutional_neural_networks in deep learning
convolutional_neural_networks in deep learning
 
1801.06434
1801.064341801.06434
1801.06434
 
MobileNet V3
MobileNet V3MobileNet V3
MobileNet V3
 
1409.1556.pdf
1409.1556.pdf1409.1556.pdf
1409.1556.pdf
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
 
Spine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationSpine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localization
 
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITIONVERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
 
ResNet.pptx
ResNet.pptxResNet.pptx
ResNet.pptx
 
N ns 1
N ns 1N ns 1
N ns 1
 
Lecture 5: Convolutional Neural Network Models
Lecture 5: Convolutional Neural Network ModelsLecture 5: Convolutional Neural Network Models
Lecture 5: Convolutional Neural Network Models
 
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C...
 
ResNet.pptx
ResNet.pptxResNet.pptx
ResNet.pptx
 
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_ReportSaptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
 
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
GENERALIZED LEGENDRE POLYNOMIALS FOR SUPPORT VECTOR MACHINES (SVMS) CLASSIFIC...
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
 
CNN.pptx
CNN.pptxCNN.pptx
CNN.pptx
 
IRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET- Autonomous Quadrotor Control using Convolutional Neural NetworksIRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
IRJET- Autonomous Quadrotor Control using Convolutional Neural Networks
 

More from NTT Software Innovation Center

A Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between BusinessesA Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between Businesses
NTT Software Innovation Center
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
NTT Software Innovation Center
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
NTT Software Innovation Center
 
不揮発WALバッファ
不揮発WALバッファ不揮発WALバッファ
不揮発WALバッファ
NTT Software Innovation Center
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
NTT Software Innovation Center
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
NTT Software Innovation Center
 
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
NTT Software Innovation Center
 
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
NTT Software Innovation Center
 
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
NTT Software Innovation Center
 
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
NTT Software Innovation Center
 
Why and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategyWhy and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategy
NTT Software Innovation Center
 
外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法
NTT Software Innovation Center
 
デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題
NTT Software Innovation Center
 
Building images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitBuilding images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKit
NTT Software Innovation Center
 
Real-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility servicesReal-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility services
NTT Software Innovation Center
 
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
NTT Software Innovation Center
 
統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組
NTT Software Innovation Center
 
MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針
NTT Software Innovation Center
 
OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向
NTT Software Innovation Center
 
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービスNTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTT Software Innovation Center
 

More from NTT Software Innovation Center (20)

A Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between BusinessesA Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between Businesses
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
 
不揮発WALバッファ
不揮発WALバッファ不揮発WALバッファ
不揮発WALバッファ
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
 
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
 
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
 
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
 
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
 
Why and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategyWhy and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategy
 
外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法
 
デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題
 
Building images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitBuilding images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKit
 
Real-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility servicesReal-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility services
 
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
 
統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組
 
MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針
 
OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向
 
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービスNTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
NTTのR&Dを支えるNTTコミュニケーションズのIT基盤サービス
 

Recently uploaded

UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
amsjournal
 
Transformers design and coooling methods
Transformers design and coooling methodsTransformers design and coooling methods
Transformers design and coooling methods
Roger Rozario
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
VICTOR MAESTRE RAMIREZ
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
mamamaam477
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
MIGUELANGEL966976
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
Anant Corporation
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
KrishnaveniKrishnara1
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
JamalHussainArman
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 

Recently uploaded (20)

UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
 
Transformers design and coooling methods
Transformers design and coooling methodsTransformers design and coooling methods
Transformers design and coooling methods
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 

Network Implosion: Effective Model Compression for ResNets via Static Layer Pruning and Retraining

  • 1. Copyright©2019 NTT Corp. All Rights Reserved. Network Implosion: Effec4ve Model Compression for ResNets via Sta4c Layer Pruning and Retraining Yasutoshi Ida, Yasuhiro Fujiwara NTT So3ware Innova7on Center, Japan
  • 2. 2Copyright©2019 NTT Corp. All Rights Reserved. Background: Convolu(onal Neural Networks are used in many applica(ons • Image classifica(on, object detec(on, segmenta(on… • The inference is performed by forward propaga4on. input output ・・・・ ・・・・ layers forward propaga(on
  • 3. 3Copyright©2019 NTT Corp. All Rights Reserved. Background: Convolu(onal Neural Networks have many layers • Convolu(onal Neural Networks (CNNs) achieve high accuracy in many applica(ons by stacking many layers. • ResNet is a standard CNN-based model. Revolution of Depth 3.57 6.7 7.3 11.7 16.4 25.8 28.2 ILSVRC'15 ResNet ILSVRC'14 GoogleNet ILSVRC'14 VGG ILSVRC'13 ILSVRC'12 AlexNet ILSVRC'11 ILSVRC'10 ImageNet Classification top-5 error (%) shallow8 layers 19 layers22 layers 152 layers Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015. 8 layers shallow deep very deep ultra deep hCps://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
  • 4. 4Copyright©2019 NTT Corp. All Rights Reserved. Background: Many layers incur long processing (me of inference • Many layers incur high computa(on costs such as (me and memory consump(on for the inference (forward propaga(on). •  Reducing the inference costs is important for the service deployment. input output ・・・・ ・・・・ ・・・・ ・・・・ many layers forward propaga(on need long (me for forward propaga(on
  • 5. 5Copyright©2019 NTT Corp. All Rights Reserved. Challenge: Erasing mul(ple layers without degrading accuracy • Erasing layers to speed up forward propaga4on/reduce the model size. • The most of previous methods sacrifice: •  accuracy [Huang et al., ECCV 2018][Yu et al., CVPR 2018] •  memory consump(on [Veit et al., ECCV 2018][Wu et al., CVPR 2018] • Can we erase layers without sacrificing accuracy and memory consump4on?
  • 6. 6Copyright©2019 NTT Corp. All Rights Reserved. Preliminary: Convolu(onal layer and Residual Unit • Convolu4onal layer: filters slides on images/ac(va(on maps • Residual Unit: convolu(on layers with iden(ty map Fei-Fei Li & Justin Johnson & Serena Yeung April 18, 2017Lecture 5 -Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 5 - April 18, 201732 32 32 3 Convolution Layer 32x32x3 image 5x5x3 filter convolve (slide) over all spatial locations activation map 1 28 28 ase of Highway networks that can also stack many layers by introducing gating per, we focuse on ResNet because our idea is suitable for ResNet as we will roduce Residual Networks and some evidences that explain why we can erase orks ResNets) are CNN-based models that have blocks called Residual Units. ResNets Units and consist deep architectures. The l-th Residual Unit described in is xl+1 = xl + F(xl), (1) to the l-th Residual Unit. F(·) is a module that consists of convolutions, batch Rectified Linear Units (ReLUs). Therefore, Residual Unit consists of identity ingdi↵erentactivation ResNet-164 5.93 6.50 6.14 5.91 5.46 ReLU weight BN ReLU weight BN addition xl xl+1 ht U ht U ly on (e)fullpre-activation iden(ty map nonlinear map: Convolu(onal layers, Batch normaliza(ons, and ReLUs. hCp://cs231n.github.io/convolu(onal-networks/ hCps://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
  • 7. Preliminary: Residual Network (ResNet)
    • ResNet stacks Residual Units to build a deep structure; forward propagation runs from input to output through the stacked units (see the sketch below).
    • ResNet is used as a standard model for computer vision tasks.
    [Figure: pre-activation Residual Unit variants (BN, ReLU, weight layers, and the addition with the identity shortcut) and their CIFAR-10 errors for ResNet-110/164, from the ICML 2016 deep residual networks tutorial.]
    https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
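A minimal sketch (PyTorch assumed) of a pre-activation Residual Unit, x_{l+1} = x_l + F(x_l) as in Equation (1) quoted on slide 10, and of stacking several units into a deep trunk. The channel width (64) and the number of stacked units (18) are illustrative assumptions, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """x_{l+1} = x_l + F(x_l), where F = BN -> ReLU -> conv -> BN -> ReLU -> conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # identity map plus nonlinear map

# Stacking Residual Units builds the deep structure traversed by forward propagation.
trunk = nn.Sequential(*[PreActResidualUnit(64) for _ in range(18)])
out = trunk(torch.randn(1, 64, 32, 32))
```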
  • 8. Problem Description: Layer-level pruning for ResNet
    • Our problem is layer-level pruning.
    • We consider the standard image classification task, but our method can be used in other tasks such as detection and segmentation.
    [Figure: different pruning methods for a convolutional layer with three 3×3×3 filters; this work targets layer-level pruning. From Cheng et al., "Recent Advances in Efficient Computation of Deep Convolutional Neural Networks", 2018.]
  • 9. Problem Description: Layer-level pruning for ResNet
    • The key points of the problem are as follows: 1) how to select the layers to be erased, and 2) how to keep the accuracy.
  • 10. Proposed solution: 1) How to select layers
    • Introduce a priority into each Residual Unit: x_{l+1} = x_l + w_l F(x_l), where the priority w_l is a scalar that can be learned by back propagation (the plain Residual Unit is x_{l+1} = x_l + F(x_l), Equation (1)).
    • We can select unimportant Residual Units according to the values of |w_{l}|:
      • a small |w_{l}| scales down the signal of the nonlinear map F(·), so F(·) has little effect on the result;
      • we can therefore erase such a Residual Unit by erasing its nonlinear map, which leaves x_{l+1} = x_l.
    A sketch of a priority-weighted Residual Unit follows below.
    https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
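A minimal sketch (PyTorch assumed) of a Residual Unit with a trainable scalar priority, x_{l+1} = x_l + w_l F(x_l). Initializing w_l to 1 is an assumption for illustration, not necessarily the paper's choice.

```python
import torch
import torch.nn as nn

class WeightedResidualUnit(nn.Module):
    """Residual Unit with priority: x_{l+1} = x_l + w_l * F(x_l)."""
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f                               # nonlinear map F(.): convs, BNs, ReLUs
        self.w = nn.Parameter(torch.ones(1))     # scalar priority w_l, learned by backprop

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.w * self.f(x)

# Units whose trained |w_l| is small scale down F(.) and are candidates for erasure;
# erasing F(.) from such a unit leaves x_{l+1} = x_l (the identity map only).
```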
  • 12. Proposed solution: 2) How to keep the accuracy
    • Re-training after erasing a Residual Unit.
      • Re-training is a traditional strategy for pruning methods [LeCun et al., NeurIPS 1989].
    • The procedure repeats: train ResNet → erase the Residual Unit with the smallest priority |w_{l}| → re-train ResNet with a large learning rate → repeat.
    • Key points: 1. use a large learning rate for re-training; 2. erase one Residual Unit at a time; 3. do not erase Residual Units right after downsampling or channel increases.
    A code sketch of this loop follows below.
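A sketch of the train/erase/re-train loop described on this slide, reusing the WeightedResidualUnit from the previous sketch. The names train_fn, eval_fn, target_units, and acc_floor are hypothetical stand-ins, and the check that protects units right after downsampling or channel increases (key point 3) is omitted for brevity.

```python
import torch.nn as nn

def network_implosion(trunk: nn.ModuleList, train_fn, eval_fn,
                      target_units: int, acc_floor: float) -> nn.ModuleList:
    """Repeatedly erase the Residual Unit with the smallest |w_l| and re-train."""
    train_fn(trunk)                                        # initial training
    while True:
        erasable = [i for i, u in enumerate(trunk) if hasattr(u, "w")]
        if len(erasable) <= target_units:
            break
        # key point 2: erase exactly one unit at a time, the one with the smallest |w_l|
        idx = min(erasable, key=lambda i: trunk[i].w.detach().abs().item())
        trunk[idx] = nn.Identity()                         # x_{l+1} = x_l after erasure
        train_fn(trunk, large_lr=True)                     # key point 1: large learning rate
        if eval_fn(trunk) < acc_floor:                     # stop once accuracy degrades
            break
    return trunk
```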
  • 13. Proposed algorithm: Network Implosion
    [Figure: analogy with a building implosion; the ResNet "implodes" step by step through repeated training, implosion (layer erasure), and re-training.]
    https://en.wikipedia.org/wiki/Building_implosion
  • 14. Evaluation: Setting
    • Task: image classification.
    • Datasets: CIFAR-10/100, ILSVRC2012 ImageNet.
    • Models: ResNet-56 for CIFAR-10/100, ResNet-50 for ImageNet; each Residual Unit has 3 convolutional layers.
    • Metrics: 1) trade-off between accuracy and the number of layers; 2) computation costs such as processing time and model size.
    • Compared with the standard ResNet and Knowledge Distillation (teacher-student training).
    • Other hyperparameters are described in detail in the paper.
  • 15. Evaluation: We could erase layers even on a real-world dataset
    • The number of layers is reduced to 58–76% of the original without degrading accuracy (56 → 32 layers on CIFAR-10, 56 → 35 layers on CIFAR-100, 50 → 38 layers on ImageNet).
    • The original ResNet degrades in accuracy when we simply reduce layers.
    • Teacher-student training can keep accuracy on CIFAR-10, but not on the other datasets.
    [Figure 1: accuracy vs. number of layers on (a) Cifar-10, (b) Cifar-100, and (c) ImageNet for ResNet, teacher-student training, and Network Implosion; the red dotted lines represent the accuracies of the initial models in our approach.]
  • 16. Evaluation: We could reduce computation costs by erasing layers
    • Reduction of computation costs without degrading accuracy (relative to the original models):
      • # of layers: to 58–76%
      • # of multiply-accumulate operations (MACs): to 61–79%
      • forward-propagation time: to 61–77%
      • # of parameters: to 70–94%
    Table 1: computation costs for the test phase after erasing layers without accuracy loss (56- and 50-layer models are the originals for Cifar-10/100 and ImageNet, respectively).
      dataset   | # of layers | accuracy (%) | # of MACs   | forward (msec) | backward (msec) | # of parameters
      Cifar-10  | 56          | 92.88        | 8.19 × 10^7 | 6.584          | 12.93           | 585.9K
                | 32          | 93.05        | 4.99 × 10^7 | 3.970          | 7.721           | 409.1K
      Cifar-100 | 56          | 71.83        | 8.65 × 10^7 | 6.203          | 13.36           | 613.6K
                | 35          | 71.99        | 5.44 × 10^7 | 4.350          | 8.075           | 555.3K
      ImageNet  | 50          | 75.89        | 4.11 × 10^9 | 29.95          | 59.51           | 25.55M
                | 38          | 76.12        | 3.23 × 10^9 | 22.97          | 46.53           | 23.80M
    • The inference time is reduced to 61–77% without degrading accuracy or increasing model size.
  • 17. Summary:
    • Reducing the number of layers reduces the inference time.
    • The layer-erasure and re-training scheme is effective.
    • The number of layers can be reduced without accuracy drops.
    • The inference time can be reduced to 61–77% of the original in our experiments.
  • 18. Appendix:
  • 19. Theoretical analysis: The generalization error bound can be tight
    • Theorem 1 (informal): erasing the l'-th Residual Unit from a trained L-layer ResNet classifier f, giving f', tightens the upper bound of the generalization error whenever the increase in training (empirical margin) error caused by the erasure is smaller than the corresponding decrease in the upper bound of the Rademacher average; concretely, if
      E^ρ_e(f') − E^ρ_e(f) < (8M(2M−1)/ρ) · (R̄_m(F̂_L) − R̄_m(F̂_[L]\l')),
      then Ē_g(f') < Ē_g(f) with probability (1−δ)².
    • The condition holds probabilistically (it is derived from a PAC bound).
    • In other words, we can start re-training the ResNet from good initial parameters in terms of the generalization error bound.
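For readability, the condition and conclusion of Theorem 1 can be typeset as below (reconstructed from the slide's excerpt of the paper; ρ is the fixed margin, δ the PAC confidence parameter, M a constant from the theorem's margin bound, E^ρ_e the empirical margin error, R̄_m the upper bound of the Rademacher average, Ē_g the generalization-error bound, f the trained L-layer classifier, and f' the classifier with the l'-th Residual Unit erased).

```latex
% Theorem 1 (erasure condition), reconstructed from the slide excerpt.
\[
  E^{\rho}_{e}(f') - E^{\rho}_{e}(f)
  \;<\;
  \frac{8M(2M-1)}{\rho}
  \Bigl( \bar{R}_m\bigl(\hat{\mathcal{F}}_L\bigr)
       - \bar{R}_m\bigl(\hat{\mathcal{F}}_{[L]\setminus l'}\bigr) \Bigr)
  \;\;\Longrightarrow\;\;
  \bar{E}_g(f') < \bar{E}_g(f)
  \quad \text{with probability } (1-\delta)^2 .
\]
```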
  • 20. Additional results: ResNet-1001
    [Figure 1: accuracies of 1001-layer models on (a) Cifar-10 and (b) Cifar-100. The red lines represent accuracies of the initial models in our approach (Network Implosion); the blue dashed lines represent the accuracies of the original 1001-layer ResNets reported in (He et al. 2016). Our method uses fewer layers but achieves higher accuracies than the original ResNets.]
    [Table: memory consumption with 1001-layer initial models — Cifar-10: ResNet 85.6 MB vs. Network Implosion 42.6 MB; Cifar-100: ResNet 85.9 MB vs. Network Implosion 42.8 MB.]
    [Table: processing times of forward and backward propagation with 1001-layer initial models.]
    [Figure 2: histogram of the weights w_l in the 1001-layer model after training.]
  • 21. Additional results: Tiny-ImageNet dataset
    • 200 classes; each class has 500 images; image size: 64×64×3.
    [Figure 3: accuracy vs. number of layers on Tiny-ImageNet. The red lines represent accuracies of the initial models in our approach (Network Implosion), which achieves higher accuracies than the original ResNets (blue dashed line).]
    [Table: average processing times on Tiny-ImageNet — ResNet: 0.04146 s forward, 0.1635 s forward+backward; Network Implosion: 0.02129 s forward, 0.09480 s forward+backward.]
  • 22. Additional results: # of layers for each stage
    Table 7: the numbers of layers in each stage after erasing layers without accuracy loss (56- and 50-layer models are the originals for Cifar-10/100 and ImageNet, respectively).
      dataset   | total # of layers | accuracy (%) | Stage 1 | Stage 2 | Stage 3 | Stage 4
      Cifar-10  | 56                | 92.88        | 18      | 18      | 18      | -
                | 32                | 93.05        | 3       | 15      | 12      | -
      Cifar-100 | 56                | 71.83        | 18      | 18      | 18      | -
                | 35                | 71.99        | 3       | 12      | 18      | -
      ImageNet  | 50                | 75.89        | 9       | 12      | 18      | 9
                | 38                | 76.12        | 6       | 6       | 15      | 9
    • Layers near the input (Stage 1) are aggressively erased.
    • Layers near the output remain.