(DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

Variational Dropout Sparsifies Deep Neural Networks
2017/03/24 Masahiro Suzuki

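About this paper
¤ Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov
¤ Skolkovo Institute of Science and Technology, National Research University Higher School of Economics, Moscow Institute of Physics and Technology
¤ Submitted to ICML 2017 (on arXiv 2017/2/27)
¤ Proposes a method that lets the dropout rate of Bayesian dropout be pushed all the way to its maximum.
¤ The idea itself is extremely simple.
¤ It not only yields high sparsity but also resolves a generalization problem of ordinary CNNs (fitting random labels).
¤ Why I chose it
¤ It extends [Kingma+ 15], a paper covered in this reading group two years ago.
¤ I like that a simple idea produces a large effect.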

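Bayesian inference
¤ Suppose we observe data $D = \{(x_n, y_n)\}_{n=1}^{N}$.
¤ The goal is to obtain $p(y \mid x, w) = p(D \mid w)$.
¤ In the Bayesian learning framework, we place a prior $p(w)$ on the parameters $w$.
¤ After observing $D$, the posterior over $w$ is
$$p(w \mid D) = \frac{p(D \mid w)\,p(w)}{p(D)} = \frac{p(D \mid w)\,p(w)}{\int p(D \mid w)\,p(w)\,dw}$$
¤ This procedure is called Bayesian inference.
¤ Computing the posterior requires the marginalization in the denominator, which is generally intractable. -> variational inference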
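
To make the role of the denominator concrete, here is a minimal sketch that evaluates the posterior on a grid for a one-parameter toy model. The model (y_n ~ N(w x_n, 1) with a standard normal prior), the data, and the function name `grid_posterior` are assumptions chosen for illustration and do not come from the paper. The Riemann sum plays the part of the integral; it is only feasible because w is one-dimensional.

```python
import math

def grid_posterior(data, grid, noise_std=1.0, prior_std=1.0):
    """Approximate p(w | D) on an evenly spaced grid of w values.

    Hypothetical toy model: y_n ~ N(w * x_n, noise_std^2), prior w ~ N(0, prior_std^2).
    """
    log_joint = []
    for w in grid:
        log_lik = sum(-0.5 * ((y - w * x) / noise_std) ** 2 for x, y in data)
        log_prior = -0.5 * (w / prior_std) ** 2
        log_joint.append(log_lik + log_prior)  # log p(D | w) + log p(w), up to constants
    shift = max(log_joint)  # subtract the max before exponentiating, for stability
    unnormalized = [math.exp(v - shift) for v in log_joint]
    step = grid[1] - grid[0]
    # Riemann sum standing in for the denominator, the integral of p(D | w) p(w) over w.
    evidence = sum(unnormalized) * step
    return [u / evidence for u in unnormalized]

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
grid = [-5.0 + 0.01 * i for i in range(1001)]
posterior = grid_posterior(data, grid)
w_map = grid[max(range(len(grid)), key=posterior.__getitem__)]
```

For a network with millions of weights the same grid would need an astronomically large number of points, which is why the deck moves on to variational inference.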

Variational inference
¤ Introduce an approximate distribution $q(w \mid \phi)$, written $q_\phi(w)$ below, and minimize its KL divergence from the true posterior, $D_{KL}[q(w \mid \phi) \,\|\, p(w \mid D)]$.
¤ This is equivalent to maximizing the following variational lower bound:
$$\mathcal{L}(\phi) = L_D(\phi) - D_{KL}(q_\phi(w) \,\|\, p(w)) \to \max_{\phi} \quad (1)$$
$$L_D(\phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(w)}[\log p(y_n \mid x_n, w)] \quad (2)$$
¤ With the reparameterization trick, the variational lower bound becomes differentiable with respect to $\phi$.
¤ On a minibatch $(\tilde{x}_m, \tilde{y}_m)_{m=1}^{M}$, unbiased estimators of the lower bound and of its gradient are
$$\mathcal{L}(\phi) \simeq \mathcal{L}^{SGVB}(\phi) = L_D^{SGVB}(\phi) - D_{KL}(q_\phi(w) \,\|\, p(w)) \quad (3)$$
$$L_D(\phi) \simeq L_D^{SGVB}(\phi) = \frac{N}{M} \sum_{m=1}^{M} \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m)) \quad (4)$$
$$\nabla_\phi L_D(\phi) \simeq \frac{N}{M} \sum_{m=1}^{M} \nabla_\phi \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m)) \quad (5)$$
where $w = f(\phi, \epsilon)$ is a deterministic differentiable function of a non-parametric noise $\epsilon \sim p(\epsilon)$.
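
The minibatch gradient estimator (5) can be sketched as follows. This is an illustration under assumptions not found in the paper: a scalar model y ~ N(w x, 1), a Gaussian q with mean `mu` and standard deviation `sigma`, so that f(phi, eps) = mu + sigma * eps, and hypothetical function names. For this model the exact gradient with respect to `mu` is available in closed form, which lets the estimator be compared against it.

```python
import random

def full_gradient_mu(data, mu):
    """Exact d/dmu of sum_n E_q[log p(y_n | x_n, w)] for y ~ N(w x, 1), q = N(mu, sigma^2)."""
    return sum((y - mu * x) * x for x, y in data)

def sgvb_gradient_mu(data, mu, sigma, batch_size, rng):
    """One minibatch estimate of the same gradient, following eq. (5)."""
    batch = rng.sample(data, batch_size)
    total = 0.0
    for x, y in batch:
        eps = rng.gauss(0.0, 1.0)      # non-parametric noise eps ~ p(eps)
        w = mu + sigma * eps           # reparameterized weight sample f(phi, eps)
        total += (y - w * x) * x       # d/dw log p(y | x, w); dw/dmu = 1
    return len(data) / batch_size * total  # N / M rescaling

rng = random.Random(0)
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
estimates = [sgvb_gradient_mu(data, mu=1.0, sigma=0.1, batch_size=2, rng=rng)
             for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
exact = full_gradient_mu(data, mu=1.0)
```

Individual estimates scatter widely, but their average settles near `exact`, which is what unbiasedness means here.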

Dropout
¤ In a fully-connected layer $B = AW$, dropout injects multiplicative random noise $\Xi$ into the layer input $A$ at each training iteration:
$$B = (A \odot \Xi)W, \quad \xi_{mi} \sim p(\xi) \quad (6)$$
¤ The noise is sampled from a Bernoulli distribution, $\xi_{mi} \sim \mathrm{Bernoulli}(1 - p)$, or from a Gaussian distribution, $\xi_{mi} \sim \mathcal{N}(1, \alpha = \frac{p}{1-p})$.
¤ Putting Gaussian noise $\xi_{ij} \sim \mathcal{N}(1, \alpha)$ on $W$ is equivalent to sampling $W$ from $q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2)$.
¤ The random variable $w_{ij}$ is then parameterized by $\theta_{ij}$ as follows:
$$w_{ij} = \theta_{ij}\xi_{ij} = \theta_{ij}(1 + \sqrt{\alpha}\,\epsilon_{ij}) \sim \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2), \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \quad (7)$$
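
The claim that multiplicative Gaussian noise on a weight gives a Gaussian weight with mean theta and variance alpha * theta^2 can be checked empirically. The sketch below samples weights as in (7) and computes their sample mean and variance; the function names and the values theta = 2 and p = 0.2 are illustrative assumptions.

```python
import math
import random

def alpha_from_dropout_rate(p):
    """Gaussian-dropout noise variance matching a Binary Dropout rate p: alpha = p / (1 - p)."""
    return p / (1.0 - p)

def sample_noisy_weights(theta, alpha, n_samples, rng):
    """Draw w = theta * xi with xi = 1 + sqrt(alpha) * eps, eps ~ N(0, 1), as in eq. (7)."""
    return [theta * (1.0 + math.sqrt(alpha) * rng.gauss(0.0, 1.0)) for _ in range(n_samples)]

rng = random.Random(0)
theta = 2.0
alpha = alpha_from_dropout_rate(0.2)  # p = 0.2 corresponds to alpha = 0.25
samples = sample_noisy_weights(theta, alpha, 50000, rng)
mean = sum(samples) / len(samples)
variance = sum((w - mean) ** 2 for w in samples) / len(samples)
```

With these values the sample mean should be close to theta = 2 and the sample variance close to alpha * theta^2 = 1.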

Variational dropout
¤ If we regard $q(W \mid \theta, \alpha)$ as an approximate posterior with parameters $(\theta, \alpha)$, these parameters can be fitted by variational inference (variational dropout).
¤ The prior $p(W)$ is chosen to be the improper log-scale uniform distribution:
$$p(\log |w_{ij}|) = \mathrm{const} \iff p(|w_{ij}|) \propto \frac{1}{|w_{ij}|} \quad (8)$$
¤ With $\alpha$ fixed, variational dropout and Gaussian dropout are equivalent.
¤ This is because the KL term then no longer depends on $\theta$.
¤ In variational dropout, $\alpha$ is a learned parameter!
¤ In other words, $\alpha$ can be determined automatically during training.
¤ However, the prior work [Kingma+ 15] restricts $\alpha$ to at most 1.
¤ Too much noise makes the variance of the gradients large.
¤ Still, being able to set $\alpha$ up to infinity (= dropout rate 1) looks likely to give more interesting results.

Additive Noise Reparameterization
¤ The gradient of the lower bound with respect to $\theta_{ij}$ factorizes as
$$\frac{\partial \mathcal{L}^{SGVB}}{\partial \theta_{ij}} = \frac{\partial \mathcal{L}^{SGVB}}{\partial w_{ij}} \cdot \frac{\partial w_{ij}}{\partial \theta_{ij}} \quad (9)$$
¤ The second factor becomes very noisy as $\alpha_{ij}$ grows, because its noise term scales with $\sqrt{\alpha_{ij}}$:
$$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}), \quad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}, \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \quad (10)$$
¤ So the multiplicative noise is rewritten as an exactly equivalent additive term $\sigma_{ij}\epsilon_{ij}$, where $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$ is treated as a new independent variable. The lower bound is then optimized with respect to $(\theta, \sigma)$.
¤ Then
$$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}) = \theta_{ij} + \sigma_{ij}\epsilon_{ij}, \quad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1, \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \quad (11)$$
so the variance of the gradient is drastically reduced!
¤ This makes it possible to let $\alpha$ grow all the way to $\infty$.
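
The difference between (10) and (11) can be seen numerically. The sketch below draws the same noise for both parameterizations and compares the variance of the factor dw/dtheta; it also checks that both forms yield the same weight sample. The values alpha = 100 and theta = 0.5 and the helper names are assumptions for illustration.

```python
import math
import random

def dw_dtheta_multiplicative(alpha, eps):
    """Eq. (10): w = theta * (1 + sqrt(alpha) * eps), so dw/dtheta = 1 + sqrt(alpha) * eps."""
    return 1.0 + math.sqrt(alpha) * eps

def dw_dtheta_additive(alpha, eps):
    """Eq. (11): w = theta + sigma * eps with sigma held as its own variable, so dw/dtheta = 1."""
    return 1.0

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
alpha = 100.0
noise = [rng.gauss(0.0, 1.0) for _ in range(50000)]
var_multiplicative = variance([dw_dtheta_multiplicative(alpha, e) for e in noise])
var_additive = variance([dw_dtheta_additive(alpha, e) for e in noise])

# Both forms produce the same weight sample for the same eps (theta > 0 here).
theta = 0.5
sigma = math.sqrt(alpha) * abs(theta)
max_gap = max(abs(theta * (1 + math.sqrt(alpha) * e) - (theta + sigma * e)) for e in noise)
```

The multiplicative factor has variance close to alpha, while the additive one is constant, so its variance is exactly zero. The noise has not disappeared; it has moved out of the theta gradient path.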

About the KL term
¤ The approximations of the KL term (the regularization term) proposed in [Kingma+ 15] are accurate only for $\alpha \le 1$.
¤ This work proposes an approximation of the KL term that is valid for all values of $\alpha$.
¤ Because the log-scale uniform prior is improper, the KL term is known only up to an additive constant $C$, and the expectation in it cannot be computed analytically:
$$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) = \frac{1}{2}\log\alpha_{ij} - \mathbb{E}_{\epsilon \sim \mathcal{N}(1, \alpha_{ij})}\log|\epsilon| + C \quad (13)$$
¤ Proposed approximation, where $\sigma(\cdot)$ denotes the sigmoid function:
$$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) \approx k_1\sigma(k_2 + k_3\log\alpha_{ij}) - 0.5\log(1 + \alpha_{ij}^{-1}) + C \quad (14)$$
$$k_1 = 0.63576, \quad k_2 = 1.87320, \quad k_3 = 1.48695$$
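
The approximation (14) is easy to evaluate and to compare against a Monte Carlo estimate of (13). The sketch below does both with the constant C dropped from each side; the function names, the sample count, and the seed are illustrative assumptions.

```python
import math
import random

K1, K2, K3 = 0.63576, 1.87320, 1.48695  # constants from eq. (14)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_kl_approx(log_alpha):
    """Right-hand side of eq. (14) without the constant C."""
    return K1 * sigmoid(K2 + K3 * log_alpha) - 0.5 * math.log1p(math.exp(-log_alpha))

def neg_kl_monte_carlo(log_alpha, n_samples=100_000, seed=0):
    """Right-hand side of eq. (13) without C: 0.5 log(alpha) - E log|eps|, eps ~ N(1, alpha)."""
    rng = random.Random(seed)
    std = math.exp(0.5 * log_alpha)
    mean_log_abs = sum(math.log(abs(rng.gauss(1.0, std))) for _ in range(n_samples)) / n_samples
    return 0.5 * log_alpha - mean_log_abs

approx_at_one = neg_kl_approx(0.0)            # alpha = 1
sampled_at_one = neg_kl_monte_carlo(0.0)
values = [neg_kl_approx(x) for x in (-5.0, 0.0, 5.0)]
```

The approximate negative KL increases with log alpha and levels off near `K1`, so this regularizer rewards large alpha, which is what drives weights toward being dropped.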

Computing Sparse Variational Dropout
¤ When training the lower bound, the Local Reparameterization Trick [Kingma+ 15] is applied together with the proposed Additive Noise Reparameterization to keep the variance down.
¤ For the Local Reparameterization Trick, see the earlier reading-group slides.
¤ To improve convergence, optimization is performed with respect to $(\theta, \log\sigma^2)$.
¤ Fully-connected layer:
$$b_{mj} \sim \mathcal{N}(\gamma_{mj}, \delta_{mj}), \quad \gamma_{mj} = \sum_{i=1}^{I} a_{mi}\theta_{ij}, \quad \delta_{mj} = \sum_{i=1}^{I} a_{mi}^2\sigma_{ij}^2 \quad (17)$$
¤ Applicable not only to fully-connected layers but also to convolutional layers, because convolution is linear. For an input tensor $A_m$ and a filter with variational parameters $\theta_k$ and $\sigma_k$:
$$\mathrm{vec}(b_{mk}) \sim \mathcal{N}(\gamma_{mk}, \delta_{mk}), \quad \gamma_{mk} = \mathrm{vec}(A_m * \theta_k), \quad \delta_{mk} = \mathrm{diag}(\mathrm{vec}(A_m^2 * \sigma_k^2)) \quad (18)$$
Here $(\cdot)^2$ is element-wise, $*$ denotes convolution, and $\mathrm{vec}(\cdot)$ reshapes a matrix or tensor into a vector.
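
As a rough illustration of (17), the following sketch computes the mean and variance of the pre-activations of a fully-connected layer and samples them directly, without ever sampling the weight matrix. It uses plain Python lists rather than the Theano implementation the paper mentions; the function names and the tiny input are assumptions.

```python
import math
import random

def sparse_vd_dense_moments(A, theta, log_sigma2):
    """Return (gamma, delta) of eq. (17) for input A (M x I) and theta, log_sigma2 (I x O)."""
    M, I, O = len(A), len(theta), len(theta[0])
    gamma = [[sum(A[m][i] * theta[i][j] for i in range(I)) for j in range(O)]
             for m in range(M)]
    delta = [[sum(A[m][i] ** 2 * math.exp(log_sigma2[i][j]) for i in range(I)) for j in range(O)]
             for m in range(M)]
    return gamma, delta

def sparse_vd_dense_sample(A, theta, log_sigma2, rng):
    """Sample pre-activations b_mj ~ N(gamma_mj, delta_mj) directly, without sampling W."""
    gamma, delta = sparse_vd_dense_moments(A, theta, log_sigma2)
    return [[g + math.sqrt(d) * rng.gauss(0.0, 1.0) for g, d in zip(g_row, d_row)]
            for g_row, d_row in zip(gamma, delta)]

A = [[1.0, 2.0]]                                   # one example, two inputs
theta = [[0.5], [-1.0]]                            # two inputs, one output
log_sigma2 = [[math.log(0.04)], [math.log(0.01)]]  # parameterized as log sigma^2
gamma, delta = sparse_vd_dense_moments(A, theta, log_sigma2)
b = sparse_vd_dense_sample(A, theta, log_sigma2, random.Random(0))
```

Storing `log_sigma2` instead of sigma mirrors the choice of optimizing with respect to $(\theta, \log\sigma^2)$ and keeps the variance positive without constraints.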

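Experimental settings
¤ $\alpha$ is capped at $\log\alpha = 3$ (i.e. a dropout rate of up to about 0.95).
¤ The network is first pretrained without the proposed method.
¤ Without pretraining, the sparsity level is high but accuracy drops.
¤ This is apparently a common problem for Bayesian DNNs.
¤ The pretraining in this work lasts about 10 to 30 epochs.
¤ See the paper for the other settings.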
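
The correspondence between $\log\alpha = 3$ and a dropout rate of about 0.95 follows from inverting $\alpha = p/(1-p)$. The sketch below shows that conversion and also how a sparsity level could be read off from $(\theta, \log\sigma^2)$. The rule of counting a weight as dropped when its log alpha exceeds the threshold is an assumption made for this illustration, as are the function names and example values.

```python
import math

def dropout_rate_from_log_alpha(log_alpha):
    """Invert alpha = p / (1 - p): p = alpha / (1 + alpha) = sigmoid(log alpha)."""
    return 1.0 / (1.0 + math.exp(-log_alpha))

def log_alpha(theta, log_sigma2):
    """log alpha = log sigma^2 - log theta^2, from sigma^2 = alpha * theta^2 (theta != 0)."""
    return log_sigma2 - math.log(theta ** 2)

def sparsity(thetas, log_sigma2s, threshold=3.0):
    """Fraction of weights whose log alpha exceeds the threshold (assumed pruning rule)."""
    dropped = sum(log_alpha(t, s) > threshold for t, s in zip(thetas, log_sigma2s))
    return dropped / len(thetas)

p_at_cap = dropout_rate_from_log_alpha(3.0)
example = sparsity([1.0, 0.01, 0.5, 0.001], [-4.0, -2.0, -6.0, 0.0])
```

In the example, the two weights with tiny theta relative to their noise variance exceed the threshold, so half of the four weights count as dropped.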

Verifying Additive Noise Reparameterization
¤ Checks whether Additive Noise Reparameterization really suppresses the variance.
¤ Compares against training without the proposed reparameterization, in terms of sparsity and the value of the lower bound.
Figure 2. Original parameterization vs Additive Noise Reparameterization.
¤ The proposed method becomes sparse faster.
¤ The lower bound of the proposed method converges faster.

MNIST
¤ Train LeNet on MNIST.
¤ LeNet-300-100 (fully-connected) and LeNet-5-Caffe (convolutional)
¤ Compared with Pruning [Han+ 15], Dynamic Network Surgery (DNS) [Guo+ 16], and Soft Weight Sharing (SWS) [Ullrich+ 17].
Table 1. Comparison of different sparsity-inducing techniques (Pruning (Han et al., 2015b;a), DNS (Guo et al., 2016), SWS (Ullrich et al., 2017)) on LeNet architectures. Our method provides the highest level of sparsity with a similar accuracy.
Network         Method             Error %   Sparsity per Layer %     |W| / |W≠0|
LeNet-300-100   Original           1.64      -                        1
LeNet-300-100   Pruning            1.59      92.0, 91.0, 74.0         12
LeNet-300-100   DNS                1.99      98.2, 98.2, 94.5         56
LeNet-300-100   SWS                1.94      -                        23
LeNet-300-100   (ours) Sparse VD   1.92      98.9, 97.2, 62.0         68
LeNet-5-Caffe   Original           0.80      -                        1
LeNet-5-Caffe   Pruning            0.77      34, 88, 92.0, 81         12
LeNet-5-Caffe   DNS                0.91      86, 97, 99.3, 96         111
LeNet-5-Caffe   SWS                0.97      -                        200
LeNet-5-Caffe   (ours) Sparse VD   0.75      67, 98, 99.8, 95         280
¤ The proposed method is the sparsest on both networks.
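
The last column of Table 1 can be related to the per-layer sparsity values with a short calculation. The sketch below assumes LeNet-300-100 has weight matrices of size 784x300, 300x100, and 100x10 and ignores biases; these layer sizes and the function name are assumptions, not values taken from the slides.

```python
def compression_ratio(layer_sizes, sparsity_percent):
    """|W| / |W != 0| from per-layer weight counts and per-layer sparsity in percent."""
    total = sum(layer_sizes)
    nonzero = sum(n * (1.0 - s / 100.0) for n, s in zip(layer_sizes, sparsity_percent))
    return total / nonzero

# Assumed LeNet-300-100 weight counts (784-300-100-10, biases ignored).
lenet_300_100 = [784 * 300, 300 * 100, 100 * 10]
ratio_pruning = compression_ratio(lenet_300_100, [92.0, 91.0, 74.0])
ratio_dns = compression_ratio(lenet_300_100, [98.2, 98.2, 94.5])
ratio_sparse_vd = compression_ratio(lenet_300_100, [98.9, 97.2, 62.0])
```

Under these assumptions the three ratios come out near the tabulated 12, 56, and 68. The small gaps are presumably due to the per-layer percentages being rounded in the table. Because the first layer holds most of the weights, its sparsity dominates the overall ratio.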

CIFAR-10, CIFAR-100
¤ Train a VGG-like network [Zagoruyko+ 15] on CIFAR-10 and CIFAR-100.
¤ Experiments vary the scaling factor k of the layer widths.
¤ Accuracy is almost unchanged, with up to 65-fold sparsity (CIFAR-10).

Learning random labels
¤ [Zhang+ 16] showed that CNNs can fit even randomly assigned labels.
¤ Ordinary dropout cannot prevent this.
¤ With the proposed method (Sparse VD), training drives all weights to a single value and the network only makes a constant prediction.
¤ Moreover, sparsity reaches 100%.
¤ At 100% sparsity the weights become 0 (see Section 4.3).
¤ Does the proposed method penalize memorization and thereby promote generalization?
Figure 3. Accuracy and sparsity level for VGG-like architectures of different sizes. [...] networks were trained with Binary Dropout, and Sparse VD networks were trained [...] overall sparsity level, achieved by our method, is reported as a dashed line. The [...] sparsity level is high, especially in larger networks.
Table 2. Experiments with random labeling. Sparse Variational Dropout (Sparse VD) removes all weights from the model and fails to overfit where Binary Dropout networks (BD) learn the random labeling perfectly.
Dataset    Architecture           Train acc.   Test acc.   Sparsity
MNIST      FC + BD                1.0          0.1         -
MNIST      FC + Sparse VD         0.1          0.1         100%
CIFAR-10   VGG-like + BD          1.0          0.1         -
CIFAR-10   VGG-like + Sparse VD   0.1          0.1         100%

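Summary
¤ This work proposed a reparameterization for Variational Dropout that keeps the variance of the gradients from growing even when $\alpha$ is large.
¤ It can be combined with the Local Reparameterization Trick of [Kingma+ 15].
¤ It is also applicable to CNNs.
¤ In experiments, it achieved higher sparsity than existing methods.
¤ Furthermore, the problem that DNNs easily fit randomly labeled data was shown not to arise with this method.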
