
# (DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

2017/02/24 DL Reading Group



1. Variational Dropout Sparsifies Deep Neural Networks (2017/03/24, Masahiro Suzuki (鈴木雅大))
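2. About this paper
   - Authors: Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov
     - Skolkovo Institute of Science and Technology; National Research University Higher School of Economics; Moscow Institute of Physics and Technology
     - Submitted to ICML 2017 (arXiv, 2017/2/27)
   - Proposes a method that lets the dropout rate of Bayesian dropout be pushed all the way to its maximum.
     - The idea itself is extremely simple.
     - Besides producing high sparsity, it resolves a known generalization problem of ordinary CNNs.
   - Why this paper was chosen
     - It extends [Kingma+ 15], which was covered in this reading group two years ago.
     - A simple idea yielding a large effect is appealing.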
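3. Bayesian inference
   - Suppose we observe data $D = \{(x_n, y_n)\}_{n=1}^{N}$.
   - The goal is to obtain $p(y \mid x, w) = p(D \mid w)$.
   - In the Bayesian learning framework, prior knowledge about the parameters $w$ is expressed as a prior $p(w)$.
   - After observing $D$, the posterior distribution of $w$ is

     $$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} = \frac{p(D \mid w)\, p(w)}{\int p(D \mid w)\, p(w)\, dw}$$
   - This procedure is called Bayesian inference.
   - Computing the posterior requires the marginalization in the denominator, which motivates variational inference.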
4. Variational inference
   - Introduce an approximate distribution $q(w \mid \phi)$ and minimize its divergence from the true posterior, $D_{KL}[q(w \mid \phi) \,\|\, p(w \mid D)]$.
   - This is equivalent to maximizing the following variational lower bound:

     $$\mathcal{L}(\phi) = L_D(\phi) - D_{KL}(q_\phi(w) \,\|\, p(w)) \to \max_{\phi} \tag{1}$$

     $$L_D(\phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(w)}\left[\log p(y_n \mid x_n, w)\right] \tag{2}$$
   - The bound consists of the expected log-likelihood $L_D(\phi)$ and the KL divergence, which acts as a regularization term.
   - With the reparameterization trick [Kingma & Welling, 2013], the lower bound becomes differentiable with respect to $\phi$.
   - For a minibatch $(\tilde{x}_m, \tilde{y}_m)_{m=1}^{M}$, unbiased estimators of the lower bound and of its gradient are

     $$\mathcal{L}(\phi) \simeq \mathcal{L}^{SGVB}(\phi) = L_D^{SGVB}(\phi) - D_{KL}(q_\phi(w) \,\|\, p(w)) \tag{3}$$

     $$L_D(\phi) \simeq L_D^{SGVB}(\phi) = \frac{N}{M} \sum_{m=1}^{M} \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m)) \tag{4}$$

     $$\nabla_\phi L_D(\phi) \simeq \frac{N}{M} \sum_{m=1}^{M} \nabla_\phi \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m)) \tag{5}$$

     where $w = f(\phi, \epsilon)$ is a deterministic differentiable function of a non-parametric noise $\epsilon \sim p(\epsilon)$.
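As a rough illustration of estimator (5), the sketch below uses a toy model that is not in the slides: a single scalar weight with $q_\phi(w) = \mathcal{N}(\mu, \sigma^2)$, the reparameterization $f(\phi, \epsilon) = \mu + \sigma\epsilon$, and a unit-variance Gaussian likelihood $\log p(y \mid x, w) = -\tfrac{1}{2}(y - wx)^2 + \text{const}$. All function names and the data generator are hypothetical. Because this toy model is linear in $w$, the exact gradient with respect to $\mu$ is available in closed form, so the unbiasedness claim can be checked numerically.

```python
import random


def f(mu, sigma, eps):
    """Reparameterization w = f(phi, eps) with phi = (mu, sigma)."""
    return mu + sigma * eps


def grad_mu_estimate(data, mu, sigma, batch_size, rng):
    """Minibatch estimate of dL_D/dmu, following the form of Eq. (5)."""
    batch = rng.sample(data, batch_size)
    total = 0.0
    for x, y in batch:
        w = f(mu, sigma, rng.gauss(0.0, 1.0))  # one noise draw per example
        total += (y - w * x) * x  # d log p / dw, times dw/dmu = 1
    return len(data) / batch_size * total


def grad_mu_exact(data, mu):
    """Closed-form dL_D/dmu; possible only because the toy model is linear in w."""
    return sum((y - mu * x) * x for x, y in data)


def make_toy_data(n, rng):
    """Hypothetical data: y = 2x plus small Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.1)) for x in xs]


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_toy_data(20, rng)
    estimates = [grad_mu_estimate(data, 0.5, 0.3, 5, rng) for _ in range(20000)]
    # The average of many noisy estimates should approach the exact gradient.
    print(sum(estimates) / len(estimates), grad_mu_exact(data, 0.5))
```

Averaging many minibatch estimates should approach the closed-form value, which is what "unbiased" means here; a single estimate can still be far off.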
5. Dropout
   - In a fully connected layer $B = AW$, dropout injects multiplicative random noise $\Xi$ into the layer input at every training iteration:

     $$B = (A \odot \Xi) W, \quad \xi_{mi} \sim p(\xi) \tag{6}$$
   - The noise is drawn from a Bernoulli distribution, $\xi_{mi} \sim \mathrm{Bernoulli}(1 - p)$ [Hinton+ 12], or from a Gaussian distribution, $\xi_{mi} \sim \mathcal{N}(1, \alpha)$ with $\alpha = \frac{p}{1-p}$ [Srivastava+ 14], where $p$ is the dropout rate.
   - Putting Gaussian noise on $W$ is equivalent to sampling $W$ from $q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2)$.
   - The random variable $w$ is then parameterized by $\theta$:

     $$w_{ij} = \theta_{ij}\xi_{ij} = \theta_{ij}(1 + \sqrt{\alpha}\,\epsilon_{ij}) \sim \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2), \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \tag{7}$$
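The equivalence in Eq. (7) can be checked empirically. The following is a minimal sketch with hypothetical helper names: it draws $w = \theta\xi$ with $\xi \sim \mathcal{N}(1, \alpha)$ and compares the sample mean and variance against $\theta$ and $\alpha\theta^2$. It also shows the conversion $\alpha = p/(1-p)$ from a binary dropout rate.

```python
import math
import random


def alpha_from_rate(p):
    """Gaussian-dropout variance matching a binary dropout rate p."""
    return p / (1.0 - p)


def sample_weight(theta, alpha, rng):
    """Draw w = theta * xi with multiplicative noise xi ~ N(1, alpha)."""
    xi = rng.gauss(1.0, math.sqrt(alpha))  # gauss takes a standard deviation
    return theta * xi


def mean_and_variance(values):
    """Plain sample mean and (population) variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


if __name__ == "__main__":
    rng = random.Random(0)
    theta, alpha = -2.0, alpha_from_rate(0.2)  # p = 0.2 gives alpha = 0.25
    samples = [sample_weight(theta, alpha, rng) for _ in range(100000)]
    mean, var = mean_and_variance(samples)
    # Compare with the moments of N(theta, alpha * theta^2).
    print(mean, theta)
    print(var, alpha * theta ** 2)
```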
6. Variational dropout
   - If $q(W \mid \theta, \alpha)$ is regarded as an approximate posterior with parameters $\phi = (\theta, \alpha)$, these parameters can be fitted by variational inference (variational dropout).
   - The prior is the improper log-uniform distribution, the only prior that makes variational inference consistent with Gaussian dropout [Kingma+ 15]:

     $$p(\log |w_{ij}|) = \text{const} \;\Leftrightarrow\; p(|w_{ij}|) \propto \frac{1}{|w_{ij}|} \tag{8}$$
   - With $\alpha$ fixed, variational dropout and Gaussian dropout are equivalent.
     - The KL term no longer depends on $\theta$, so it is a constant during optimization.
   - In variational dropout, $\alpha$ is itself a learnable parameter.
     - That is, $\alpha$ can be determined automatically during training.
   - However, the earlier work [Kingma+ 15] restricts $\alpha$ to at most 1.
     - With too much noise, the variance of the gradients becomes large.
     - Still, allowing $\alpha$ to grow to infinity (a dropout rate of 1) seems likely to give more interesting results.
7. Additive Noise Reparameterization
   - In the gradient of the lower bound,

     $$\frac{\partial \mathcal{L}^{SGVB}}{\partial \theta_{ij}} = \frac{\partial \mathcal{L}^{SGVB}}{\partial w_{ij}} \cdot \frac{\partial w_{ij}}{\partial \theta_{ij}} \tag{9}$$

     the second factor becomes very noisy as $\alpha$ grows, because under the original parameterization

     $$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}), \quad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}, \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \tag{10}$$

     and the term $\sqrt{\alpha_{ij}}\,\epsilon_{ij}$ grows with $\alpha$.
   - The proposal is to rewrite the multiplicative noise as an exactly equivalent additive noise, treating $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$ as a new independent variable:

     $$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}) = \theta_{ij} + \sigma_{ij}\,\epsilon_{ij}, \quad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1, \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \tag{11}$$
   - Since $\partial w_{ij} / \partial \theta_{ij} = 1$, the variance of the gradient is reduced drastically.
   - The lower bound is then optimized with respect to $(\theta, \sigma)$, which makes it possible to let $\alpha$ grow all the way to $\infty$.
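To make the variance argument concrete, the sketch below (hypothetical function names, a single scalar weight with $\theta > 0$) samples the factor $\partial w / \partial \theta$ under both parameterizations. Under Eq. (10) its variance is $\alpha$; under Eq. (11) it is identically 1, so its variance is zero, while the sampled weight itself is unchanged for the same $\epsilon$.

```python
import math
import random


def sample_multiplicative(theta, alpha, eps):
    """Original parameterization, Eq. (10): returns (w, dw/dtheta)."""
    factor = 1.0 + math.sqrt(alpha) * eps
    return theta * factor, factor


def sample_additive(theta, sigma, eps):
    """Additive parameterization, Eq. (11): returns (w, dw/dtheta)."""
    return theta + sigma * eps, 1.0


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def factor_variances(theta, alpha, n_samples, rng):
    """Empirical variance of dw/dtheta under both parameterizations."""
    sigma = math.sqrt(alpha) * theta  # sigma^2 = alpha * theta^2, theta > 0 assumed
    mult, add = [], []
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        mult.append(sample_multiplicative(theta, alpha, eps)[1])
        add.append(sample_additive(theta, sigma, eps)[1])
    return variance(mult), variance(add)


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (0.1, 1.0, 100.0):
        print(alpha, factor_variances(0.7, alpha, 50000, rng))
```

In the additive form, $\sigma$ is held fixed when differentiating with respect to $\theta$, which is exactly why the noisy factor disappears.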
8. The KL term
   - With the log-uniform prior, the KL term (the regularizer) can only be computed up to an additive constant $C$, and it contains an expectation with no closed form:

     $$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) = \frac{1}{2}\log\alpha_{ij} - \mathbb{E}_{\epsilon \sim \mathcal{N}(1, \alpha_{ij})} \log|\epsilon| + C \tag{13}$$
   - The approximations of this term proposed in [Kingma+ 15] are accurate only for $\alpha \le 1$.
   - This work proposes an approximation that is tight for all values of $\alpha$, where $\sigma(\cdot)$ denotes the sigmoid function:

     $$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) \approx k_1\,\sigma(k_2 + k_3\log\alpha_{ij}) - 0.5\log(1 + \alpha_{ij}^{-1}) + C \tag{14}$$

     $$k_1 = 0.63576, \quad k_2 = 1.87320, \quad k_3 = 1.48695$$
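The approximation (14) is easy to compare against a sampled version of (13). The sketch below is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the constant is set to $C = -k_1$ in both expressions, which is the choice that makes the KL divergence go to zero as $\alpha \to \infty$.

```python
import math
import random

K1, K2, K3 = 0.63576, 1.87320, 1.48695


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def neg_kl_approx(log_alpha):
    """Eq. (14) with C = -k1; log(1 + 1/alpha) is written via log1p."""
    return (K1 * sigmoid(K2 + K3 * log_alpha)
            - 0.5 * math.log1p(math.exp(-log_alpha))
            - K1)


def neg_kl_monte_carlo(log_alpha, n_samples, rng):
    """Eq. (13) with the same C, estimating E log|eps| by sampling."""
    std = math.exp(0.5 * log_alpha)  # eps ~ N(1, alpha), so std = sqrt(alpha)
    total = 0.0
    for _ in range(n_samples):
        total += math.log(abs(rng.gauss(1.0, std)))
    return 0.5 * log_alpha - total / n_samples - K1


if __name__ == "__main__":
    rng = random.Random(0)
    for log_alpha in (-2.0, 0.0, 3.0, 8.0):
        print(log_alpha,
              neg_kl_approx(log_alpha),
              neg_kl_monte_carlo(log_alpha, 200000, rng))
```

The two columns should agree closely for small and large $\alpha$ alike, which is the property that the earlier approximations lacked for $\alpha > 1$.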
9. Computing Sparse Variational Dropout
   - When optimizing the lower bound, the proposed Additive Noise Reparameterization is combined with the Local Reparameterization Trick [Kingma+ 15] to keep the variance low.
     - See the earlier reading-group slides for the Local Reparameterization Trick.
     - To improve convergence, optimization is performed with respect to $(\theta, \log\sigma^2)$.
   - For a fully connected layer, the pre-activations are sampled directly:

     $$b_{mj} \sim \mathcal{N}(\gamma_{mj}, \delta_{mj}), \quad \gamma_{mj} = \sum_{i=1}^{I} a_{mi}\theta_{ij}, \quad \delta_{mj} = \sum_{i=1}^{I} a_{mi}^2\sigma_{ij}^2 \tag{17}$$
   - The method applies to convolutional layers as well as fully connected ones. For an input tensor $A_m$ and a filter with variational parameters $\theta_k$ and $\sigma_k$, with $(\cdot)^2$ element-wise, $*$ the convolution, and $\mathrm{vec}(\cdot)$ reshaping into a vector:

     $$\mathrm{vec}(b_{mk}) \sim \mathcal{N}(\gamma_{mk}, \delta_{mk}), \quad \gamma_{mk} = \mathrm{vec}(A_m * \theta_k), \quad \delta_{mk} = \mathrm{diag}(\mathrm{vec}(A_m^2 * \sigma_k^2)) \tag{18}$$
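Eq. (17) translates almost line by line into code. The following sketch is a hypothetical, dependency-free forward pass for one fully connected layer, written with nested lists for clarity rather than speed; the paper's reference implementation uses Theano, which this does not reproduce. The layer is parameterized by `theta` and `log_sigma2`, matching the statement that optimization is done with respect to $(\theta, \log\sigma^2)$.

```python
import math
import random


def sparse_vd_linear(a, theta, log_sigma2, rng):
    """Sample pre-activations b (M x O) for inputs a (M x I), per Eq. (17).

    theta and log_sigma2 are I x O nested lists. Noise is drawn per output
    unit instead of per weight (Local Reparameterization Trick).
    """
    n_in, n_out = len(theta), len(theta[0])
    out = []
    for row_in in a:
        row_out = []
        for j in range(n_out):
            # Mean and variance of the Gaussian pre-activation b_mj.
            gamma = sum(row_in[i] * theta[i][j] for i in range(n_in))
            delta = sum(row_in[i] ** 2 * math.exp(log_sigma2[i][j])
                        for i in range(n_in))
            row_out.append(gamma + math.sqrt(delta) * rng.gauss(0.0, 1.0))
        out.append(row_out)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    a = [[1.0, 2.0]]
    theta = [[0.5], [-1.0]]
    log_sigma2 = [[math.log(0.04)], [math.log(0.09)]]
    # Here gamma = 0.5 - 2.0 = -1.5 and delta = 0.04 + 4 * 0.09 = 0.4.
    draws = [sparse_vd_linear(a, theta, log_sigma2, rng)[0][0]
             for _ in range(50000)]
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(mean, var)
```

Sampling one Gaussian per output unit, rather than one per weight, is what keeps the estimator's variance low and the cost comparable to an ordinary layer.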
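10. Experimental setup
    - $\alpha$ is capped at $\log\alpha = 3$ (a dropout rate of up to about 0.95).
    - The network is first pre-trained without the proposed method.
      - Without pre-training, the sparsity level is high but accuracy drops.
      - This appears to be a common problem for Bayesian DNNs.
      - Pre-training in this work ran for roughly 10 to 30 epochs.
    - See the paper for the remaining settings.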
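The correspondence between $\log\alpha = 3$ and a dropout rate of about 0.95 follows from inverting $\alpha = p/(1-p)$, which gives $p = \alpha/(1+\alpha)$. The sketch below shows that conversion and, as an assumption of this example rather than something stated on the slide, uses the same value as a pruning threshold to measure sparsity from $(\theta, \log\sigma^2)$. The small constant inside the logarithm is a hypothetical guard against $\theta = 0$.

```python
import math


def dropout_rate(log_alpha):
    """Binary dropout rate p = alpha / (1 + alpha) matching a given log alpha."""
    alpha = math.exp(log_alpha)
    return alpha / (1.0 + alpha)


def log_alpha(theta, log_sigma2, eps=1e-8):
    """log alpha = log sigma^2 - log theta^2, from sigma^2 = alpha * theta^2."""
    return log_sigma2 - math.log(theta * theta + eps)


def sparsity(thetas, log_sigma2s, threshold=3.0):
    """Fraction of weights whose log alpha exceeds the threshold."""
    pruned = sum(1 for t, s in zip(thetas, log_sigma2s)
                 if log_alpha(t, s) > threshold)
    return pruned / len(thetas)


if __name__ == "__main__":
    print(dropout_rate(3.0))  # roughly 0.95
    thetas = [1.0, 0.01, 0.5, 0.001]
    log_sigma2s = [-4.0, -4.0, -1.0, -2.0]
    print(sparsity(thetas, log_sigma2s))
```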
11. Validating Additive Noise Reparameterization
    - Checks whether Additive Noise Reparameterization actually suppresses the variance.
    - Compares training with and without the proposed reparameterization in terms of sparsity and the value of the variational lower bound.
    - [Figure 2: Original parameterization vs Additive Noise Reparameterization.]
      - With the proposed method, the network becomes sparse faster.
      - With the proposed method, the lower bound converges faster.
12. MNIST
    - LeNet architectures are trained on MNIST.
      - LeNet-300-100 (fully connected) and LeNet-5-Caffe (convolutional)
    - Compared against Pruning [Han+ 15], Dynamic Network Surgery (DNS) [Guo+ 16], and Soft Weight Sharing (SWS) [Ullrich+ 17].
    - Table 1: Comparison of different sparsity-inducing techniques on LeNet architectures. The proposed method provides the highest level of sparsity with a similar accuracy.

    | Network | Method | Error % | Sparsity per layer % | Compression $\lvert W \rvert / \lvert W_{\neq 0} \rvert$ |
    |---|---|---|---|---|
    | LeNet-300-100 | Original | 1.64 | | 1 |
    | LeNet-300-100 | Pruning | 1.59 | 92.0, 91.0, 74.0 | 12 |
    | LeNet-300-100 | DNS | 1.99 | 98.2, 98.2, 94.5 | 56 |
    | LeNet-300-100 | SWS | 1.94 | | 23 |
    | LeNet-300-100 | Sparse VD (ours) | 1.92 | 98.9, 97.2, 62.0 | 68 |
    | LeNet-5-Caffe | Original | 0.80 | | 1 |
    | LeNet-5-Caffe | Pruning | 0.77 | 34, 88, 92.0, 81 | 12 |
    | LeNet-5-Caffe | DNS | 0.91 | 86, 97, 99.3, 96 | 111 |
    | LeNet-5-Caffe | SWS | 0.97 | | 200 |
    | LeNet-5-Caffe | Sparse VD (ours) | 0.75 | 67, 98, 99.8, 95 | 280 |

    - On both architectures, the proposed method is the sparsest.
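The last column of Table 1 can be related to the per-layer sparsities with a short calculation. The sketch below assumes the usual LeNet-300-100 layout for MNIST (784-300-100-10) and counts weight matrices only, ignoring biases. Both are assumptions of this example, since the slides do not state the layer sizes.

```python
def compression_ratio(layer_sizes, sparsity_percent):
    """Total weights divided by remaining non-zero weights."""
    total = sum(layer_sizes)
    nonzero = sum(n * (1.0 - s / 100.0)
                  for n, s in zip(layer_sizes, sparsity_percent))
    return total / nonzero


# Assumed weight counts per layer for LeNet-300-100 on MNIST (biases ignored).
LENET_300_100 = [784 * 300, 300 * 100, 100 * 10]

# Per-layer sparsity percentages as listed in Table 1.
TABLE_1 = {
    "Pruning": [92.0, 91.0, 74.0],
    "DNS": [98.2, 98.2, 94.5],
    "Sparse VD": [98.9, 97.2, 62.0],
}

if __name__ == "__main__":
    for method, sparsities in TABLE_1.items():
        print(method, round(compression_ratio(LENET_300_100, sparsities), 1))
```

Under these assumptions the ratios come out near 12, 55 and 70, against the reported 12, 56 and 68. The small gaps are plausibly due to rounding of the per-layer percentages and to how biases are counted, though the slides do not say.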
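13. CIFAR-10, CIFAR-100
    - A VGG-like network [Zagoruyko+ 15] is trained on CIFAR-10 and CIFAR-100.
    - Experiments vary the scaling factor $k$ applied to the layer widths.
    - Accuracy stays nearly the same while sparsity reaches up to 65x (CIFAR-10).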
14. Learning random labels
    - [Zhang+ 16] showed that CNNs can fit even randomly assigned labels.
      - Standard dropout does not fix this problem.
    - With the proposed method (Sparse VD), training drove all weights to a single value and the network produced only a constant prediction.
      - Moreover, sparsity reached 100%.
      - At 100% sparsity, all weights are zero (see Section 4.3 of the paper).
    - Does the proposed method penalize memorization and thereby encourage generalization?
    - [Figure 3: Accuracy and sparsity level for VGG-like architectures of different sizes. The sparsity level is high, especially in larger networks.]
    - Table 2: Experiments with random labeling. Sparse Variational Dropout (Sparse VD) removes all weights from the model and fails to overfit, whereas Binary Dropout (BD) networks learn the random labeling perfectly.

    | Dataset | Architecture | Train acc. | Test acc. | Sparsity |
    |---|---|---|---|---|
    | MNIST | FC + BD | 1.0 | 0.1 | - |
    | MNIST | FC + Sparse VD | 0.1 | 0.1 | 100% |
    | CIFAR-10 | VGG-like + BD | 1.0 | 0.1 | - |
    | CIFAR-10 | VGG-like + Sparse VD | 0.1 | 0.1 | 100% |
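15. Summary
    - This work proposes a reparameterization for Variational Dropout under which the gradient variance does not blow up even when $\alpha$ is large.
      - It can be combined with the Local Reparameterization Trick of [Kingma+ 15].
      - It is applicable to CNNs as well.
    - Experiments show that it achieves higher sparsity than existing methods.
    - In addition, the problem that DNNs easily fit randomly labeled data does not occur with this method.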