
(DL Hacks reading group) Bayesian Neural Network

Reading group date: 2017/01/27
This is less a paper reading than a summary of related work.

Published in: Technology

  1. 1. Bayesian Neural Network 2017/01/27
  2. 2. • Bayesian neural network • NIPS (http://bayesiandeeplearning.org) • Bayesian neural network • Stan, PyMC3, Edward
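  4. 4. • Data $\mathcal{D}$, model $m$, parameters $w$
      • Prior: $p(w \mid m)$
      • Likelihood: $p(\mathcal{D} \mid w, m)$
      • Posterior: $p(w \mid \mathcal{D}, m)$
      • Evidence (marginal likelihood): $p(\mathcal{D} \mid m)$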
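  5. 5. • Data set $\mathcal{D}$
      • Inputs only, $\mathcal{D} = \{x_i\} = X$:
        $$p(\mathcal{D} \mid w) = \prod_{i=1}^{N} p(x_i \mid w) = p(X \mid w)$$
      • Inputs and targets, $\mathcal{D} = \{(x_i, y_i)\} = (X, y)$:
        $$p(\mathcal{D} \mid w) = \prod_{i=1}^{N} p(y_i \mid x_i, w) = p(y \mid X, w)$$
      • Bayes' theorem:
        $$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid w)\, p(w)}{\int p(\mathcal{D} \mid w)\, p(w)\, dw}$$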
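  6. 6. MLE and MAP estimation
      $$w_{\mathrm{MLE}} = \arg\max_{w} \log p(\mathcal{D} \mid w) = \arg\max_{w} \sum_{i} \log p(x_i \mid w)$$
      $$w_{\mathrm{MAP}} = \arg\max_{w} \log p(w \mid \mathcal{D}) = \arg\max_{w} \left[ \log p(\mathcal{D} \mid w) + \log p(w) \right]$$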
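The difference between the two estimates on slide 6 can be seen on a toy problem. The sketch below assumes a model that is not in the slides: observations $x_i \sim \mathcal{N}(w, \texttt{noise\_var})$ with a prior $w \sim \mathcal{N}(0, \texttt{prior\_var})$, for which both maximizers have closed forms. The function name `mle_and_map` and its parameters are illustrative only.

```python
import numpy as np

def mle_and_map(x, noise_var=1.0, prior_var=1.0):
    """Estimate the mean w of a Gaussian with known variance.

    MLE maximizes log p(D|w); MAP maximizes log p(D|w) + log p(w)
    under the assumed prior w ~ N(0, prior_var).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # argmax_w sum_i log N(x_i | w, noise_var) is the sample mean
    w_mle = x.mean()
    # argmax_w [sum_i log N(x_i | w, noise_var) + log N(w | 0, prior_var)]
    w_map = (x.sum() / noise_var) / (n / noise_var + 1.0 / prior_var)
    return w_mle, w_map

w_mle, w_map = mle_and_map([2.0, 4.0])
print(w_mle, w_map)
```

With only two observations the prior pulls the MAP estimate toward zero; as more data arrive the two estimates converge.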
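  7. 7. • Predictive distribution for a new input $x^*$ (and target $y^*$), integrating out $w$:
      $$p(x^* \mid \mathcal{D}) = \int p(x^* \mid w)\, p(w \mid \mathcal{D})\, dw = \mathbb{E}_{p(w \mid \mathcal{D})}\left[ p(x^* \mid w) \right]$$
      $$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, w)\, p(w \mid \mathcal{D})\, dw = \mathbb{E}_{p(w \mid \mathcal{D})}\left[ p(y^* \mid x^*, w) \right]$$
  8. 8. $$p(x^* \mid \mathcal{D}) = \int p(x^* \mid w)\, p(w \mid \mathcal{D})\, dw = \mathbb{E}_{p(w \mid \mathcal{D})}\left[ p(x^* \mid w) \right]$$
      $$p(m \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid m)\, p(m)}{p(\mathcal{D})}$$
      $$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid w)\, p(w)}{\int p(\mathcal{D} \mid w)\, p(w)\, dw}$$
  9. 9. $$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid w)\, p(w)}{\int p(\mathcal{D} \mid w)\, p(w)\, dw}$$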
  10. 10. Two approaches: 1. MCMC 2. Variational inference
       1. MCMC: $p(w \mid \mathcal{D})$
       2. Variational inference: $p(w \mid \theta)$, $q(w)$
  11. 11. • A prior $p(w)$ is placed on the weights $w$
       • NN vs. Bayesian NN:
       $$p(y^* \mid \mathcal{D}, x^*, \theta) = \int p(y^* \mid x^*, w)\, p(w \mid \mathcal{D}, \theta)\, dw$$
       [Figure from "Weight Uncertainty in Neural Networks" [Blundell+ 2015]. Figure 1. Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.]
  12. 12. • "The Importance of Knowing What We Don't Know" (by Yarin Gal)
       • WHY SHOULD WE CARE? (Zoubin Ghahramani)
         Calibrated model and prediction uncertainty: getting systems that know when they don't know.
         Automatic model complexity control and structure learning (Bayesian Occam's Razor).
         Figure from Yarin Gal's thesis "Uncertainty in Deep Learning" (2016).
       • http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
  13. 13. • Stochastic neural network • VAE • https://jmhl.org/research/
  14. 14. • http://evelinag.com/blog/2014/09-15-introducing-ariadne/#.WI8X7LaLTEY
  15. 15. Bayesian linear regression (excerpts from PRML)

       3.1.5 Multiple outputs
       So far, we have considered the case of a single target variable $t$. In some applications, we may wish to predict $K > 1$ target variables, which we denote collectively by the target vector $\mathbf{t}$. This could be done by introducing a different set of basis functions for each component of $\mathbf{t}$, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that
       $$\mathbf{y}(\mathbf{x}, \mathbf{w}) = \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \tag{3.31}$$
       where $\mathbf{y}$ is a $K$-dimensional column vector, $\mathbf{W}$ is an $M \times K$ matrix of parameters, and $\boldsymbol{\phi}(\mathbf{x})$ is an $M$-dimensional column vector with elements $\phi_j(\mathbf{x})$, with $\phi_0(\mathbf{x}) = 1$ as before. Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form
       $$p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \beta^{-1} \mathbf{I}). \tag{3.32}$$
       If we have a set of observations $\mathbf{t}_1, \ldots, \mathbf{t}_N$, we can combine these into a matrix $\mathbf{T}$ of size $N \times K$ such that the $n$th row is given by $\mathbf{t}_n^{\mathrm{T}}$. Similarly, we can combine the input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ into a matrix $\mathbf{X}$. The log likelihood function is then given by
       $$\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{t}_n \mid \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1} \mathbf{I}) = \frac{NK}{2} \ln\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2} \sum_{n=1}^{N} \left\| \mathbf{t}_n - \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\|^2. \tag{3.33}$$

       3.3 Bayesian Linear Regression
       Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian. We can evaluate this distribution by the usual procedure of completing the square in the exponential, and then finding the normalization coefficient using the standard result for a normalized Gaussian. However, we have already done the necessary work in deriving the general result (2.116), which allows us to write down the posterior distribution directly in the form
       $$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \tag{3.49}$$
       where
       $$\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \right) \tag{3.50}$$
       $$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}. \tag{3.51}$$
       Note that because the posterior distribution is Gaussian, its mode coincides with its mean. Thus the maximum posterior weight vector is simply given by $\mathbf{w}_{\mathrm{MAP}} = \mathbf{m}_N$. If we consider an infinitely broad prior $\mathbf{S}_0 = \alpha^{-1} \mathbf{I}$ with $\alpha \to 0$, the mean $\mathbf{m}_N$ of the posterior distribution reduces to the maximum likelihood value $\mathbf{w}_{\mathrm{ML}}$ given by (3.15). Similarly, if $N = 0$, then the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point, such that the new posterior distribution is again given by (3.49).
       For the remainder of this chapter, we shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we consider a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$ so that
       $$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}) \tag{3.52}$$
       and the corresponding posterior distribution over $\mathbf{w}$ is then given by (3.49) with
       $$\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \tag{3.53}$$
       $$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}. \tag{3.54}$$
       The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of $\mathbf{w}$, takes the form
       $$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 - \frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w} + \text{const}. \tag{3.55}$$
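The posterior in (3.53) and (3.54) is two lines of linear algebra. The following is a minimal sketch, assuming a hypothetical helper name `blr_posterior`, a design matrix `Phi` with one row per observation, and toy data generated from $t = 1 + 2x$; none of these appear in the slides.

```python
import numpy as np

def blr_posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the prior N(w | 0, alpha^{-1} I).

    Implements S_N^{-1} = alpha I + beta Phi^T Phi   (3.54)
    and        m_N = beta S_N Phi^T t                (3.53).
    """
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Toy data from t = 1 + 2x with basis functions phi(x) = (1, x)
x = np.array([0.0, 1.0, 2.0])
Phi = np.column_stack([np.ones_like(x), x])
t = np.array([1.0, 3.0, 5.0])

# A nearly flat prior (alpha -> 0) recovers the maximum likelihood weights
m_N, S_N = blr_posterior(Phi, t, alpha=1e-6, beta=25.0)
print(m_N)
```

For large or ill-conditioned problems one would solve the linear system rather than form the inverse; the explicit inverse is kept here only to mirror the equations.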
  16. 16. Predictive distribution (excerpt from PRML)

       3.3.2 Predictive distribution
       In practice, we are not usually interested in the value of $\mathbf{w}$ itself but rather in making predictions of $t$ for new values of $\mathbf{x}$. This requires that we evaluate the predictive distribution defined by
       $$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} \tag{3.57}$$
       in which $\mathbf{t}$ is the vector of target values from the training set, and we have omitted the corresponding input vectors from the right-hand side of the conditioning statements to simplify the notation. The conditional distribution $p(t \mid \mathbf{x}, \mathbf{w}, \beta)$ of the target variable is given by (3.8), and the posterior weight distribution is given by (3.49). We see that (3.57) involves the convolution of two Gaussian distributions, and so making use of the result (2.115) from Section 8.1.4, we see that the predictive distribution takes the form
       $$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x})) \tag{3.58}$$
       where the variance $\sigma_N^2(\mathbf{x})$ of the predictive distribution is given by
       $$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}). \tag{3.59}$$
       The first term in (3.59) represents the noise on the data whereas the second term reflects the uncertainty associated with the parameters $\mathbf{w}$. Because the noise process and the distribution of $\mathbf{w}$ are independent Gaussians, their variances are additive. Note that, as additional data points are observed, the posterior distribution becomes narrower. As a consequence it can be shown (Qazaz et al., 1997) that $\sigma_{N+1}^2(\mathbf{x}) \leqslant \sigma_N^2(\mathbf{x})$. In the limit $N \to \infty$, the second term in (3.59) goes to zero, and the variance of the predictive distribution arises solely from the additive noise governed by the parameter $\beta$.
       As an illustration of the predictive distribution for Bayesian linear regression models, let us return to the synthetic sinusoidal data set of Section 1.1.
       [Figure 3.8: Examples of the predictive distribution (3.58) for a model consisting of 9 Gaussian basis functions of the form (3.4) using the synthetic sinusoidal data set of Section 1.1. Four panels, axes $x$ and $t$. See the text for a detailed discussion.]
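The statement that the predictive variance (3.59) shrinks as data are added can be checked numerically. This is a sketch under assumptions that are not in the slides: a cubic polynomial basis instead of the Gaussian basis of Figure 3.8, noiseless sinusoidal targets, and the hypothetical helper names `features`, `posterior` and `predictive`.

```python
import numpy as np

def features(x):
    """Polynomial basis phi(x) = (1, x, x^2, x^3); an assumed choice of basis."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**j for j in range(4)], axis=-1)

def posterior(Phi, t, alpha, beta):
    """m_N and S_N from (3.53) and (3.54)."""
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    return beta * S_N @ Phi.T @ t, S_N

def predictive(phi_x, m_N, S_N, beta):
    """Mean and variance of (3.58); the variance is (3.59)."""
    mean = phi_x @ m_N
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var

alpha, beta = 2.0, 25.0
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x)
phi_star = features(0.5)  # query point x* = 0.5

# Predictive variance at x* after seeing 2 points, then all 20 points
_, var_few = predictive(phi_star, *posterior(features(x[:2]), t[:2], alpha, beta), beta)
_, var_all = predictive(phi_star, *posterior(features(x), t, alpha, beta), beta)
print(var_few, var_all, 1.0 / beta)
```

The variance never drops below the noise floor $1/\beta$, which is the first term of (3.59).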
  18. 18. Laplace approximation: MAP estimate $w_{\mathrm{MAP}}$, then second derivatives (excerpt from PRML, 5.7 Bayesian Neural Networks)

       We can then make use of the evidence framework to provide point estimates for the hyperparameters and to compare alternative models (for example, networks having different numbers of hidden units). To start with, we shall discuss the regression case and then later consider the modifications needed for solving classification tasks.

       5.7.1 Posterior parameter distribution
       Consider the problem of predicting a single continuous target variable $t$ from a vector $\mathbf{x}$ of inputs (the extension to multiple targets is straightforward). We shall suppose that the conditional distribution $p(t \mid \mathbf{x})$ is Gaussian, with an $\mathbf{x}$-dependent mean given by the output of a neural network model $y(\mathbf{x}, \mathbf{w})$, and with precision (inverse variance) $\beta$
       $$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}). \tag{5.161}$$
       Similarly, we shall choose a prior distribution over the weights $\mathbf{w}$ that is Gaussian of the form
       $$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}). \tag{5.162}$$
       For an i.i.d. data set of $N$ observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$, with a corresponding set of target values $\mathcal{D} = \{t_1, \ldots, t_N\}$, the likelihood function is given by
       $$p(\mathcal{D} \mid \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}) \tag{5.163}$$
       and so the resulting posterior distribution is then
       $$p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta) \propto p(\mathbf{w} \mid \alpha)\, p(\mathcal{D} \mid \mathbf{w}, \beta) \tag{5.164}$$
       which, as a consequence of the nonlinear dependence of $y(\mathbf{x}, \mathbf{w})$ on $\mathbf{w}$, will be non-Gaussian.
       We can find a Gaussian approximation to the posterior distribution by using the Laplace approximation. To do this, we must first find a (local) maximum of the posterior [...] in the form
       $$\ln p(\mathbf{w} \mid \mathcal{D}) = -\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w} - \frac{\beta}{2} \sum_{n=1}^{N} \left\{ y(\mathbf{x}_n, \mathbf{w}) - t_n \right\}^2 + \text{const} \tag{5.165}$$
       which corresponds to a regularized sum-of-squares error function. Assuming for the moment that $\alpha$ and $\beta$ are fixed, we can find a maximum of the posterior, which we denote $\mathbf{w}_{\mathrm{MAP}}$, by standard nonlinear optimization algorithms such as conjugate gradients, using error backpropagation to evaluate the required derivatives.
       Having found a mode $\mathbf{w}_{\mathrm{MAP}}$, we can then build a local Gaussian approximation by evaluating the matrix of second derivatives of the negative log posterior distribution. From (5.165), this is given by
       $$\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta) = \alpha \mathbf{I} + \beta \mathbf{H} \tag{5.166}$$
       where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of $\mathbf{w}$. Algorithms for computing and approximating the Hessian were discussed in Section 5.4. The corresponding Gaussian approximation to the posterior is then given from (4.134) by
       $$q(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}). \tag{5.167}$$
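The Laplace recipe in (5.165) to (5.167) can be traced on a one-parameter toy "network". Everything specific in this sketch is an assumption for illustration: the model $y(x, w) = \tanh(wx)$, the noiseless targets, the values of $\alpha$ and $\beta$, and the use of bisection on the gradient in place of conjugate gradients with backpropagation.

```python
import numpy as np

alpha, beta = 1.0, 4.0
x = np.linspace(-2.0, 2.0, 21)
t = np.tanh(1.0 * x)  # noiseless toy targets generated with w = 1

def grad_neg_log_post(w):
    """d/dw of alpha/2 w^2 + beta/2 sum_n (tanh(w x_n) - t_n)^2, cf. (5.165)."""
    y = np.tanh(w * x)
    return alpha * w + beta * np.sum((y - t) * (1.0 - y**2) * x)

# Step 1: find w_MAP. For this data the gradient is negative at w = 0 and
# positive at w = 2, so bisection brackets a minimum of the negative log posterior.
lo, hi = 0.0, 2.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if grad_neg_log_post(mid) < 0.0:
        lo = mid
    else:
        hi = mid
w_map = 0.5 * (lo + hi)

# Step 2: A = second derivative of the negative log posterior at w_MAP (5.166),
# approximated here by a central difference of the gradient.
h = 1e-5
A = (grad_neg_log_post(w_map + h) - grad_neg_log_post(w_map - h)) / (2.0 * h)

# Step 3: Gaussian approximation q(w|D) = N(w | w_MAP, A^{-1}) (5.167)
laplace_mean, laplace_var = w_map, 1.0 / A
print(laplace_mean, laplace_var)
```

In a real network `w` is a vector, `A` is a matrix, and the Hessian is usually approximated (for example by its diagonal or a Gauss-Newton form) rather than computed by finite differences.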
  19. 19. • https://www.r-bloggers.com/easy-laplace-approximation-of-bayesian-models-in-r/
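  20. 20. • Variational distribution $q(w \mid \theta)$
       • Choose $\theta$ to minimize $\mathrm{KL}[q(w \mid \theta) \,\|\, p(w \mid \mathcal{D})]$
       • Minimizing the KL with respect to $\theta$ is equivalent to maximizing the ELBO with respect to $\theta$:
       $$\mathrm{KL}[q(w \mid \theta) \,\|\, p(w \mid \mathcal{D})] = -\int q(w \mid \theta) \log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)}\, dw + \log p(\mathcal{D}) = -\mathcal{L}(\mathcal{D}; \theta) + \log p(\mathcal{D})$$
       where $\mathcal{L}(\mathcal{D}; \theta)$ is the ELBO.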
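  21. 21. ELBO
       • ELBO $\mathcal{L}(\mathcal{D}; \theta)$
       1. $p(w \mid \mathcal{D})$, MC
       2. ELBO, MC: Monte Carlo estimate of
          $$\int q(w \mid \theta) \log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)}\, dw = \mathbb{E}_{q}\left[ \log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)} \right]$$
       3. EM, ELBO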
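The Monte Carlo form of the ELBO on slide 21 can be sanity-checked on a model where $\log p(\mathcal{D})$ is known. The sketch below assumes a conjugate toy model that is not in the slides ($x_i \sim \mathcal{N}(w, 1)$, $w \sim \mathcal{N}(0, 1)$) and a Gaussian $q(w \mid \theta)$ with $\theta$ = (`q_mean`, `q_var`); the names `log_normal` and `elbo_mc` are illustrative.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of N(x | mean, var), elementwise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def elbo_mc(data, q_mean, q_var, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(D|w) p(w) / q(w|theta)]."""
    w = q_mean + np.sqrt(q_var) * rng.standard_normal(num_samples)
    log_lik = log_normal(data[None, :], w[:, None], 1.0).sum(axis=1)
    log_prior = log_normal(w, 0.0, 1.0)
    log_q = log_normal(w, q_mean, q_var)
    return np.mean(log_lik + log_prior - log_q)

rng = np.random.default_rng(0)
data = np.array([0.5, 1.5, 1.0])
n = data.size

# Exact posterior and log evidence of the assumed conjugate model
post_mean, post_var = data.sum() / (n + 1), 1.0 / (n + 1)
log_evidence = (-0.5 * n * np.log(2.0 * np.pi) - 0.5 * np.log(n + 1)
                - 0.5 * (data @ data - data.sum() ** 2 / (n + 1)))

elbo_exact_q = elbo_mc(data, post_mean, post_var, 1000, rng)        # q = posterior
elbo_poor_q = elbo_mc(data, post_mean + 1.0, post_var, 1000, rng)   # shifted q
print(log_evidence, elbo_exact_q, elbo_poor_q)
```

When $q$ equals the posterior the KL term vanishes and the ELBO matches $\log p(\mathcal{D})$; any other $q$ gives a strictly smaller value, here lower by roughly the KL of 2 nats.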
  22. 22. History of Bayesian neural networks, following Gal:
       • Denker, Schwartz, Wittner, Solla, Howard, Jackel, Hopfield (1987)
       • Denker and LeCun (1991)
       • MacKay (1992)
       • Hinton and van Camp (1993)
       • Neal (1995)
       • Barber and Bishop (1998)
       • Graves (2011)
       • Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015)
       • Hernández-Lobato and Adams (2015)
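  23. 23. • Practical Variational Inference for Neural Networks [Graves 2011]
       • Loss $L^{E}$; variational distribution $q(w \mid \theta)$ with $\theta = \{\mu, \sigma\}$
       $$-\mathcal{L}(\mathcal{D}; \theta) = -\int q(w \mid \theta) \log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)}\, dw = -\mathbb{E}_{q(w \mid \theta)}\left[ \log p(\mathcal{D} \mid w) \right] + \mathrm{KL}[q(w \mid \theta) \,\|\, p(w)]$$
       With $L^{E} = -\mathbb{E}_{q(w \mid \theta)}[\log p(\mathcal{D} \mid w)]$ and samples $w_i \sim q(w \mid \theta)$:
       $$\frac{\partial L^{E}}{\partial \mu} \approx -\frac{1}{N} \sum_{i=1}^{N} \frac{\partial \log p(\mathcal{D} \mid w_i)}{\partial w}, \qquad \frac{\partial L^{E}}{\partial \sigma^2} \approx \frac{1}{2N} \sum_{i=1}^{N} \left( \frac{\partial \log p(\mathcal{D} \mid w_i)}{\partial w} \right)^2$$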
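  24. 24. • Weight Uncertainty in Neural Networks [Blundell+ 2015]
       • Variational distribution $q(w \mid \theta)$
       • http://www.slideshare.net/masa_s/weight-uncertainty-in-neural-networks
       Bayes by Backprop:
       $$\frac{\partial}{\partial \theta} \mathcal{L}(\mathcal{D}; \theta) = \frac{\partial}{\partial \theta} \mathbb{E}_{q(w \mid \theta)}\left[ f(w, \theta) \right] = \mathbb{E}_{q(\epsilon)}\left[ \frac{\partial f(w, \theta)}{\partial w} \frac{\partial w}{\partial \theta} + \frac{\partial f(w, \theta)}{\partial \theta} \right], \qquad f(w, \theta) = \log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)}$$
       Reparameterization trick:
       $$w = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$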
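The gradient estimator on slide 24 can be exercised end to end on a one-weight problem. This is a sketch, not the authors' implementation: it assumes the conjugate toy model $x_i \sim \mathcal{N}(w, 1)$, $w \sim \mathcal{N}(0, 1)$, parameterizes $\sigma = \exp(\rho)$ (the paper uses a softplus), and writes the derivatives of $f$ by hand where a real implementation would use backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([0.5, 1.5, 1.0])   # toy observations; assumed model x_i ~ N(w, 1), w ~ N(0, 1)
n = data.size
mu, rho = 0.0, 0.0                 # theta = (mu, rho), with sigma = exp(rho) > 0
lr, batch = 0.01, 64

for _ in range(3000):
    sigma = np.exp(rho)
    eps = rng.standard_normal(batch)   # eps ~ N(0, I)
    w = mu + sigma * eps               # reparameterization: w = mu + sigma * eps

    # f(w, theta) = log p(D|w) + log p(w) - log q(w|theta)
    df_dw = (data.sum() - n * w) - w + (w - mu) / sigma**2
    df_dmu = -(w - mu) / sigma**2                        # direct dependence via -log q
    df_dsigma = 1.0 / sigma - (w - mu) ** 2 / sigma**3   # direct dependence via -log q

    grad_mu = np.mean(df_dw * 1.0 + df_dmu)        # dw/dmu = 1
    grad_sigma = np.mean(df_dw * eps + df_dsigma)  # dw/dsigma = eps

    mu += lr * grad_mu                  # stochastic gradient ascent on the ELBO
    rho += lr * grad_sigma * sigma      # chain rule through sigma = exp(rho)

print(mu, np.exp(rho))
```

For this model the exact posterior is $\mathcal{N}(0.75, 0.25)$, so the learned `mu` and `exp(rho)` should settle near 0.75 and 0.5.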
  25. 25. Dropout
       • Dropout as a Bayesian Approximation [Gal+ 2015]
       • $q(w \mid \theta) = \prod_i q(W_i \mid M_i)$
       • $L^{E} = -\mathbb{E}_{q(w \mid \theta)}\left[ \log p(\mathcal{D} \mid w) \right]$
       • Related noise schemes: dropout, drop-connect, multiplicative Gaussian noise

       Excerpt from the paper:
       $$p(\mathbf{y} \mid \mathbf{x}, \omega) = \mathcal{N}\left( \mathbf{y};\, \hat{\mathbf{y}}(\mathbf{x}, \omega),\, \tau^{-1} \mathbf{I}_D \right)$$
       $$\hat{\mathbf{y}}\left( \mathbf{x}, \omega = \{\mathbf{W}_1, \ldots, \mathbf{W}_L\} \right) = \sqrt{\frac{1}{K_L}}\, \mathbf{W}_L\, \sigma\left( \ldots \sqrt{\frac{1}{K_1}}\, \mathbf{W}_2\, \sigma\left( \mathbf{W}_1 \mathbf{x} + \mathbf{m}_1 \right) \ldots \right)$$
       The posterior distribution $p(\omega \mid \mathbf{X}, \mathbf{Y})$ in eq. (2) is intractable. We use $q(\omega)$, a distribution over matrices whose columns are randomly set to zero, to approximate the intractable posterior. We define $q(\omega)$ as:
       $$\mathbf{W}_i = \mathbf{M}_i \cdot \mathrm{diag}\left( [z_{i,j}]_{j=1}^{K_i} \right), \qquad z_{i,j} \sim \mathrm{Bernoulli}(p_i) \ \text{for } i = 1, \ldots, L,\ j = 1, \ldots, K_{i-1}$$
       given some probabilities $p_i$ and matrices $\mathbf{M}_i$ as variational parameters. The binary variable $z_{i,j} = 0$ corresponds then to unit $j$ in layer $i - 1$ being dropped out as an input to layer $i$. The variational distribution $q(\omega)$ is highly multi-modal, inducing strong joint correlations over the rows of the matrices $\mathbf{W}_i$ (which correspond to the frequencies in the sparse spectrum GP approximation).
       We minimise the KL divergence between the approximate posterior $q(\omega)$ above and the posterior of the full deep GP, $p(\omega \mid \mathbf{X}, \mathbf{Y})$. This KL is our minimisation objective
       $$-\int q(\omega) \log p(\mathbf{Y} \mid \mathbf{X}, \omega)\, d\omega + \mathrm{KL}(q(\omega) \,\|\, p(\omega)). \tag{3}$$
  26. 26. Dropout
       • MC dropout
       • http://mlg.eng.cam.ac.uk/yarin/blog_2248.html

       Excerpt from the paper:
       where $\omega = \{\mathbf{W}_i\}_{i=1}^{L}$ is our set of random variables for a model with $L$ layers.
       We will perform moment-matching and estimate the first two moments of the predictive distribution empirically. More specifically, we sample $T$ sets of vectors of realisations from the Bernoulli distribution $\{\mathbf{z}_1^t, \ldots, \mathbf{z}_L^t\}_{t=1}^{T}$ with $\mathbf{z}_i^t = [z_{i,j}^t]_{j=1}^{K_i}$, giving $\{\mathbf{W}_1^t, \ldots, \mathbf{W}_L^t\}_{t=1}^{T}$. We estimate
       $$\mathbb{E}_{q(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*) \approx \frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{y}}^*(\mathbf{x}^*, \mathbf{W}_1^t, \ldots, \mathbf{W}_L^t) \tag{6}$$
       following proposition C in the appendix. We refer to this Monte Carlo estimate as MC dropout. In practice this is equivalent to performing $T$ stochastic forward passes through the network and averaging the results.
       This result has been presented in the literature before as model averaging. We have given a new derivation for this result which allows us to derive mathematically grounded uncertainty estimates as well. Srivastava et al. (2014, section 7.5) have reasoned empirically that MC dropout can be approximated by averaging the weights of the network (multiplying each $\mathbf{W}_i$ by $p_i$ at test time, referred to as standard dropout).
       We estimate the second raw moment in the same way:
       $$\mathbb{E}_{q(\mathbf{y}^* \mid \mathbf{x}^*)}\left( (\mathbf{y}^*)^{\mathrm{T}} (\mathbf{y}^*) \right) \approx \tau^{-1} \mathbf{I}_D + \frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{y}}^*(\mathbf{x}^*, \mathbf{W}_1^t, \ldots, \mathbf{W}_L^t)^{\mathrm{T}}\, \hat{\mathbf{y}}^*(\mathbf{x}^*, \mathbf{W}_1^t, \ldots, \mathbf{W}_L^t)$$
       following proposition D in the appendix. To obtain the model's predictive variance we have:
       $$\mathrm{Var}_{q(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*) \approx \tau^{-1} \mathbf{I}_D + \frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{y}}^*(\mathbf{x}^*, \mathbf{W}_1^t, \ldots, \mathbf{W}_L^t)^{\mathrm{T}}\, \hat{\mathbf{y}}^*(\mathbf{x}^*, \mathbf{W}_1^t, \ldots, \mathbf{W}_L^t) - \mathbb{E}_{q(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*)^{\mathrm{T}}\, \mathbb{E}_{q(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*)$$
       Footnote 2: In the appendix (section 4.1) we extend this derivation to classification. $E(\cdot)$ is defined as softmax loss and $\tau$ is set to 1.
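The MC dropout estimates quoted above amount to averaging $T$ stochastic forward passes. The sketch below assumes a one-hidden-layer ReLU network with a scalar output and random, untrained weights as stand-ins; the function name `mc_dropout_predict` and the `keep_prob` parameter are hypothetical.

```python
import numpy as np

def mc_dropout_predict(x, W1, b1, W2, b2, keep_prob, tau, T, rng):
    """Predictive mean and variance from T stochastic forward passes (scalar output)."""
    outputs = np.empty(T)
    for t in range(T):
        # Bernoulli mask over hidden units, i.e. over the inputs to the last layer
        z = rng.random(W1.shape[1]) < keep_prob
        h = np.maximum(0.0, x @ W1 + b1) * z
        outputs[t] = h @ W2 + b2
    mean = outputs.mean()                                # first moment, eq. (6)
    var = 1.0 / tau + np.mean(outputs**2) - mean**2      # second moment minus mean^2
    return mean, var

rng = np.random.default_rng(0)
D, H = 2, 50
W1, b1 = rng.standard_normal((D, H)), np.zeros(H)        # stand-in weights, not trained
W2, b2 = rng.standard_normal(H) / np.sqrt(H), 0.0
x_star = np.array([0.3, -1.2])
tau = 10.0

mean, var = mc_dropout_predict(x_star, W1, b1, W2, b2, keep_prob=0.9, tau=tau, T=200, rng=rng)
mean_det, var_det = mc_dropout_predict(x_star, W1, b1, W2, b2, keep_prob=1.0, tau=tau, T=200, rng=rng)
print(mean, var, var_det)
```

With `keep_prob=1.0` every pass is identical, so the variance collapses to the noise term $\tau^{-1}$; any spread beyond that comes from the dropout masks.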
  27. 27. • MC dropout results, from "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning"
       [Figure 2, panels: (a) Standard dropout with weight averaging; (b) Gaussian process with SE covariance function; (c) MC dropout with ReLU non-linearities; (d) MC dropout with TanH non-linearities.]
       Figure 2. Predictive mean and uncertainties on the Mauna Loa CO2 concentrations dataset, for various models. In red is the observed function (left of the dashed blue line); in blue is the predictive mean plus/minus two standard deviations (8 for fig. 2d). Different shades of blue represent half a standard deviation. Marked with a dashed red line is a point far away from the data: standard dropout confidently predicts an insensible value for the point; the other models predict insensible values as well but with the additional information that the models are uncertain about their predictions.
       Fig. 2c shows the results of the same network as in fig. 2a, but with MC dropout used to evaluate the predictive mean and uncertainty for the training and test sets. Lastly, fig. 2d shows the same using the TanH network with 5 layers (plotted with 8 times the standard deviation for visualisation purposes).
  28. 28. Dropout
       • Variational dropout and the local reparameterization trick [Kingma+ 2015]
       • Local reparameterization trick
       • http://www.slideshare.net/masa_s/dl-hacks-variational-dropout-and-the-local-reparameterization-trick

       Excerpt from the paper (SGVB):
       2.2 Variance of the SGVB estimator
       The theory of stochastic approximation tells us that stochastic gradient ascent using (3) will asymptotically converge to a local optimum for an appropriately declining step size and sufficient weight updates [18], but in practice the performance of stochastic gradient ascent crucially depends on the variance of the gradients. If this variance is too large, stochastic gradient descent will fail to make much progress in any reasonable amount of time. Our objective function consists of an expected log likelihood term that we approximate using Monte Carlo, and a KL divergence term $D_{KL}(q_\phi(\mathbf{w}) \,\|\, p(\mathbf{w}))$ that we assume can be calculated analytically and otherwise be approximated with Monte Carlo with similar reparameterization.
       Assume that we draw minibatches of datapoints with replacement; see appendix F for a similar analysis for minibatches without replacement. Using $L_i$ as shorthand for $\log p(\mathbf{y}^i \mid \mathbf{x}^i, \mathbf{w} = f(\epsilon^i, \phi))$, the contribution to the likelihood for the $i$-th datapoint in the minibatch, the Monte Carlo estimator (3) may be rewritten as $L_{\mathcal{D}}^{\mathrm{SGVB}}(\phi) = \frac{N}{M} \sum_{i=1}^{M} L_i$, whose variance is given by
       $$\mathrm{Var}\left[ L_{\mathcal{D}}^{\mathrm{SGVB}}(\phi) \right] = \frac{N^2}{M^2} \left( \sum_{i=1}^{M} \mathrm{Var}[L_i] + 2 \sum_{i=1}^{M} \sum_{j=i+1}^{M} \mathrm{Cov}[L_i, L_j] \right) \tag{4}$$
       $$= N^2 \left( \frac{1}{M} \mathrm{Var}[L_i] + \frac{M-1}{M} \mathrm{Cov}[L_i, L_j] \right), \tag{5}$$
       where the variances and covariances are w.r.t. both the data distribution and $\epsilon$ distribution, i.e. $\mathrm{Var}[L_i] = \mathrm{Var}_{\epsilon, \mathbf{x}^i, \mathbf{y}^i}\left[ \log p(\mathbf{y}^i \mid \mathbf{x}^i, \mathbf{w} = f(\epsilon, \phi)) \right]$, with $\mathbf{x}^i, \mathbf{y}^i$ drawn from the empirical distribution defined by the training set. As can be seen from (5), the total contribution to the variance by $\mathrm{Var}[L_i]$ is inversely proportional to the minibatch size $M$. However, the total contribution by the covariances does not decrease with $M$. In practice, this means that the variance of $L_{\mathcal{D}}^{\mathrm{SGVB}}(\phi)$ can be dominated by the covariances for even moderately large $M$.
       2.3 Local Reparameterization Trick
       We therefore propose an alternative estimator for which we have $\mathrm{Cov}[L_i, L_j] = 0$, so that the variance of our stochastic gradients scales as $1/M$. We then make this new estimator computationally efficient by not sampling $\epsilon$ directly, but only sampling the intermediate variables $f(\epsilon)$ through which [...]
  29. 29. Automatic Differentiation Variational Inference
  30. 30. → Automatic differentiation variational inference (ADVI)
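  31. 31. Automatic Differentiation Variational Inference (ADVI)
       • Automatic Differentiation Variational Inference [Kucukelbir+ 2016]
       • Available in Stan, PyMC3, Edward
       • ADVI steps:
       1. Transform $\theta$ so that $p(x, \theta)$ becomes $p(x, \zeta)$; the objective is $\mathrm{KL}[q(\zeta) \,\|\, p(\zeta \mid x)]$
       2. $q$, MC
       3. $q$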
  32. 32. Automatic transformation
       • $\theta$ and the support of $p(\theta)$
       • Transformation $\theta \to \zeta$, $T: \theta \mapsto \zeta$
       [Figure 1, panels: (a) Latent variable space (axes $\theta$, Density); (b) Real coordinate space (axis $\zeta$).]
       Figure 1: Transforming the latent variable to real coordinate space. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is $\mathbb{R}_{>0}$. (a→b) $T$ transforms the latent variable space to $\mathbb{R}$. (b) The variational approximation is a Gaussian in real coordinate space.
  33. 33. Automatic transformation ¤ ¤ r = log (N) ¤ ¤ p
      Consider again our running Weibull-Poisson example from Section 2.1. The latent variable \(\theta\) lives in \(\mathbb{R}_{>0}\). The logarithm \(\zeta = T(\theta) = \log(\theta)\) transforms \(\mathbb{R}_{>0}\) to the real line \(\mathbb{R}\). Its Jacobian adjustment is the derivative of the inverse of the logarithm, \(|\det J_{T^{-1}}(\zeta)| = \exp(\zeta)\). The transformed density is
      \[ p(x, \zeta) = \mathrm{Poisson}\big(x \mid \exp(\zeta)\big) \times \mathrm{Weibull}\big(\exp(\zeta);\, 1.5, 1\big) \times \exp(\zeta). \]
      Figure 1 depicts this transformation.
      As we describe in the introduction, we implement our algorithm in Stan (Stan Development Team, 2015). Stan maintains a library of transformations and their corresponding Jacobians (footnote 4). With Stan, we can automatically transform the joint density of any differentiable probability model to one with real-valued latent variables. (See Figure 2.)
      [Figure 1: Transforming the latent variable to real coordinate space. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is \(\mathbb{R}_{>0}\). (a→b) \(T\) transforms the latent variable space to \(\mathbb{R}\). (b) The variational approximation is a Gaussian in real coordinate space. Panels: (a) Latent variable space (axes: θ, Density); (b) Real coordinate space (axis: ζ). Legend: Prior, Posterior, Approximation.]
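The role of the Jacobian term in the Weibull-Poisson example can be checked numerically. The sketch below is only an illustration under stated assumptions: the observation `x_obs = 3`, the integration limits, and all helper names are hypothetical and do not come from the slides or from Stan. It integrates the joint density over θ and over ζ = log θ, and shows that the two results agree only when the factor exp(ζ) is included.

```python
import math

def log_poisson(x, rate):
    """Log pmf of Poisson(x | rate)."""
    return x * math.log(rate) - rate - math.lgamma(x + 1)

def log_weibull(theta, shape=1.5, scale=1.0):
    """Log density of Weibull(theta; shape, scale) for theta > 0."""
    z = theta / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape

def log_joint_theta(x, theta):
    """log p(x, theta) in the original latent space (theta > 0)."""
    return log_poisson(x, theta) + log_weibull(theta)

def log_joint_zeta(x, zeta):
    """log p(x, zeta) with theta = exp(zeta); the '+ zeta' is log|det J|."""
    return log_joint_theta(x, math.exp(zeta)) + zeta

def integrate(f, lo, hi, n=20000):
    """Midpoint rule; it never evaluates f at the end points."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

x_obs = 3  # hypothetical single observation

# Same marginal p(x), computed in both coordinate systems.
evidence_theta = integrate(lambda t: math.exp(log_joint_theta(x_obs, t)), 0.0, 30.0)
evidence_zeta = integrate(lambda z: math.exp(log_joint_zeta(x_obs, z)), -15.0, math.log(30.0))

# Dropping the Jacobian gives a function of zeta that no longer integrates to p(x).
evidence_no_jacobian = integrate(
    lambda z: math.exp(log_joint_theta(x_obs, math.exp(z))), -15.0, math.log(30.0)
)
```

The first two integrals should match to numerical precision, while the third should not; that gap is what the Jacobian adjustment corrects.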
  34. 34. ¤ ¤ ¤ ¤ L ¤ ¤
      2.4 Variational Approximations in Real Coordinate Space
      After the transformation, the latent variables \(\zeta\) have support in the real coordinate space \(\mathbb{R}^K\). We have a choice of variational approximations in this space. Here, we consider Gaussian distributions; these implicitly induce non-Gaussian variational distributions in the original latent variable space.
      Mean-field Gaussian. One option is to posit a factorized (mean-field) Gaussian variational approximation
      \[ q(\zeta; \phi) = \mathcal{N}\big(\zeta;\, \mu, \mathrm{diag}(\sigma^2)\big) = \prod_{k=1}^{K} \mathcal{N}\big(\zeta_k;\, \mu_k, \sigma_k^2\big), \]
      where the vector \(\phi = (\mu_1, \cdots, \mu_K, \sigma_1^2, \cdots, \sigma_K^2)\) concatenates the mean and variance of each Gaussian factor. Since the variance parameters must always be positive, the variational parameters live in the set \(\Phi = \{\mathbb{R}^K, \mathbb{R}^K_{>0}\}\). Re-parameterizing the mean-field Gaussian removes this constraint. Consider the logarithm of the standard deviations, \(\omega = \log(\sigma)\), applied element-wise. The support of \(\omega\) is now the real coordinate space and \(\sigma\) is always positive. The mean-field Gaussian becomes \(q(\zeta; \phi) = \mathcal{N}\big(\zeta;\, \mu, \mathrm{diag}(\exp(\omega)^2)\big)\), where the vector \(\phi = (\mu_1, \cdots, \mu_K, \omega_1, \cdots, \omega_K)\) concatenates the mean and logarithm of the standard deviation of each factor. Now, the variational parameters are unconstrained in \(\mathbb{R}^{2K}\).
      Full-rank Gaussian. Another option is to posit a full-rank Gaussian variational approximation
      \[ q(\zeta; \phi) = \mathcal{N}(\zeta;\, \mu, \Sigma), \]
      where the vector \(\phi = (\mu, \Sigma)\) concatenates the mean vector \(\mu\) and covariance matrix \(\Sigma\). To ensure that \(\Sigma\) always remains positive semidefinite, we re-parameterize the covariance matrix using a Cholesky factorization, \(\Sigma = L L^\top\). We use the non-unique definition of the Cholesky factorization where the diagonal elements of \(L\) need not be positively constrained (Pinheiro and Bates, 1996). Therefore \(L\) lives in the unconstrained space of lower-triangular matrices with \(K(K+1)/2\) real-valued entries. The full-rank Gaussian becomes \(q(\zeta; \phi) = \mathcal{N}(\zeta;\, \mu, L L^\top)\), where the variational parameters \(\phi = (\mu, L)\) are unconstrained in \(\mathbb{R}^{K + K(K+1)/2}\).
      The full-rank Gaussian generalizes the mean-field Gaussian approximation. The off-diagonal terms in the covariance matrix \(\Sigma\) capture posterior correlations across latent random variables. This leads to a more accurate posterior approximation than the mean-field Gaussian; however, it comes at a computational cost. Various low-rank approximations to the covariance matrix reduce this cost, yet limit its a…
      Footnote 4: Stan provides various transformations for upper and lower bounds, simplex and ordered vectors, and structured matrices such as covariance matrices and Cholesky factors.
      [Figure 2: Specifying a simple nonconjugate probability model in Stan. Only the last line of the program survives in the transcript: x[n] ~ poisson(theta); }]
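The two unconstrained parameterizations described on this slide can be sketched in a few lines. This is a minimal illustration, not code from the slides or from any library: the function names are made up for this example, and plain Python lists stand in for vectors and matrices. It draws ζ from a mean-field Gaussian via ω = log σ and from a full-rank Gaussian via a lower-triangular factor L whose diagonal is allowed to be negative.

```python
import math
import random

def sample_mean_field(mu, omega, rng):
    """Draw zeta = mu + exp(omega) * eta element-wise, with eta ~ N(0, I)."""
    return [m + math.exp(w) * rng.gauss(0.0, 1.0) for m, w in zip(mu, omega)]

def sample_full_rank(mu, L, rng):
    """Draw zeta = mu + L @ eta, with L lower-triangular and eta ~ N(0, I)."""
    eta = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][j] * eta[j] for j in range(i + 1)) for i, m in enumerate(mu)]

def covariance_from_factor(L):
    """Sigma = L L^T; positive semidefinite for any real lower-triangular L."""
    K = len(L)
    return [
        [sum(L[i][k] * L[j][k] for k in range(min(i, j) + 1)) for j in range(K)]
        for i in range(K)
    ]

def num_variational_params(K):
    """Parameter counts: 2K for mean-field, K + K(K+1)/2 for full-rank."""
    return {"mean_field": 2 * K, "full_rank": K + K * (K + 1) // 2}

# A factor with a negative diagonal entry still yields a valid covariance.
L_example = [[1.0, 0.0], [2.0, -3.0]]
sigma_example = covariance_from_factor(L_example)

# A diagonal L reproduces the mean-field draw when the random stream is the same.
mu_example, omega_example = [0.5, -1.0], [0.0, -0.7]
L_diag = [[math.exp(omega_example[0]), 0.0], [0.0, math.exp(omega_example[1])]]
draw_mf = sample_mean_field(mu_example, omega_example, random.Random(1))
draw_fr = sample_full_rank(mu_example, L_diag, random.Random(1))
```

Because both parameter sets are unconstrained, a gradient step on (μ, ω) or (μ, L) can never leave the valid region, which is the point of the re-parameterization.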
  35. 35. ¤ ELBO ¤ s ¤ ¤ Reparameterization trick p Z ¤ ELBO ¤ s ¤
      2.5 The Variational Problem in Real Coordinate Space
      Here is the story so far. We began with a differentiable probability model \(p(x, \theta)\). We transformed the latent variables into \(\zeta\), which live in the real coordinate space. We defined variational approximations in the transformed space. Now, we consider the variational optimization problem.
      Write the variational objective function, the ELBO, in real coordinate space as
      \[ \mathcal{L}(\phi) = \mathbb{E}_{q(\zeta; \phi)}\Big[ \log p\big(x, T^{-1}(\zeta)\big) + \log\big|\det J_{T^{-1}}(\zeta)\big| \Big] + \mathbb{H}\big[q(\zeta; \phi)\big]. \quad (5) \]
      The inverse of the transformation \(T^{-1}\) appears in the joint model, along with the determinant of the Jacobian adjustment. The ELBO is a function of the variational parameters \(\phi\) and the entropy \(\mathbb{H}\), both of which depend on the variational approximation. (Derivation in Appendix B.)
      Now, we can freely optimize the ELBO in the real coordinate space without worrying about the support matching constraint. The optimization problem from Equation (3) becomes
      \[ \phi^{*} = \arg\max_{\phi} \mathcal{L}(\phi), \quad (6) \]
      where the parameter vector \(\phi\) lives in some appropriately dimensioned real coordinate space. This is an unconstrained optimization problem that we can solve using gradient ascent. Traditionally, this would require manual computation of gradients. Instead, we develop a stochastic gradient ascent algorithm that uses automatic differentiation to compute gradients and MC integration to approximate expectations.
      We cannot directly use automatic differentiation on the ELBO. This is because the ELBO involves an unknown expectation. However, we can automatically differentiate the functions inside the expectation. (The model \(p\) and transformation \(T\) are both easy to represent as computer functions (Baydin et al., 2015).) To apply automatic differentiation, we want to push the gradient operation inside the expectation. To this end, we employ one final transformation: elliptical standardization (Härdle and Simar, 2012).
      Elliptical standardization. Consider a transformation \(S_{\phi}\) that absorbs the variational parameters \(\phi\); this converts the Gaussian variational approximation into a standard Gaussian. In the mean-field case, the standardization is \(\eta = S_{\phi}(\zeta) = \mathrm{diag}\big(\exp(\omega)\big)^{-1}(\zeta - \mu)\). In the full-rank Gaussian, the standardization is \(\eta = S_{\phi}(\zeta) = L^{-1}(\zeta - \mu)\).
      In both cases, the standardization encapsulates the variational parameters; in return it gives a fixed variational density
      \[ q(\eta) = \mathcal{N}(\eta;\, 0, I) = \prod_{k=1}^{K} \mathcal{N}(\eta_k;\, 0, 1), \]
      as shown in Figure 3.
      The standardization transforms the variational problem from Equation (5) into
      \[ \phi^{*} = \arg\max_{\phi}\; \mathbb{E}_{\mathcal{N}(\eta;\, 0, I)}\Big[ \log p\big(x, T^{-1}(S_{\phi}^{-1}(\eta))\big) + \log\big|\det J_{T^{-1}}\big(S_{\phi}^{-1}(\eta)\big)\big| \Big] + \mathbb{H}\big[q(\zeta; \phi)\big]. \]
      The expectation is now in terms of a standard Gaussian density. The Jacobian of elliptical standardization evaluates to one, because the Gaussian distribution is a member of the location-scale family: standardizing a Gaussian gives another Gaussian distribution. (See Appendix A.) We do not need to transform the entropy term as it does not depend on the model or the transformation; we have a simple analytic form for the entropy of a Gaussian and its gradient. We implement these once and reuse for all models.
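The standardized objective can be optimized with plain stochastic gradient ascent. The following is a rough sketch for the one-dimensional Weibull-Poisson example with a mean-field Gaussian, under several assumptions that are not in the slides: the data `DATA` are invented, the derivative of the log joint is written by hand where ADVI would use automatic differentiation, and a fixed step size replaces the adaptive step-size sequence used in practice. With ζ = μ + exp(ω)η, the gradient with respect to μ is the average of the log-joint derivative, and the gradient with respect to ω adds 1 for the Gaussian entropy.

```python
import math
import random

# Hypothetical count data for the Weibull-Poisson toy model (not from the slides).
DATA = [2, 1, 3, 0, 2]

def grad_log_joint_zeta(zeta, data=DATA):
    """d/dzeta of [log p(x, exp(zeta)) + zeta], where '+ zeta' is the log-Jacobian.

    Hand-derived for a Poisson(x_n | theta) likelihood with a Weibull(1.5, 1)
    prior; ADVI would obtain this by automatic differentiation instead.
    """
    theta = math.exp(zeta)
    return sum(data) + 1.5 - len(data) * theta - 1.5 * theta ** 1.5

def advi_fit(steps=4000, num_samples=5, step_size=0.005, seed=0):
    """Stochastic gradient ascent on the ELBO over (mu, omega), omega = log sigma."""
    rng = random.Random(seed)
    mu, omega = 0.0, 0.0
    for _ in range(steps):
        grad_mu, grad_omega = 0.0, 0.0
        for _ in range(num_samples):
            eta = rng.gauss(0.0, 1.0)            # eta ~ N(0, 1)
            zeta = mu + math.exp(omega) * eta    # inverse standardization
            g = grad_log_joint_zeta(zeta)
            grad_mu += g
            grad_omega += g * eta * math.exp(omega)
        grad_mu /= num_samples
        grad_omega = grad_omega / num_samples + 1.0  # + d(entropy)/d(omega)
        mu += step_size * grad_mu
        omega += step_size * grad_omega
    return mu, omega

mu_hat, omega_hat = advi_fit()
# Approximate posterior mean of theta = exp(zeta) under the fitted Gaussian on zeta.
theta_mean = math.exp(mu_hat + 0.5 * math.exp(2.0 * omega_hat))
```

The fitted Gaussian lives in ζ-space; mapping back through θ = exp(ζ) gives a log-normal approximation in the original space, which is how a Gaussian in real coordinates induces a non-Gaussian approximation for θ.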
  36. 36. ¤ Black-box variational inference [Ranganath+ 2014] ¤ ¤ ADVI likelihood ratio trick ¤ reparameterization trick
      3.2 Variance of the Stochastic Gradients
      ADVI uses Monte Carlo integration to approximate gradients of the ELBO, and then uses these gradients in a stochastic optimization algorithm (Section 2). The speed of ADVI hinges on the variance of the gradient estimates. When a stochastic optimization algorithm suffers from high-variance gradients, it must repeatedly recover from poor parameter estimates.
      ADVI is not the only way to compute Monte Carlo approximations of the gradient of the ELBO. Black box variational inference (BBVI) takes a different approach (Ranganath et al., 2014). The BBVI gradient estimator uses the gradient of the variational approximation and avoids using the gradient of the model. For example, the following BBVI estimator
      \[ \nabla^{\mathrm{BBVI}}_{\mu} \mathcal{L} = \mathbb{E}_{q(\zeta; \phi)}\Big[ \nabla_{\mu} \log q(\zeta; \phi)\, \Big\{ \log p\big(x, T^{-1}(\zeta)\big) + \log\big|\det J_{T^{-1}}(\zeta)\big| - \log q(\zeta; \phi) \Big\} \Big] \]
      and the ADVI gradient estimator in Equation (7) both lead to unbiased estimates of the exact gradient. While BBVI is more general (it does not require the gradient of the model and thus applies to more settings), its gradients can suffer from high variance.
      Figure 8 empirically compares the variance of both estimators for two models. Figure 8a shows the variance of both gradient estimators for a simple univariate model, where the posterior is a Gamma(10, 10). We estimate the variance using ten thousand re-calculations of the gradient \(\nabla_{\phi} \mathcal{L}\), across an increasing number of MC samples \(M\). The ADVI gradient has lower variance; in practice, a single sample suffices. (See the experiments in Section 4.)
      Figure 8b shows the same calculation for a 100-dimensional nonlinear regression model with likelihood \(\mathcal{N}\big(y \mid \tanh(x^\top \beta), I\big)\) and a Gaussian prior on the regression coefficients \(\beta\). Because this is a multivariate example, we also show the BBVI gradient with a variance reduction scheme using control variates described in Ranganath et al. (2014). In both cases, the ADVI gradients are statistically more efficient.
      [Figure 8: Comparison of gradient estimator variances. The ADVI gradient estimator exhibits lower variance than the BBVI estimator. Moreover, it does not require control variate variance reduction, which is not available in univariate situations. Panels: (a) Univariate Model; (b) Multivariate Nonlinear Regression Model. Axes: Number of MC samples (horizontal), Variance (vertical), both logarithmic. Legend: ADVI, BBVI, BBVI with control variate.]
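The variance gap between the two estimators can be reproduced on a toy problem. The sketch below is not the experiment from Figure 8: it assumes an identity transformation (so no Jacobian term), a hypothetical unnormalized Gaussian log target with an arbitrary additive constant `OFFSET`, and a standard normal variational density. It compares single-sample estimates of the gradient of the ELBO with respect to μ from the score-function (likelihood ratio) estimator and from the reparameterization estimator.

```python
import math
import random
import statistics

OFFSET = -3.0  # hypothetical additive constant of the unnormalized log target

def log_target(zeta):
    """Unnormalized log density, Gaussian-shaped around 1 (toy stand-in for log p)."""
    return -0.5 * (zeta - 1.0) ** 2 + OFFSET

def grad_log_target(zeta):
    """Derivative of log_target; the constant OFFSET drops out."""
    return -(zeta - 1.0)

def log_q(zeta, mu, sigma):
    """Log density of the Gaussian variational approximation N(mu, sigma^2)."""
    return -0.5 * ((zeta - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def score_function_estimate(mu, sigma, rng):
    """BBVI-style estimate: grad_mu log q(zeta) * (log target - log q)."""
    zeta = rng.gauss(mu, sigma)
    score = (zeta - mu) / sigma ** 2
    return score * (log_target(zeta) - log_q(zeta, mu, sigma))

def reparameterization_estimate(mu, sigma, rng):
    """ADVI-style estimate: differentiate through zeta = mu + sigma * eta."""
    eta = rng.gauss(0.0, 1.0)
    return grad_log_target(mu + sigma * eta)

rng = random.Random(0)
score_draws = [score_function_estimate(0.0, 1.0, rng) for _ in range(20000)]
reparam_draws = [reparameterization_estimate(0.0, 1.0, rng) for _ in range(20000)]

mean_score = statistics.mean(score_draws)
mean_reparam = statistics.mean(reparam_draws)
var_score = statistics.pvariance(score_draws)
var_reparam = statistics.pvariance(reparam_draws)
```

Both sample means estimate the same gradient, but the score-function estimator multiplies the score by the full log-density difference, so its variance also depends on additive constants such as `OFFSET`; the reparameterization estimator only sees the derivative and is unaffected by them.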
  37. 37. ¤ ¤ ¤ ¤ ¤ 2 ¤ ¤ O ¤ ¤
  38. 38. ¤ Stan ¤ ¤ MCMC NUTS[Hoffman+ 2014] HMC ¤ Stan Python R ¤ ADVI ¤ PyMC3 ¤ Python MCMC ¤ Theano GPU ¤ ADVI ¤ Edward ¤ ¤ criticism ¤ Python TensorFlow Keras ¤ Stan PyMC3 35x [Tran+ 2016]
  39. 39. ¤ Tars ¤ https://github.com/masa-su/Tars ¤ Edward Tran PyMC3 Wiecki star ¤ ¤ ¤ Edward PyMC3 ¤ ¤ ¤ Q Tars ¤ A
