
# (DL Hacks reading group) Bayesian Neural Network


### (DL Hacks reading group) Bayesian Neural Network

1. Bayesian Neural Network (2017/01/27)
2. Bayesian neural networks
  - NIPS workshop on Bayesian deep learning: http://bayesiandeeplearning.org
  - Probabilistic programming tools: Stan, PyMC3, Edward
3.
4. 4. ¤ ! " # ¤ \$(#|") ¤ ¤ \$(!|#, ") ¤ ¤ ¤ w ¤ \$(#|!, ") ¤ ¤ \$(!|") ¤ ¤ ¤
5. Two forms of the data D
  - Unsupervised: D = {x_n}_{n=1}^N = X, with p(D | w) = ∏_{n=1}^N p(x_n | w) = p(X | w)
  - Supervised: D = {x_n, y_n}_{n=1}^N = (X, Y), with p(D | w) = ∏_{n=1}^N p(y_n | x_n, w) = p(Y | X, w)
  - Bayes' rule for the posterior: p(w | D) = p(D | w) p(w) / p(D) = p(D | w) p(w) / ∫ p(D | w) p(w) dw
6. Maximum likelihood (MLE) and MAP estimation
  - w_MLE = argmax_w log p(D | w) = argmax_w Σ_n log p(x_n | w)
  - w_MAP = argmax_w log p(w | D) = argmax_w [ log p(D | w) + log p(w) ]
  - MAP adds the log prior to the MLE objective, acting as a regularizer
7. Prediction for a new point x*
  - Unsupervised: p(x* | D) = ∫ p(x* | w) p(w | D) dw = E_{p(w|D)}[ p(x* | w) ]
  - Supervised: p(y* | x*, D) = ∫ p(y* | x*, w) p(w | D) dw = E_{p(w|D)}[ p(y* | x*, w) ]
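In practice the predictive integral is approximated with Monte Carlo samples of w. A minimal sketch, assuming we already have samples `w_samples` from p(w|D) and a `forward` function for the model (both names are hypothetical; this estimates the predictive mean and spread rather than the full density):

```python
import numpy as np

def predictive_mc(x_star, w_samples, forward):
    """Monte Carlo estimate of E_{p(w|D)}[ forward(x*, w) ]:
    average the model output over posterior weight samples."""
    outputs = np.array([forward(x_star, w) for w in w_samples])
    return outputs.mean(axis=0), outputs.std(axis=0)   # predictive mean and spread

# Toy example: the "network" is just y = w * x, and the posterior samples are made up.
rng = np.random.default_rng(0)
w_samples = rng.normal(loc=2.0, scale=0.3, size=100)   # pretend these came from p(w|D)
mean, std = predictive_mc(1.5, w_samples, forward=lambda x, w: w * x)
print(mean, std)
```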
8.
  - Predictive distribution: p(x* | D) = ∫ p(x* | w) p(w | D) dw = E_{p(w|D)}[ p(x* | w) ]
  - Posterior over models: p(m | D) = p(D | m) p(m) / p(D)
  - Posterior over parameters: p(w | D) = p(D | w) p(w) / p(D) = p(D | w) p(w) / ∫ p(D | w) p(w) dw
9.
  - Computing the posterior over the parameters w:
    p(w | D) = p(D | w) p(w) / p(D) = p(D | w) p(w) / ∫ p(D | w) p(w) dw
  - The normalizing integral over w is intractable for models such as neural networks
10. Two approaches to the intractable posterior
  1. MCMC: draw samples from p(w | D)
  2. Approximate inference: approximate p(w | D) with a simpler distribution q(w) (parameterized as q(w | λ))
  - Below, mainly approach 2
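As a toy illustration of route 1, here is a minimal random-walk Metropolis sampler; it only needs the unnormalized log posterior log p(D | w) + log p(w). The 1D target below is made up for illustration, not taken from the slides:

```python
import numpy as np

def metropolis(log_post, w0, n_steps=5000, step=0.5, seed=0):
    """Random-walk Metropolis: approximate samples from p(w|D)
    using only the unnormalized log posterior."""
    rng = np.random.default_rng(seed)
    w, lp = w0, log_post(w0)
    samples = []
    for _ in range(n_steps):
        w_new = w + step * rng.normal()
        lp_new = log_post(w_new)
        if np.log(rng.uniform()) < lp_new - lp:   # accept with prob min(1, ratio)
            w, lp = w_new, lp_new
        samples.append(w)
    return np.array(samples)

# Toy target: Gaussian likelihood around the data with a standard normal prior on w.
data = np.array([1.8, 2.2, 2.0])
log_post = lambda w: -0.5 * np.sum((data - w) ** 2) - 0.5 * w ** 2
samples = metropolis(log_post, w0=0.0)
print(samples[1000:].mean(), samples[1000:].std())   # posterior mean and std after burn-in
```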
11. Neural networks vs. Bayesian neural networks
  - Place a distribution p(w) over the weights w instead of learning point estimates
  - Predictive distribution: p(y* | x*, D) = ∫ p(y* | x*, w) p(w | D) dw
  - [Figure 1 from "Weight Uncertainty in Neural Networks" [Blundell+ 2015]: left, each weight has a fixed value, as provided by classical backpropagation; right, each weight is assigned a distribution, as provided by Bayes by Backprop]
12. Why should we care? ("The Importance of Knowing What We Don't Know", Yarin Gal)
  - Calibrated model and prediction uncertainty: getting systems that know when they don't know
  - Automatic model complexity control and structure learning (Bayesian Occam's razor)
  - [Slide by Zoubin Ghahramani; figure from Yarin Gal's thesis "Uncertainty in Deep Learning" (2016)]
  - http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
13.
  - Stochastic neural networks
  - VAE
  - https://jmhl.org/research/
14.
  - http://evelinag.com/blog/2014/09-15-introducing-ariadne/#.WI8X7LaLTEY
15. Bayesian linear regression (PRML §3.3)
  - Gaussian likelihood with noise precision β and a zero-mean isotropic Gaussian prior with precision α
  - [Excerpts from PRML §3.1.5 and §3.3: y(x, w) = Wᵀφ(x); likelihood p(t | x, W, β) = N(t | Wᵀφ(x), β⁻¹I); prior p(w | α) = N(w | 0, α⁻¹I); because the prior is conjugate, the posterior is Gaussian, p(w | t) = N(w | m_N, S_N) with m_N = β S_N Φᵀ t and S_N⁻¹ = αI + β ΦᵀΦ, and its mode coincides with its mean, so w_MAP = m_N; the log posterior is ln p(w | t) = −(β/2) Σ_n {t_n − wᵀφ(x_n)}² − (α/2) wᵀw + const, so maximizing it is equivalent to regularized least squares; with an infinitely broad prior (α → 0) the posterior mean reduces to the maximum likelihood solution, and when data arrive sequentially the posterior at each stage acts as the prior for the next observation]
17.
18. Laplace approximation for Bayesian neural networks (PRML §5.7)
  1. Gaussian likelihood p(t | x, w, β) = N(t | y(x, w), β⁻¹) and Gaussian prior p(w | α) = N(w | 0, α⁻¹I)
  2. Find the mode of the posterior, w_MAP, by standard nonlinear optimization, using backpropagation for the required derivatives
  3. Evaluate the matrix of second derivatives of the negative log posterior at the mode: A = −∇∇ ln p(w | D, α, β) = αI + βH, where H is the Hessian of the sum-of-squares error
  4. Local Gaussian approximation to the posterior: q(w | D) = N(w | w_MAP, A⁻¹)
  - [PRML excerpt: because y(x, w) depends nonlinearly on w, the exact posterior p(w | D, α, β) ∝ p(w | α) p(D | w, β) is non-Gaussian; its log, ln p(w | D) = −(α/2) wᵀw − (β/2) Σ_n {y(x_n, w) − t_n}² + const, corresponds to a regularized sum-of-squares error function; the predictive distribution is obtained by marginalizing over the Gaussian approximation]
19. Laplace approximation: worked example
  - https://www.r-bloggers.com/easy-laplace-approximation-of-bayesian-models-in-r/
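A minimal sketch of the Laplace approximation on a 1D toy posterior (the model is made up for illustration, not taken from the linked post): find the mode numerically, then use the negative second derivative of the log posterior at the mode as the Gaussian precision.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy unnormalized log posterior: Poisson likelihood with a prior on the log rate.
counts = np.array([3, 4, 2, 5])
def log_post(theta):
    if theta <= 0:
        return -np.inf
    return np.sum(counts * np.log(theta) - theta) - 0.5 * np.log(theta) ** 2

# 1. Find the MAP estimate (mode of the posterior).
res = minimize_scalar(lambda t: -log_post(t), bounds=(1e-6, 20), method="bounded")
theta_map = res.x

# 2. Second derivative of the log posterior at the mode (central finite difference).
h = 1e-4
d2 = (log_post(theta_map + h) - 2 * log_post(theta_map) + log_post(theta_map - h)) / h**2

# 3. Laplace approximation: q(theta) = N(theta_map, -1/d2).
print("MAP:", theta_map, "approximate posterior variance:", -1.0 / d2)
```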
20. Variational inference
  - Approximate the posterior with a parametric distribution q(w | λ)
  - Minimize KL[ q(w | λ) || p(w | D) ] with respect to the variational parameters λ
  - The KL itself contains the unknown log p(D), so instead maximize the ELBO L(D; λ):
    KL[ q(w | λ) || p(w | D) ] = −∫ q(w | λ) log [ p(D | w) p(w) / q(w | λ) ] dw + log p(D) = −L(D; λ) + log p(D)
  - Since log p(D) does not depend on λ, minimizing the KL is equivalent to maximizing the ELBO
21. Maximizing the ELBO L(D; λ)
  1. Sampling: approximate the intractable expectations involving p(w | D) by Monte Carlo
  2. Monte Carlo estimate of the ELBO itself (see the sketch below):
     ∫ q(w | λ) log [ p(D | w) p(w) / q(w | λ) ] dw = E_q[ log p(D | w) p(w) / q(w | λ) ]
  3. EM-style analytic updates, when the ELBO can be computed in closed form
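A minimal sketch of the Monte Carlo ELBO estimate (item 2 above) for a toy one-parameter model with a Gaussian q(w | λ); the model and all names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.array([0.8, 1.1, 0.9, 1.3])

def log_joint(w):
    # log p(D|w) + log p(w): Gaussian likelihood, standard normal prior (toy model).
    return -0.5 * np.sum((data - w) ** 2) - 0.5 * w ** 2

def elbo_mc(mu, log_sigma, n_samples=1000):
    """E_q[ log p(D|w) + log p(w) - log q(w|lambda) ], estimated by sampling w ~ q(w|lambda)."""
    sigma = np.exp(log_sigma)
    w = mu + sigma * rng.normal(size=n_samples)
    log_q = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((w - mu) / sigma) ** 2
    return np.mean(np.array([log_joint(wi) for wi in w]) - log_q)

print(elbo_mc(mu=1.0, log_sigma=-1.0))
```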
22. History (following Gal's thesis)
  - Denker, Schwartz, Wittner, Solla, Howard, Jackel, Hopfield (1987)
  - Denker and LeCun (1991)
  - MacKay (1992)
  - Hinton and van Camp (1993)
  - Neal (1995)
  - Barber and Bishop (1998)
  - Graves (2011)
  - Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015)
  - Hernández-Lobato and Adams (2015)
23. Practical Variational Inference for Neural Networks [Graves 2011]
  - Fully factorized (diagonal) Gaussian posterior q(w | λ) with λ = {μ, σ}
  - ELBO: L(D; λ) = ∫ q(w | λ) log [ p(D | w) p(w) / q(w | λ) ] dw = E_{q(w|λ)}[ log p(D | w) ] − KL[ q(w | λ) || p(w) ]
  - Gradients of the expected log-likelihood term are estimated from S Monte Carlo samples w_s ~ q(w | λ):
    ∂/∂μ E_q[ log p(D | w) ] ≈ (1/S) Σ_s ∂ log p(D | w_s) / ∂w
    ∂/∂σ² E_q[ log p(D | w) ] ≈ (1/2S) Σ_s ∂² log p(D | w_s) / ∂w²
24. Weight Uncertainty in Neural Networks [Blundell+ 2015] (Bayes by Backprop)
  - Diagonal Gaussian variational posterior q(w | λ)
  - Gradient of the ELBO via the reparameterization trick:
    ∂/∂λ L(D; λ) = ∂/∂λ E_{q(w|λ)}[ f(w, λ) ],  with f(w, λ) = log [ p(D | w) p(w) / q(w | λ) ]
    = E_{q(ε)}[ (∂f(w, λ)/∂w)(∂w/∂λ) + ∂f(w, λ)/∂λ ]
  - Reparameterization trick: w = μ + diag(σ) ⊙ ε,  ε ~ N(0, I)
  - http://www.slideshare.net/masa_s/weight-uncertainty-in-neural-networks
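A minimal sketch of the reparameterized (pathwise) ELBO gradient for a single scalar weight, reusing the same toy Gaussian model as above. Finite differences stand in for backpropagation so the example stays self-contained; this is an illustration of the trick, not the Blundell et al. implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([0.8, 1.1, 0.9, 1.3])

def log_joint(w):
    # log p(D|w) + log p(w): Gaussian likelihood, standard normal prior (toy model).
    return -0.5 * np.sum((data - w) ** 2) - 0.5 * w ** 2

def f(lam, eps):
    """f(w, lambda) = log p(D|w) + log p(w) - log q(w|lambda), with w = mu + sigma * eps."""
    mu, log_sigma = lam
    sigma = np.exp(log_sigma)
    w = mu + sigma * eps                      # reparameterization: the noise eps is held fixed
    log_q = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * eps ** 2
    return log_joint(w) - log_q

def grad_elbo(lam, n_samples=200, h=1e-5):
    """Pathwise Monte Carlo gradient of the ELBO w.r.t. lambda = (mu, log sigma),
    using central finite differences in place of backpropagation."""
    g = np.zeros(2)
    for _ in range(n_samples):
        eps = rng.normal()
        for i in range(2):
            d = np.zeros(2); d[i] = h
            g[i] += (f(lam + d, eps) - f(lam - d, eps)) / (2 * h)
    return g / n_samples

lam = np.array([0.0, -1.0])                   # lambda = (mu, log sigma)
for _ in range(100):
    lam += 0.05 * grad_elbo(lam)              # stochastic gradient ascent on the ELBO
print("mu, sigma:", lam[0], np.exp(lam[1]))   # should approach the exact Gaussian posterior
```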
25. Dropout as a Bayesian Approximation [Gal+ 2015]
  - Variational distribution over the weight matrices: q(w | λ) = ∏_i q(W_i | M_i), with W_i = M_i · diag(z_i) and z_{i,j} ~ Bernoulli(p_i)
  - Minimization objective (negative ELBO): L_VI = −E_{q(w|λ)}[ log p(D | w) ] + KL[ q(w | λ) || p(w) ]
  - z_{i,j} = 0 corresponds to dropping unit j as an input to layer i, so optimizing this objective is exactly training with dropout
  - The same argument extends to drop-connect and multiplicative Gaussian noise
26. MC dropout
  - Keep dropout on at test time and average T stochastic forward passes:
    E_{q(y*|x*)}[ y* ] ≈ (1/T) Σ_{t=1}^T y*(x*, W_1^t, …, W_L^t)
  - Standard dropout (multiplying each W_i by p_i at test time, i.e. weight averaging) approximates this model average
  - The second raw moment is estimated the same way, giving the predictive variance:
    Var_{q(y*|x*)}[ y* ] ≈ τ⁻¹ I_D + (1/T) Σ_t y*(x*, W_1^t, …, W_L^t)ᵀ y*(x*, W_1^t, …, W_L^t) − E[y*]ᵀ E[y*]
  - Works with existing dropout-trained networks; the T forward passes can be run concurrently, so the running time stays comparable to standard dropout
  - http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
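A minimal sketch of MC dropout at prediction time, with a tiny hand-rolled ReLU network in NumPy (the weights here are random placeholders; in practice one keeps dropout enabled in the deep learning framework and averages T stochastic forward passes):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical "trained" weights of a one-hidden-layer ReLU network.
W1, b1 = rng.normal(size=(1, 50)), np.zeros(50)
W2, b2 = rng.normal(size=(50, 1)), np.zeros(1)
p_keep = 0.9   # keep probability (dropout rate 0.1)

def stochastic_forward(x):
    """One forward pass with dropout left ON, i.e. one sample from the approximate posterior."""
    h = np.maximum(0, x @ W1 + b1)
    mask = rng.binomial(1, p_keep, size=h.shape) / p_keep   # inverted dropout mask
    return (h * mask) @ W2 + b2

def mc_dropout_predict(x, T=100):
    """Predictive mean and standard deviation from T stochastic forward passes."""
    ys = np.stack([stochastic_forward(x) for _ in range(T)])
    return ys.mean(axis=0), ys.std(axis=0)

mean, std = mc_dropout_predict(np.array([[0.5]]))
print(mean, std)
```

In Gal and Ghahramani's formulation an observation-noise term τ⁻¹I is added to the sample variance; the sketch above omits it.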
27. MC dropout: qualitative results
  - [Figure 2 from "Dropout as a Bayesian Approximation": predictive mean and uncertainties on the Mauna Loa CO₂ concentrations dataset for (a) standard dropout with weight averaging, (b) a Gaussian process with SE covariance function, (c) MC dropout with ReLU non-linearities, (d) MC dropout with TanH non-linearities. Far from the training data, standard dropout confidently predicts an insensible value; the other models also predict insensible values, but with the additional information that they are uncertain about their predictions]
28. Variational Dropout and the Local Reparameterization Trick [Kingma+ 2015]
  - With one weight sample shared by the whole minibatch, the per-example likelihood terms L_i are correlated; the covariance terms Cov[L_i, L_j] in the gradient variance do not shrink as the minibatch size M grows
  - Local reparameterization trick: sample the pre-activations instead of the weights, so that Cov[L_i, L_j] = 0 and the gradient variance scales as 1/M
  - Connects (Gaussian) dropout to variational inference: variational dropout
  - http://www.slideshare.net/masa_s/dl-hacks-variational-dropout-and-the-local-reparameterization-trick
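A minimal sketch contrasting the two sampling strategies for one linear layer with a factorized Gaussian posterior over the weights: sampling the weight matrix once per minibatch versus sampling the pre-activations per example (the local reparameterization trick). Shapes and names are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, J = 8, 4, 3                      # minibatch size, input dim, output dim
A = rng.normal(size=(M, K))            # minibatch of layer inputs
mu = rng.normal(size=(K, J))           # q(W): elementwise Gaussian with mean mu, std sigma
sigma = 0.1 * np.ones((K, J))

# (a) Global reparameterization: one weight sample shared by the whole minibatch,
#     so the per-example likelihood contributions are correlated.
eps_w = rng.normal(size=(K, J))
B_global = A @ (mu + sigma * eps_w)

# (b) Local reparameterization: sample the pre-activations directly.
#     b_mj ~ N(gamma_mj, delta_mj) with gamma = A mu and delta = (A**2)(sigma**2),
#     giving independent noise per example and lower-variance gradients.
gamma = A @ mu
delta = (A ** 2) @ (sigma ** 2)
B_local = gamma + np.sqrt(delta) * rng.normal(size=(M, J))

print(B_global.shape, B_local.shape)
```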
29. Automatic Differentiation Variational Inference
30.
  - Deriving model-specific variational updates by hand is laborious → Automatic Differentiation Variational Inference (ADVI)
31. Automatic Differentiation Variational Inference (ADVI) [Kucukelbir+ 2016]
  - Implemented in Stan, PyMC3, and Edward
  - ADVI procedure:
    1. Automatically transform the latent variables θ so that the model p(x, θ) becomes p(x, ζ) with unconstrained real-valued ζ, and minimize KL[ q(ζ) || p(ζ | x) ]
    2. Estimate the ELBO and its gradient by Monte Carlo, using the reparameterization of q
    3. Optimize the parameters of q by stochastic gradient ascent
32. Automatic transformation
  - The latent variables θ may have constrained support (e.g. θ > 0)
  - Transform them to real-valued coordinates ζ = T(θ) ∈ R^K
  - The transformed joint density requires a Jacobian adjustment:
    p(x, ζ) = p(x, T⁻¹(ζ)) |det J_{T⁻¹}(ζ)|
  - [Figure 1 from Kucukelbir+ 2016: transforming the latent variable to real coordinate space; (a) the latent variable space is R_{>0}, (b) T maps it to R, where the variational approximation is a Gaussian; the purple line is the posterior, the green line the approximation]
33. Automatic transformation: example
  - Running Weibull-Poisson example: θ ∈ R_{>0}, so ζ = T(θ) = log(θ) maps R_{>0} to the real line
  - Jacobian adjustment: |det J_{T⁻¹}(ζ)| = exp(ζ)
  - Transformed density: p(x, ζ) = Poisson(x | exp(ζ)) × Weibull(exp(ζ); 1.5, 1) × exp(ζ)
  - Stan maintains a library of transformations and their Jacobians, so the joint density of any differentiable probability model can be transformed automatically to one with real-valued latent variables
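A minimal sketch of this transformation step with the same Weibull-Poisson example: evaluate the transformed log joint log p(x, ζ) = log p(x, T⁻¹(ζ)) + log |det J_{T⁻¹}(ζ)| for ζ = log θ. It uses SciPy distributions and mirrors what Stan does internally; it is not Stan's actual code:

```python
import numpy as np
from scipy import stats

x = np.array([2, 1, 0, 3])             # observed counts (made up)

def log_joint_constrained(theta):
    # log p(x, theta) = sum_n log Poisson(x_n | theta) + log Weibull(theta; 1.5, 1), theta > 0
    return stats.poisson.logpmf(x, theta).sum() + stats.weibull_min.logpdf(theta, c=1.5)

def log_joint_real(zeta):
    # zeta = T(theta) = log(theta); Jacobian adjustment log |det J_{T^-1}(zeta)| = zeta
    theta = np.exp(zeta)
    return log_joint_constrained(theta) + zeta

print(log_joint_real(0.3))
```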
34. Variational approximations in real coordinate space
  - After the transformation, the latent variables ζ live in R^K; the variational approximation is a Gaussian there, which implicitly induces a non-Gaussian approximation in the original latent variable space
  - Mean-field Gaussian: q(ζ; φ) = N(ζ; μ, diag(σ²)) = ∏_{k=1}^K N(ζ_k; μ_k, σ_k²); re-parameterizing ω = log(σ) elementwise makes the parameters φ = (μ, ω) unconstrained in R^{2K}
  - Full-rank Gaussian: q(ζ; φ) = N(ζ; μ, Σ) with Σ = LLᵀ via a Cholesky factor whose diagonal need not be positive; φ = (μ, L) is unconstrained in R^{K + K(K+1)/2}
  - The full-rank Gaussian generalizes the mean-field one: the off-diagonal terms of the covariance matrix Σ capture posterior correlations across latent variables, giving a more accurate approximation at a higher computational cost; low-rank approximations to Σ reduce this cost but limit its expressiveness
  - [Figure 2 from Kucukelbir+ 2016: specifying a simple nonconjugate probability model in Stan, with x[n] ~ poisson(theta)]
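A minimal sketch of the two unconstrained parameterizations: mean-field with φ = (μ, ω = log σ) and full-rank with φ = (μ, L), drawing ζ by reparameterization in both cases. Dimensions and parameter values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3

# Mean-field Gaussian: phi = (mu, omega), with omega = log(sigma) unconstrained.
mu = np.zeros(K)
omega = np.full(K, -1.0)
eta = rng.normal(size=K)
zeta_mf = mu + np.exp(omega) * eta                 # zeta ~ N(mu, diag(exp(omega)^2))

# Full-rank Gaussian: phi = (mu, L), Sigma = L L^T via an unconstrained Cholesky factor.
L = np.tril(0.3 * rng.normal(size=(K, K)))
zeta_fr = mu + L @ rng.normal(size=K)              # zeta ~ N(mu, L L^T)

print(zeta_mf, zeta_fr)
```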