
(DL Hacks reading group) Variational Inference with Rényi Divergence

Reading-group session: 2016/02/17



1. Variational Inference with Rényi Divergence
2. About the paper
• Posted on arXiv in February 2016
• Authors: Yingzhen Li and Richard E. Turner, University of Cambridge
• The first author, Li (a D3 student), also wrote "Stochastic Expectation Propagation" (NIPS)
• Generalises variational inference using the Rényi divergence
• Unifies the VAE and the importance weighted AE [Burda et al., 2015] in one framework
• Proofs are deferred to the Appendix
3. Background: variational inference (cf. PRML)
• The log marginal likelihood decomposes into a lower bound plus a KL term:
  $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \| p)$
• Maximising the lower bound $\mathcal{L}(q)$ is therefore equivalent to minimising the KL divergence from $q$ to the true posterior
4. Related work
• Stochastic variational inference (SVI) [Hoffman et al., 2013]
• Stochastic expectation propagation (SEP) [Li et al., 2015]
• Black-box variational inference [Ranganath et al., 2014]
• Black-box alpha (BB-α) [Hernandez-Lobato et al., 2015]
• Importance weighted auto-encoder (IWAE) [Burda et al., 2015]: a VAE trained with a tighter bound, ICLR 2016
5. Variational inference [Jordan et al., 1999; Beal, 2003]
• Observe i.i.d. samples $D = \{x_n\}_{n=1}^N$ from a model $p(x|\theta)$ whose parameter $\theta$ is drawn from a prior $p_0(\theta)$
• Bayesian inference computes the posterior of the parameters given the data,
  $p(\theta|D) = \frac{p(\theta, D)}{p(D)} = \frac{p_0(\theta) \prod_{n=1}^{N} p(x_n|\theta)}{p(D)}$,
  where $p(D) = \int p_0(\theta) \prod_{n=1}^{N} p(x_n|\theta)\, d\theta$ is the marginal likelihood or model evidence; for many powerful models, including Bayesian neural networks, the true posterior is typically intractable (see the minimum-description-length literature [Grünwald, 2007] for related background)
• Variational inference introduces an approximation $q(\theta)$ to the true posterior, obtained by minimising the KL divergence within a tractable family $\mathcal{Q}$:
  $q(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(\theta) \| p(\theta|D)]$
• This KL is itself intractable, mainly because of the difficult term $p(D)$; VI sidesteps the difficulty with the equivalent problem $q(\theta) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}_{VI}(q; D)$, where the variational lower bound (ELBO) is
  $\mathcal{L}_{VI}(q; D) = \log p(D) - \mathrm{KL}[q(\theta) \| p(\theta|D)] = \mathbb{E}_q\!\left[\log \frac{p(\theta, D)}{q(\theta)}\right]$
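To make the ELBO concrete, here is a minimal Monte Carlo sketch for a toy model with $p_0(\theta) = \mathcal{N}(0,1)$, $p(x|\theta) = \mathcal{N}(\theta, 1)$ and a Gaussian $q$; the model choice and the helper names (log_gauss, elbo_estimate) are our illustrative assumptions, not code from the paper.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    # Elementwise log density of N(mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def elbo_estimate(data, q_mu, q_sigma, n_samples=1000, seed=0):
    # L_VI(q; D) = E_q[ log p0(theta) + sum_n log p(x_n|theta) - log q(theta) ]
    rng = np.random.default_rng(seed)
    theta = rng.normal(q_mu, q_sigma, size=n_samples)           # theta_k ~ q(theta)
    log_prior = log_gauss(theta, 0.0, 1.0)                      # p0(theta) = N(0, 1)
    log_lik = log_gauss(data[:, None], theta, 1.0).sum(axis=0)  # p(x_n|theta) = N(theta, 1)
    log_q = log_gauss(theta, q_mu, q_sigma)
    return np.mean(log_prior + log_lik - log_q)

data = np.array([0.5, 1.5, 1.0])
print(elbo_estimate(data, q_mu=0.75, q_sigma=0.5))  # a lower bound on log p(D)
```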
6. Variational auto-encoder (VAE) [Kingma and Welling, 2014; Rezende et al., 2014]
• A recently proposed deep generative model that parametrises the variational approximation with a recognition network
• The generative model is a hierarchical latent variable model (the parameters $\theta$ are dropped from the notation but are learned by approximate maximum likelihood):
  $p(x) = \sum_{h^{(1)} \ldots h^{(L)}} p(h^{(L)})\, p(h^{(L-1)}|h^{(L)}) \cdots p(x|h^{(1)})$
• Exact computation of $\log p(x)$ requires marginalising all hidden variables and is thus often intractable; variational expectation-maximisation (EM) approximates
  $\log p(x) \approx \mathcal{L}_{VI}(q; x) = \mathbb{E}_{q(h|x)}\!\left[\log \frac{p(x, h)}{q(h|x)}\right]$,
  where $h$ collects the hidden variables $h^{(1)}, \ldots, h^{(L)}$ and the approximate posterior factorises as
  $q(h|x) = q(h^{(1)}|x)\, q(h^{(2)}|h^{(1)}) \cdots q(h^{(L)}|h^{(L-1)})$
• Variational EM alternates the optimisation of $q$ and $p$ to guarantee convergence; the core idea of the VAE is instead to optimise $p$ and $q$ jointly, which has no guarantee of increasing the MLE objective each iteration and is indeed biased [Turner and Sahani, 2011]; this opens the possibility that alternative surrogates give tighter estimates, and the VR bound is considered in this context:
  $\mathcal{L}_\alpha(q; x) = \frac{1}{1-\alpha} \log \mathbb{E}_{q(h|x)}\!\left[\left(\frac{p(x, h)}{q(h|x)}\right)^{1-\alpha}\right]$
• Standard modelling choices [Burda et al., 2015; Kingma and Welling, 2014]: the prior $p(h^{(L)})$ is a zero-mean, unit-variance Gaussian; each conditional $p(h^{\ell}|h^{\ell+1})$ and $q(h^{\ell}|h^{\ell-1})$ is a diagonal-covariance Gaussian whose mean and covariance are computed by a deterministic feed-forward network; $p(x|h^{(1)})$ is such a Gaussian for real-valued observations and a Bernoulli for binary observations ($L$ counts only the stochastic layers)
• The VAE is trained to maximise a variational lower bound on the log-likelihood, derived from Jensen's inequality:
  $\log p(x) = \log \mathbb{E}_{q(h|x)}\!\left[\frac{p(x,h)}{q(h|x)}\right] \ge \mathbb{E}_{q(h|x)}\!\left[\log \frac{p(x,h)}{q(h|x)}\right] = \mathcal{L}(x)$
  Since $\mathcal{L}(x) = \log p(x) - \mathrm{KL}(q(h|x) \| p(h|x))$, training trades off data fit against how well $q$ matches the true posterior
7. VAE: the reparameterization trick
• Each recognition conditional $q(h^{\ell}|h^{\ell-1}, \theta)$ is a Gaussian whose mean and covariance are computed from the states of the previous layer and the model parameters, so a sample can be expressed by first drawing an auxiliary variable $\epsilon^{\ell} \sim \mathcal{N}(0, I)$ and then applying the deterministic mapping
  $h^{\ell}(\epsilon^{\ell}, h^{\ell-1}, \theta) = \Sigma(h^{\ell-1}, \theta)^{1/2}\, \epsilon^{\ell} + \mu(h^{\ell-1}, \theta)$
  (the general reparameterization trick covers other distributions; Gaussians are the special case used here)
• Applying this per layer expresses $q(h|x, \theta)$ through a deterministic mapping $h(\epsilon, x, \theta)$ with $\epsilon = (\epsilon^1, \ldots, \epsilon^L)$; since the distribution of $\epsilon$ does not depend on $\theta$, the gradient operator can be pushed inside the expectation:
  $\nabla_\theta \log \mathbb{E}_{h \sim q(h|x,\theta)}\!\left[\frac{p(x,h|\theta)}{q(h|x,\theta)}\right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\!\left[\nabla_\theta \log \frac{p(x, h(\epsilon,x,\theta)|\theta)}{q(h(\epsilon,x,\theta)|x,\theta)}\right]$
• With $h$ a deterministic feed-forward network, the inner gradient is computed by standard backpropagation for fixed $\epsilon$; in practice one draws $k$ samples of $\epsilon$ and applies the Monte Carlo estimator
  $\frac{1}{k} \sum_{i=1}^{k} \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta)$, with $w(x, h, \theta) = \frac{p(x,h|\theta)}{q(h|x,\theta)}$,
  which is an unbiased estimate of $\nabla_\theta \mathcal{L}(x)$
• The VAE update and the basic REINFORCE-like update are both unbiased estimators of the same gradient, but the VAE update tends to have lower variance in practice
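For intuition, here is a self-contained pathwise-gradient sketch for the simplest case $p(h) = \mathcal{N}(0,1)$, $p(x|h) = \mathcal{N}(h,1)$, $q(h|x) = \mathcal{N}(\mu, \sigma^2)$, where the per-sample gradient works out analytically; the toy model and the function name are our assumptions, not the paper's setup.

```python
import numpy as np

def grad_mu_elbo(x, mu, sigma, k=64, seed=1):
    # k-sample pathwise estimator of d/dmu E_q[log w], w = p(x,h)/q(h|x).
    # With h = mu + sigma*eps, log q(h|x) = const - eps^2/2 is free of mu,
    # and (d log p(x,h)/dh) * (dh/dmu) = (-h + (x - h)) * 1 = x - 2h.
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=k)        # eps_i ~ N(0, 1)
    h = mu + sigma * eps            # deterministic mapping h(eps)
    return np.mean(x - 2.0 * h)

# The estimator's expectation is x - 2*mu, so this should print roughly 1.0:
print(grad_mu_elbo(x=1.0, mu=0.0, sigma=1.0, k=100000))
```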
8. VAE: the SGVB estimators [Kingma et al., 2014]
• Under mild conditions, the random variable $\tilde{z} \sim q_\phi(z|x)$ can be reparameterized via a differentiable transformation of an auxiliary noise variable: $\tilde{z} = g_\phi(\epsilon, x)$ with $\epsilon \sim p(\epsilon)$ (the technique also applies to $q_\phi(z)$ without conditioning on $x$; a fully Bayesian treatment of the parameters is in the appendix)
• Monte Carlo estimates of expectations under $q$ then follow:
  $\mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] \approx \frac{1}{L} \sum_{l=1}^{L} f(g_\phi(\epsilon^{(l)}, x^{(i)}))$, $\epsilon^{(l)} \sim p(\epsilon)$
• Applied to the variational lower bound this gives the generic Stochastic Gradient Variational Bayes (SGVB) estimator
  $\tilde{\mathcal{L}}^{A}(\theta, \phi; x^{(i)}) = \frac{1}{L} \sum_{l=1}^{L} \left[\log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)})\right]$, $z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)})$
• Often the KL term $D_{KL}(q_\phi(z|x^{(i)}) \| p_\theta(z))$ can be integrated analytically, so only the expected reconstruction error requires sampling; the KL then acts as a regulariser pulling the approximate posterior toward the prior, yielding a second estimator with typically lower variance:
  $\tilde{\mathcal{L}}^{B}(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z|x^{(i)}) \| p_\theta(z)) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)})$
• For a dataset $X$ of $N$ points, a minibatch estimator is $\mathcal{L}(\theta, \phi; X) \approx \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)})$; in the authors' experiments $L = 1$ sample per datapoint sufficed when the minibatch size $M$ was large enough (e.g. $M = 100$), with gradients fed to SGD or Adagrad [DHS10]
• Auto-encoder view of $\tilde{\mathcal{L}}^{B}$: the KL term regularises, the second term is an expected negative reconstruction error, and $g_\phi(\cdot)$ maps a datapoint and a noise vector to a sample from the approximate posterior
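The analytically integrable KL term mentioned above has a well-known closed form when $q_\phi(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and the prior is $\mathcal{N}(0, I)$; a small sketch (the function name is ours):

```python
import numpy as np

def kl_diag_gauss_to_std_normal(mu, log_var):
    # D_KL( N(mu, diag(exp(log_var))) || N(0, I) )
    #   = -0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

print(kl_diag_gauss_to_std_normal(np.zeros(3), np.zeros(3)))  # 0.0: q equals the prior
```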
9. Importance weighted AE (IWAE) [Burda et al., 2015]
• The VAE criterion may be too strict: it demands a posterior that is approximately factorial and predictable by a feed-forward network; a recognition network that places only a small fraction of its samples in the high-posterior region may still suffice for accurate inference, and lowering the standard in this way allows training generative networks whose posteriors do not fit the VAE assumptions
• The IWAE uses the same architecture as the VAE, with both a generative and a recognition network, but maximises the $k$-sample importance weighted lower bound on $\log p(x)$:
  $\mathcal{L}_k(x) = \mathbb{E}_{h_1,\ldots,h_k \sim q(h|x)}\!\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i|x)}\right]$,
  where the $h_i$ are sampled independently from the recognition model and $w_i = p(x, h_i)/q(h_i|x)$ are the unnormalized importance weights
• This lower-bounds the marginal log-likelihood by Jensen's inequality, since the average importance weight is an unbiased estimator of $p(x)$:
  $\mathcal{L}_k = \mathbb{E}\!\left[\log \frac{1}{k}\sum_i w_i\right] \le \log \mathbb{E}\!\left[\frac{1}{k}\sum_i w_i\right] = \log p(x)$
• $k = 1$ recovers the VAE; for all $k$, $\log p(x) \ge \mathcal{L}_{k+1} \ge \mathcal{L}_k$, and if $p(h,x)/q(h|x)$ is bounded, $\mathcal{L}_k \to \log p(x)$ as $k \to \infty$ (proof in their Appendix A)
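Given log importance weights, $\mathcal{L}_k$ is just a log-mean-exp, best computed with the usual max-shift for numerical stability; a sketch under that reading (the helper name is ours):

```python
import numpy as np

def iwae_bound(log_w):
    # L_k = log (1/k) sum_i w_i, with log_w[i] = log p(x, h_i) - log q(h_i|x);
    # subtracting the max keeps exp() from over/underflowing (log-sum-exp trick)
    log_w = np.asarray(log_w)
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))

# On toy weights, larger k typically tightens the bound (exactly so in expectation):
log_w = np.random.default_rng(0).normal(size=10000)
print(iwae_bound(log_w[:1]), iwae_bound(log_w[:100]), iwae_bound(log_w))
```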
10. Rényi's α-divergence
• For two distributions $p$ and $q$ on a random variable $\theta \in \Theta$:
  $D_\alpha[p\|q] = \frac{1}{\alpha - 1} \log \int p(\theta)^\alpha\, q(\theta)^{1-\alpha}\, d\theta$, for $\alpha > 0$, $\alpha \neq 1$
  (for $\alpha > 1$ the definition is valid when finite; for discrete variables the integration is replaced by summation)
• $\alpha \to 1$ recovers the Kullback-Leibler (KL) divergence, which plays a crucial role in machine learning and information theory:
  $D_1[p\|q] = \lim_{\alpha \to 1} D_\alpha[p\|q] = \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, d\theta = \mathrm{KL}[p\|q]$
• The values $\alpha = 0, +\infty$ are defined by continuity in $\alpha$:
  $D_0[p\|q] = -\log \int_{p(\theta) > 0} q(\theta)\, d\theta$, $\qquad D_{+\infty}[p\|q] = \log \max_{\theta \in \Theta} \frac{p(\theta)}{q(\theta)}$
• Another special case is $\alpha = \frac{1}{2}$, where the divergence is a function of the squared Hellinger distance $\mathrm{Hel}^2[p\|q] = \frac{1}{2}\int (\sqrt{p(\theta)} - \sqrt{q(\theta)})^2\, d\theta$:
  $D_{1/2}[p\|q] = -2 \log(1 - \mathrm{Hel}^2[p\|q])$
• [Van Erven and Harremoës, 2014] also extend the definition to negative $\alpha$, where it is non-positive and thus no longer a valid divergence measure; the proposed method makes use of this regime
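The definition can be sanity-checked numerically. Below we integrate $p^\alpha q^{1-\alpha}$ on a grid for two unit-variance 1-D Gaussians (a setting we chose; $\mathrm{KL}[\mathcal{N}(0,1)\|\mathcal{N}(1,1)] = 0.5$ in closed form) and confirm that $D_\alpha \to \mathrm{KL}$ as $\alpha \to 1$:

```python
import numpy as np

def gauss(t, mu, s):
    return np.exp(-0.5 * ((t - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def renyi_div(alpha, mu_p=0.0, mu_q=1.0, s=1.0):
    # D_alpha[p||q] = 1/(alpha-1) * log  integral of p(t)^alpha q(t)^(1-alpha) dt,
    # approximated with a Riemann sum on a wide grid
    t, dt = np.linspace(-20.0, 20.0, 100001, retstep=True)
    integrand = gauss(t, mu_p, s) ** alpha * gauss(t, mu_q, s) ** (1.0 - alpha)
    return np.log(np.sum(integrand) * dt) / (alpha - 1.0)

print(renyi_div(0.999))  # ~0.4995, approaching KL = 0.5 as alpha -> 1
print(renyi_div(0.5))    # the Hellinger case, -2*log(1 - Hel^2)
```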
11. The variational Rényi (VR) bound
• The Rényi family includes the KL divergence used by VI: can variational free-energy approaches be generalised to the Rényi case?
• Consider approximating the exact posterior $p(\theta|D)$ by minimising Rényi's α-divergence for some selected $\alpha \ge 0$:
  $q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta) \| p(\theta|D)]$
• One can verify the alternative optimisation problem
  $q(\theta) = \arg\max_{q \in \mathcal{Q}} \{\log p(D) - D_\alpha[q(\theta) \| p(\theta|D)]\}$
• For $\alpha \neq 1$ the objective can be rewritten as
  $\log p(D) - \frac{1}{\alpha - 1} \log \int q(\theta)^\alpha\, p(\theta|D)^{1-\alpha}\, d\theta = \frac{1}{1-\alpha} \log \mathbb{E}_q\!\left[\left(\frac{p(\theta, D)}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q; D)$
• This new objective is named the variational Rényi (VR) bound
12. Monte Carlo estimation of the VR bound
• Variational free-energy methods sidestep intractabilities, and recent work uses Monte Carlo approximations to expand the class of models that can be handled; this section develops a scalable optimisation framework for the VR bound by extending recent advances of traditional VI, with black-box methods used to enable arbitrary finite $\alpha$ settings
• Simple Monte Carlo with finite samples $\theta_k \sim q(\theta)$, $k = 1, \ldots, K$:
  $\hat{\mathcal{L}}_{\alpha,K}(q; D) = \frac{1}{1-\alpha} \log \frac{1}{K} \sum_{k=1}^{K} \left(\frac{p(\theta_k, D)}{q(\theta_k)}\right)^{1-\alpha}$
• Unlike traditional VI, this Monte Carlo estimate is biased, since the expectation over $q(\theta)$ sits inside the logarithm; however the bias can be bounded:
• Theorem 2: $\mathbb{E}_{\{\theta_k\}}[\hat{\mathcal{L}}_{\alpha,K}(q; D)]$, as a function of $\alpha$ and $K$, is 1) non-decreasing in $K$ for fixed $\alpha \le 1$, with limit $\mathcal{L}_\alpha$ as $K \to +\infty$ if $|p/q|$ is bounded, and 2) continuous and non-increasing in $\alpha$ on $\{\alpha \le 1\} \cup \{|\mathcal{L}_\alpha| < +\infty\}$
• Corollary 1: for $K < +\infty$ there exists $\alpha_K < 0$ such that $\mathbb{E}[\hat{\mathcal{L}}_{\alpha,K}(q; D)] \le \log p(D)$ for all $\alpha \ge \alpha_K$; furthermore $\alpha_K$ is non-decreasing in $K$, with $\lim_{K \to 1} \alpha_K = -\infty$ and $\lim_{K \to +\infty} \alpha_K = 0$
• Interpretation: the exact VR bound lower-bounds or upper-bounds $\log p(D)$ for $\alpha > 0$ or $\alpha < 0$ respectively; for $\alpha \le 1$ the sampling approximation in expectation under-estimates the exact bound $\mathcal{L}_\alpha$, and the approximation quality improves with more samples, so with finite samples negative $\alpha$ values can improve accuracy; this is evaluated empirically with 2-D Gaussians with identity covariance, $\mu_p = [0,0]$ and $\mu_q = [1,1]$ (Figure 2)
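A direct implementation of $\hat{\mathcal{L}}_{\alpha,K}$ from log weights (the function name is ours, and the demo weights are synthetic); note how $\alpha = 1$, $\alpha = 0$, and very negative $\alpha$ recover the ELBO-style average, the IWAE objective, and (in the limit) the maximum log weight:

```python
import numpy as np

def vr_bound_estimate(log_w, alpha):
    # hat L_{alpha,K} = 1/(1-alpha) * log (1/K) sum_k w_k^(1-alpha),
    # with log_w[k] = log p(theta_k, D) - log q(theta_k); biased for alpha != 1
    log_w = np.asarray(log_w)
    if np.isclose(alpha, 1.0):
        return np.mean(log_w)              # alpha -> 1 recovers the ELBO average
    a = (1.0 - alpha) * log_w
    m = np.max(a)                          # shift for numerical stability
    return (m + np.log(np.mean(np.exp(a - m)))) / (1.0 - alpha)

log_w = np.random.default_rng(0).normal(size=50)
print(vr_bound_estimate(log_w, 1.0))    # ELBO-style average
print(vr_bound_estimate(log_w, 0.0))    # identical to the IWAE bound
print(vr_bound_estimate(log_w, -50.0))  # close to max(log_w): the VR-max regime
```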
13. Bounding properties of the sampled VR bound
• [Figure 2: (a) an illustration of the bounding properties of sampling approximations to the VR bounds, with $\alpha_2 < 0 < \alpha_1 < 1$ and $1 < K_1 < K_2 < +\infty$; (b) the bias of the sampling estimate of the (negative) alpha-divergence]
• For $\alpha \le 1$ the sampled estimate under-estimates the exact VR bound, and the gap shrinks as the number of samples $K$ grows
14. Relation between the VR bound and the IWAE
• Setting $\alpha = 0$ in the sampled bound, $\hat{\mathcal{L}}_{0,K}$, recovers exactly the IWAE objective $\mathcal{L}_K$
• For $\alpha \le 1$ the estimate remains a biased under-estimate of the exact bound, as illustrated in Figure 2
15. VR-max
• Apply the reparameterization trick to the VR bound:
  $\mathcal{L}_\alpha(q_\phi; D) = \frac{1}{1-\alpha} \log \mathbb{E}_\epsilon\!\left[\left(\frac{p(g_\phi(\epsilon), D)}{q(g_\phi(\epsilon))}\right)^{1-\alpha}\right]$
• The gradient of the VR bound w.r.t. $\phi$ is
  $\nabla_\phi \mathcal{L}_\alpha(q_\phi; D) = \mathbb{E}_\epsilon\!\left[w_\alpha(\epsilon; \phi, D)\, \nabla_\phi \log \frac{p(g_\phi(\epsilon), D)}{q(g_\phi(\epsilon))}\right]$,
  where $w_\alpha(\epsilon; \phi, D) \propto \left(\frac{p(g_\phi(\epsilon), D)}{q(g_\phi(\epsilon))}\right)^{1-\alpha}$ denotes the normalised importance weight
• For finite samples $\epsilon_k \sim p(\epsilon)$, $k = 1, \ldots, K$, the gradient is approximated by
  $\nabla_\phi \hat{\mathcal{L}}_{\alpha,K}(q_\phi; D) = \frac{1}{K} \sum_{k=1}^{K} \hat{w}_{\alpha,k}\, \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), D)}{q(g_\phi(\epsilon_k))}$,
  with $\hat{w}_{\alpha,k}$ the normalised importance weight under finite samples
• Setting $\alpha = 1$ recovers the stochastic gradients of $\mathcal{L}_{VI}$ (the VAE case), so the resulting algorithm unifies the computation for all finite $\alpha$ settings
• To speed up learning, [Burda et al., 2015] suggested back-propagating only one sample $\epsilon_j$; taking $\alpha \to -\infty$ concentrates the normalised weight on the sample with the largest importance weight, giving VR-max
• Algorithm 1 (one gradient step for VR-α / VR-max): 1) sample $\epsilon_1, \ldots, \epsilon_K \sim p(\epsilon)$; 2) for all $k$ and all $n$ in the current minibatch $S$, compute $\log \hat{w}(\epsilon_k; x_n) = \log p(g_\phi(\epsilon_k), x_n) - \log q(g_\phi(\epsilon_k))$; 3) choose the sample $\epsilon_{j_n}$ to back-propagate: if $|\alpha| < \infty$, draw $j_n \sim p_k \propto \hat{w}(\epsilon_k; x_n)^{1-\alpha}$; if $\alpha = -\infty$, set $j_n = \arg\max_k \log \hat{w}(\epsilon_k; x_n)$; 4) return the gradients $\frac{1}{|S|} \sum_{n \in S} \nabla_\phi \log \hat{w}(\epsilon_{j_n}; x_n)$ to the optimiser
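Step 3 of Algorithm 1 amounts to the following sample-selection rule; this is a sketch of our reading of it (names ours), with the weights handled in log space:

```python
import numpy as np

def choose_sample(log_w, alpha, rng=None):
    # VR-alpha: draw j with probability p_k proportional to w_k^(1-alpha);
    # VR-max (alpha = -inf): pick the sample with the largest importance weight
    log_w = np.asarray(log_w)
    if np.isneginf(alpha):
        return int(np.argmax(log_w))
    if rng is None:
        rng = np.random.default_rng()
    a = (1.0 - alpha) * log_w
    p = np.exp(a - np.max(a))   # unnormalised w^(1-alpha), shifted for stability
    p /= p.sum()
    return int(rng.choice(len(log_w), p=p))

log_w = np.random.default_rng(2).normal(size=5)
print(choose_sample(log_w, alpha=0.0), choose_sample(log_w, alpha=-np.inf))
```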
16. Stochastic approximation for large-scale learning
• The VR bounds so far are computed on the whole dataset $D$, and full-batch learning is inefficient at scale; the appendix of [Li et al., 2015] discussed stochastic EP as a way of approximating VR bound optimisation, and here another stochastic approximation enables minibatch training directly on the VR bound
• Write $f_n(\theta) = p(x_n|\theta)$ and define the "average likelihood" $\bar{f}_D(\theta) = [\prod_{n=1}^{N} f_n(\theta)]^{1/N}$, so the joint distribution can be rewritten as $p(\theta, D) = p_0(\theta)\, \bar{f}_D(\theta)^N$
• Sample $M$ datapoints $S = \{x_{n_1}, \ldots, x_{n_M}\} \sim D$ and define the "subset average likelihood" $\bar{f}_S(\theta) = [\prod_{m=1}^{M} f_{n_m}(\theta)]^{1/M}$ (for $M = 1$, $\bar{f}_S(\theta) = f_n(\theta)$ with $S = \{x_n\}$); approximating the VR bound by replacing $\bar{f}_D$ with $\bar{f}_S$ gives
  $\tilde{\mathcal{L}}_\alpha(q; S) = \frac{1}{1-\alpha} \log \mathbb{E}_q\!\left[\left(\frac{p_0(\theta)\, \bar{f}_S(\theta)^N}{q(\theta)}\right)^{1-\alpha}\right]$
• This returns a stochastic estimate of the evidence lower bound when $\alpha \to 1$; for other settings $\alpha \neq 1$, increasing the minibatch size $M = |S|$ reduces the bias of the approximation, guaranteed by Theorem 3: if $q(\theta)$ is Gaussian $\mathcal{N}(\mu, \Sigma)$ and the likelihood has an exponential family form $p(x|\theta) = \exp[\langle \theta, \Phi(x) \rangle - A(\theta)]$, then for $\alpha \le 1$ the stochastic approximation is bounded (proof in the supplementary)
• Note that VR-max is not an instance of the minimum description length principle (MDL): MDL approximates the true posterior by minimising an exact VR bound that upper-bounds the exact log-likelihood function
• Black-box alpha (BB-α) is also connected to VR bound optimisation
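The minibatch surrogate only changes how the log joint is assembled: raising the subset average likelihood to the power $N$ becomes a simple $N/M$ rescaling in log space. A sketch (the function name is ours), whose output can be fed as log weights into the $\hat{\mathcal{L}}_{\alpha,K}$ estimator sketched earlier:

```python
import numpy as np

def minibatch_log_joint(log_prior, log_lik_minibatch, N):
    # log[ p0(theta) * bar_f_S(theta)^N ]
    #   = log p0(theta) + (N/M) * sum_{m=1}^{M} log p(x_{n_m} | theta)
    M = len(log_lik_minibatch)
    return log_prior + (N / M) * np.sum(log_lik_minibatch)

# Example: prior log density -1.2, minibatch of M=2 log-likelihoods, N=1000 points
print(minibatch_log_joint(-1.2, np.array([-0.8, -1.1]), N=1000))  # -1.2 + 500*(-1.9)
```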
17. The VR bound optimisation framework
18. Experiments
19. Experiment 1: generative models
• Three methods are compared:
  • VAE ($\alpha = 1$)
  • IWAE ($\alpha = 0$)
  • VR-max ($\alpha = -\infty$)
• Implemented on the publicly available IWAE code; the original implementation back-propagates all samples to compute gradients, while VR-max back-propagates only the sample with the largest importance weight
• Three datasets: Frey Face (small), Caltech 101 Silhouettes and MNIST; the networks consist of $L = 1$ or $2$ stochastic layers with deterministic layers in between (architecture detailed in the supplementary); MNIST settings (learning rate, number of epochs) follow [Burda et al., 2015]
• Test log-likelihood estimated as $\log p(x) \approx \hat{\mathcal{L}}_{\alpha,K}(q; x)$ with $\alpha = 0$, $K = 5000$
• VR-max performs almost indistinguishably from IWAE on all three datasets
• VR-max needs less training time than IWAE with a full backward pass: on MNIST (Tesla C2075 GPU), VR-max took 25hr29min versus 61hr16min for IWAE
• A single-backward-pass version of IWAE gave a slightly worse MNIST result of -85.02, supporting the Section 4.1 argument that negative $\alpha$ can help when computation resources are limited
• The $\alpha$ value corresponding to the tightest VR bound shifts as the mismatch between $q$ and the typically multimodal true posterior increases
20. Experiment 1: results
Figure 3: sampled images from the VR-max trained auto-encoders on (a) Frey Face, (b) Caltech 101 Silhouettes, (c) MNIST.

Table 1: average test log-likelihood. VAE results on MNIST are collected from [Burda et al., 2015]; IWAE results are reproduced using the publicly available implementation.

  Dataset                   L   K    VAE       IWAE      VR-max
  Frey Face                 1   5    1322.96   1380.30   1377.40
   (± std. err.)                     ±10.03    ±4.60     ±4.59
  Caltech 101 Silhouettes   1   5    -119.69   -117.89   -118.01
                            1   50   -119.61   -117.21   -117.10
  MNIST                     1   5    -86.47    -85.41    -85.42
                            1   50   -86.35    -84.80    -84.81
                            2   5    -85.01    -83.92    -84.04
                            2   50   -84.78    -83.12    -83.44
21. Analysis of VR-max
• Behaviour of the importance weights for VR-max on Frey Face
• Dependence on the choice of $\alpha$
22. Experiment 2: Bayesian neural network regression
• Regression benchmarks on UCI datasets
• Baselines: VI [Graves, 2011] and PBP [Hernandez-Lobato et al., 2015]
• BB-α=BO results are not directly comparable and are available only for small datasets

Table 2: average test log-likelihood.

  Dataset    VI            PBP           BB-α=BO*      VR-0.5        VR-0.0        VR-max
  Boston     -2.903±0.071  -2.574±0.089  -2.549±0.019  -2.457±0.066  -2.468±0.071  -2.469±0.072
  Concrete   -3.391±0.017  -3.161±0.019  -3.104±0.015  -3.094±0.016  -3.076±0.018  -3.092±0.018
  Energy     -2.391±0.029  -2.042±0.019  -0.945±0.012  -1.401±0.029  -1.418±0.020  -1.389±0.018
  Wine       -0.980±0.013  -0.968±0.014  -0.949±0.009  -0.948±0.011  -0.952±0.012  -0.949±0.012
  Yacht      -3.439±0.163  -1.634±0.016  -1.102±0.039  -1.816±0.011  -1.829±0.014  -1.817±0.013
  Protein    -2.992±0.006  -2.973±0.003  NA            -2.923±0.006  -2.911±0.005  -2.938±0.005
  Year       -3.622        -3.603        NA            -3.545        -3.550        -3.542

Table 3: average test error.

  Dataset    VI            PBP           BB-α=BO*      VR-0.5        VR-0.0        VR-max
  Boston     4.320±0.291   3.104±0.180   3.160±0.109   2.853±0.154   2.852±0.169   2.837±0.181
  Concrete   7.128±0.123   5.667±0.093   5.374±0.074   5.343±0.102   5.237±0.114   5.280±0.104
  Energy     2.646±0.081   1.804±0.048   0.600±0.018   0.807±0.059   0.883±0.050   0.791±0.041
  Wine       0.646±0.008   0.635±0.007   0.632±0.005   0.640±0.009   0.638±0.008   0.639±0.009
  Yacht      6.887±0.674   1.015±0.054   0.902±0.051   1.111±0.082   1.239±0.109   1.117±0.085
  Protein    4.842±0.003   4.732±0.013   NA            4.505±0.033   4.436±0.030   4.574±0.023
  Year       9.034         8.879         NA            8.942         9.133         8.949
23. Conclusion
• Variational inference generalised via the Rényi divergence: the VR bound
• One $\alpha$-parameterised framework covers VI/VB, EP, BB-α, VAE, IWAE and VR-max as special cases
• With finite samples, the choice of $\alpha$ (including negative values) trades bound tightness against estimation bias
