# (DL hacks reading group) Variational Inference with Rényi Divergence

1. Variational Inference with Rényi Divergence (D1)
2. arXiv (2/6); Yingzhen Li, Richard E. Turner; University of Cambridge; Li (D3): “Stochastic Expectation Propagation”, NIPS; Rényi; VAE, importance weighted AE [Burda et al., 2015]; Appendix
3. PRML; KL: $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$
4. SVI [Hoffman et al., 2013]; SEP [Li et al., 2015]; black-box variational inference [Ranganath et al., 2014]; black-box alpha (BB-α) [Hernández-Lobato et al., 2015]; importance weighted AE (IWAE) [Burda et al., 2015] (VAE; ICLR 2016)
5. $p(\theta \mid \mathcal{D})$; $q(\theta)$; KL; $p(\mathcal{D})$

2.2 Variational Inference

Next we review the variational inference algorithm [Jordan et al., 1999, Beal, 2003] from an optimisation perspective, using posterior approximation as a running example. Consider observing a dataset of $N$ i.i.d. samples $\mathcal{D} = \{x_n\}_{n=1}^N$ from a probabilistic model $p(x \mid \theta)$ parametrised by a random variable $\theta$ that is drawn from a prior $p_0(\theta)$. Bayesian inference involves computing the posterior distribution of the parameters given the data,

$$p(\theta \mid \mathcal{D}) = \frac{p(\theta, \mathcal{D})}{p(\mathcal{D})} = \frac{p_0(\theta) \prod_{n=1}^N p(x_n \mid \theta)}{p(\mathcal{D})},$$

where $p(\mathcal{D}) = \int p_0(\theta) \prod_{n=1}^N p(x_n \mid \theta)\, d\theta$ is often called marginal likelihood or model evidence. For many powerful models, including Bayesian neural networks, the true posterior is typically intractable. Variational inference introduces an approximation $q(\theta)$ to the true posterior, which is obtained by minimising the KL divergence in some tractable distribution family $\mathcal{Q}$:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(\theta) \,\|\, p(\theta \mid \mathcal{D})]. \quad (8)$$

However the KL divergence in (8) is also intractable, mainly because of the difficult term $p(\mathcal{D})$. Variational inference sidesteps this difficulty by considering an equivalent optimisation problem:

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}_{VI}(q; \mathcal{D}), \quad (9)$$

where the variational lower-bound or evidence lower-bound (ELBO) $\mathcal{L}_{VI}(q; \mathcal{D})$ is defined by

$$\mathcal{L}_{VI}(q; \mathcal{D}) = \log p(\mathcal{D}) - \mathrm{KL}[q(\theta) \,\|\, p(\theta \mid \mathcal{D})] = \mathbb{E}_q\left[\log \frac{p(\theta, \mathcal{D})}{q(\theta)}\right]. \quad (10)$$
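The ELBO identity $\mathcal{L}_{VI} = \log p(\mathcal{D}) - \mathrm{KL}[q \,\|\, p(\theta \mid \mathcal{D})]$ can be checked numerically on a toy model. The sketch below assumes a discrete $\theta$ with three states so that every quantity is an exact sum; the function name `elbo_and_kl` and the numbers in `joint` and `q` are illustrative choices, not taken from the slides or the paper.

```python
import numpy as np

def elbo_and_kl(joint, q):
    """Return (ELBO, KL[q || posterior], log evidence) for a discrete theta.

    joint[i] = p(theta_i, D) and q[i] = q(theta_i); all entries assumed > 0.
    """
    joint = np.asarray(joint, dtype=float)
    q = np.asarray(q, dtype=float)
    evidence = joint.sum()                              # p(D) = sum_theta p(theta, D)
    posterior = joint / evidence                        # p(theta | D)
    elbo = np.sum(q * (np.log(joint) - np.log(q)))      # E_q[log p(theta, D) / q(theta)]
    kl = np.sum(q * (np.log(q) - np.log(posterior)))    # KL[q || p(theta | D)]
    return elbo, kl, np.log(evidence)

# Hypothetical toy numbers: an unnormalised joint and a normalised q.
elbo, kl, log_evidence = elbo_and_kl(joint=[0.10, 0.25, 0.05], q=[0.3, 0.5, 0.2])
print(elbo, kl, log_evidence)
```

Because the KL term is non-negative, the ELBO never exceeds the log evidence, and the gap closes only when `q` equals the posterior.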
6. VAE [Kingma et al., 2014]; $h$

Variational Auto-encoder with Rényi Divergence

The variational auto-encoder (VAE) [Kingma and Welling, 2014, Rezende et al., 2014] is a recently proposed (deep) generative model that parametrizes the variational approximation with a recognition network. The generative model is specified as a hierarchical latent variable model:

$$p(x) = \sum_{h^{(1)} \ldots h^{(L)}} p(h^{(L)})\, p(h^{(L-1)} \mid h^{(L)}) \cdots p(x \mid h^{(1)}). \quad (14)$$

Here we drop the parameters $\theta$ but keep in mind that they will be learned using approximate maximum likelihood. However for these models the exact computation of $\log p(x)$ requires marginalisation of all hidden variables and is thus often intractable. Variational expectation-maximisation (EM) methods come to the rescue by approximating

$$\log p(x) \approx \mathcal{L}_{VI}(q; x) = \mathbb{E}_{q(h \mid x)}\left[\log \frac{p(x, h)}{q(h \mid x)}\right], \quad (15)$$

where $h$ collects all the hidden variables $h^{(1)}, \ldots, h^{(L)}$ and the approximate posterior $q(h \mid x)$ is defined as

$$q(h \mid x) = q(h^{(1)} \mid x)\, q(h^{(2)} \mid h^{(1)}) \cdots q(h^{(L)} \mid h^{(L-1)}). \quad (16)$$

In variational EM, optimisation for $q$ and $p$ are alternated to guarantee convergence. However the core idea of VAE is to jointly optimise $p$ and $q$, which instead has no guarantee of increasing the MLE objective function in each iteration. Indeed, the joint method is biased [Turner and Sahani, 2011]. This paper explores the possibility that alternative surrogate functions might return estimates that are tighter lower bounds. So the VR bound is considered in this context:

$$\mathcal{L}_\alpha(q; x) = \frac{1}{1 - \alpha} \log \mathbb{E}_{q(h \mid x)}\left[\left(\frac{p(x, h)}{q(h \mid x)}\right)^{1 - \alpha}\right]. \quad (17)$$

$$p(x \mid \theta) = \sum_{h^1, \ldots, h^L} p(h^L \mid \theta)\, p(h^{L-1} \mid h^L, \theta) \cdots p(x \mid h^1, \theta).$$

Here, $\theta$ is a vector of parameters of the variational autoencoder, and $h = \{h^1, \ldots, h^L\}$ denotes the stochastic hidden units, or latent variables. The dependence on $\theta$ is often suppressed for clarity. For convenience, we define $h^0 = x$. Each of the terms $p(h^\ell \mid h^{\ell+1})$ may denote a complicated nonlinear relationship, for instance one computed by a multilayer neural network. However, it is assumed that sampling and probability evaluation are tractable for each $p(h^\ell \mid h^{\ell+1})$. Note that $L$ denotes the number of stochastic hidden layers; the deterministic layers are not shown explicitly here. We assume the recognition model $q(h \mid x)$ is defined in terms of an analogous factorization:

$$q(h \mid x) = q(h^1 \mid x)\, q(h^2 \mid h^1) \cdots q(h^L \mid h^{L-1}),$$

where sampling and probability evaluation are tractable for each of the terms in the product.

In this work, we assume the same families of conditional probability distributions as Kingma & Welling (2014). In particular, the prior $p(h^L)$ is fixed to be a zero-mean, unit-variance Gaussian. In general, each of the conditional distributions $p(h^\ell \mid h^{\ell+1})$ and $q(h^\ell \mid h^{\ell-1})$ is a Gaussian with diagonal covariance, where the mean and covariance parameters are computed by a deterministic feed-forward neural network. For real-valued observations, $p(x \mid h^1)$ is also defined to be such a Gaussian; for binary observations, it is defined to be a Bernoulli distribution whose mean parameters are computed by a neural network.

The VAE is trained to maximize a variational lower bound on the log-likelihood, as derived from Jensen's Inequality:

$$\log p(x) = \log \mathbb{E}_{q(h \mid x)}\left[\frac{p(x, h)}{q(h \mid x)}\right] \ge \mathbb{E}_{q(h \mid x)}\left[\log \frac{p(x, h)}{q(h \mid x)}\right] = \mathcal{L}(x).$$

Since $\mathcal{L}(x) = \log p(x) - D_{KL}(q(h \mid x) \,\|\, p(h \mid x))$, the training procedure is forced to trade off …
8. VAE; KL

In this section we introduce a practical estimator of the lower bound and its derivatives w.r.t. the parameters. We assume an approximate posterior in the form $q_\phi(z \mid x)$, but please note that the technique can be applied to the case $q_\phi(z)$, i.e. where we do not condition on $x$, as well. The fully variational Bayesian method for inferring a posterior over the parameters is given in the appendix.

Under certain mild conditions outlined in section 2.4 for a chosen approximate posterior $q_\phi(z \mid x)$ we can reparameterize the random variable $\tilde{z} \sim q_\phi(z \mid x)$ using a differentiable transformation $g_\phi(\epsilon, x)$ of an (auxiliary) noise variable $\epsilon$:

$$\tilde{z} = g_\phi(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon) \quad (4)$$

See section 2.4 for general strategies for choosing such an appropriate distribution $p(\epsilon)$ and function $g_\phi(\epsilon, x)$. We can now form Monte Carlo estimates of expectations of some function $f(z)$ w.r.t. $q_\phi(z \mid x)$ as follows:

$$\mathbb{E}_{q_\phi(z \mid x^{(i)})}[f(z)] = \mathbb{E}_{p(\epsilon)}\left[f(g_\phi(\epsilon, x^{(i)}))\right] \simeq \frac{1}{L} \sum_{l=1}^{L} f(g_\phi(\epsilon^{(l)}, x^{(i)})) \quad \text{where} \quad \epsilon^{(l)} \sim p(\epsilon) \quad (5)$$

We apply this technique to the variational lower bound (eq. (2)), yielding our generic Stochastic Gradient Variational Bayes (SGVB) estimator $\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)})$:

$$\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) = \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)} \mid x^{(i)}) \quad \text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) \ \text{and} \ \epsilon^{(l)} \sim p(\epsilon) \quad (6)$$

Algorithm 1 (fragment): $g \leftarrow \nabla_{\theta, \phi} \tilde{\mathcal{L}}^M(\theta, \phi; X^M, \epsilon)$ (gradients of minibatch estimator (8)); $\theta, \phi \leftarrow$ update parameters using gradients $g$ (e.g. SGD or Adagrad [DHS10]); until convergence of parameters $(\theta, \phi)$; return $\theta, \phi$.

Often, the KL-divergence $D_{KL}(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z))$ of eq. (3) can be integrated analytically (see appendix B), such that only the expected reconstruction error $\mathbb{E}_{q_\phi(z \mid x^{(i)})}\left[\log p_\theta(x^{(i)} \mid z)\right]$ requires estimation by sampling. The KL-divergence term can then be interpreted as regularizing $\phi$, encouraging the approximate posterior to be close to the prior $p_\theta(z)$. This yields a second version of the SGVB estimator $\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)})$, corresponding to eq. (3), which typically has less variance than the generic estimator:

$$\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)}) \quad \text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) \ \text{and} \ \epsilon^{(l)} \sim p(\epsilon) \quad (7)$$

Given multiple datapoints from a dataset $X$ with $N$ datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches:

$$\mathcal{L}(\theta, \phi; X) \simeq \tilde{\mathcal{L}}^M(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)}) \quad (8)$$

where the minibatch $X^M = \{x^{(i)}\}_{i=1}^M$ is a randomly drawn sample of $M$ datapoints from the full dataset $X$ with $N$ datapoints. In our experiments we found that the number of samples $L$ per datapoint can be set to 1 as long as the minibatch size $M$ was large enough, e.g. $M = 100$. Derivatives $\nabla_{\theta, \phi} \tilde{\mathcal{L}}(\theta; X^M)$ can be taken, and the resulting gradients can be used in conjunction with stochastic optimization methods such as SGD or Adagrad [DHS10]. See algorithm 1 for a basic approach to compute the stochastic gradients.

A connection with auto-encoders becomes clear when looking at the objective function given at eq. (7). The first term (the KL divergence of the approximate posterior from the prior) acts as a regularizer, while the second term is an expected negative reconstruction error. The function $g_\phi(\cdot)$ is chosen such that it maps a datapoint $x^{(i)}$ and a random noise vector $\epsilon^{(l)}$ to a sample from the …
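The reparameterised Monte Carlo estimate in eq. (5) can be illustrated with a one-dimensional Gaussian. This is a minimal sketch under stated assumptions: $q$ is taken to be $\mathcal{N}(\mu, \sigma^2)$, the transformation is $g(\epsilon) = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, and the test function $f(z) = z^2$ is chosen only because its expectation $\mu^2 + \sigma^2$ is known in closed form. The helper name `reparam_expectation` is hypothetical.

```python
import numpy as np

def reparam_expectation(f, mu, sigma, num_samples, rng):
    """Estimate E_{z ~ N(mu, sigma^2)}[f(z)] using z = g(eps) = mu + sigma * eps."""
    eps = rng.standard_normal(num_samples)    # eps ~ p(eps) = N(0, 1)
    z = mu + sigma * eps                      # differentiable transformation g(eps)
    return float(np.mean(f(z)))               # (1/L) * sum_l f(g(eps_l))

rng = np.random.default_rng(0)
estimate = reparam_expectation(lambda z: z ** 2, mu=1.0, sigma=0.5,
                               num_samples=200_000, rng=rng)
exact = 1.0 ** 2 + 0.5 ** 2                   # E[z^2] = mu^2 + sigma^2
print(estimate, exact)
```

Since the randomness enters only through `eps`, the sample `z` is a deterministic function of `mu` and `sigma`, which is what lets gradients flow to the variational parameters in an automatic-differentiation framework.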
9. Importance weighted AE (IWAE); VAE; $k = 1$: VAE; $k$

(Burda et al., under review as a conference paper at ICLR 2016)

[…] distribution must be approximately factorial and predictable with a feed-forward neural network. The VAE criterion may be too strict; a recognition network which places only a small […] ([…]0%) of its samples in the region of high posterior probability may still be sufficient for performing accurate inference. If we lower our standards in this way, this may give us additional flexibility to train a generative network whose posterior distributions do not fit the VAE assumptions. This is the motivation behind our proposed algorithm, the Importance Weighted Autoencoder (IWAE).

The IWAE uses the same architecture as the VAE, with both a generative network and a recognition network. The difference is that it is trained to maximize a different lower bound on $\log p(x)$. In particular, we use the following lower bound, corresponding to the $k$-sample importance weighting estimate of the log-likelihood:

$$\mathcal{L}_k(x) = \mathbb{E}_{h_1, \ldots, h_k \sim q(h \mid x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i \mid x)}\right].$$

Here, $h_1, \ldots, h_k$ are sampled independently from the recognition model. The term inside the sum corresponds to the unnormalized importance weights for the joint distribution, which we will denote as $w_i = p(x, h_i) / q(h_i \mid x)$.

This is a lower bound on the marginal log-likelihood, as follows from Jensen's Inequality and the fact that the average importance weights are an unbiased estimator of $p(x)$:

$$\mathcal{L}_k = \mathbb{E}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_i\right] \le \log \mathbb{E}\left[\frac{1}{k} \sum_{i=1}^{k} w_i\right] = \log p(x).$$

Theorem 1. For all $k$, the lower bounds satisfy $\log p(x) \ge \mathcal{L}_{k+1} \ge \mathcal{L}_k$. Moreover, if $p(h, x) / q(h \mid x)$ is bounded, then $\mathcal{L}_k$ approaches $\log p(x)$ as $k$ goes to infinity. See Appendix A.
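The tightening of $\mathcal{L}_k$ with $k$ can be seen on a model where $\log p(x)$ is available in closed form. The sketch below is an illustration under assumptions that are not from the slides: a one-dimensional model $h \sim \mathcal{N}(0, 1)$, $x \mid h \sim \mathcal{N}(h, 1)$ (so $x \sim \mathcal{N}(0, 2)$), and a deliberately crude recognition model $q(h \mid x) = \mathcal{N}(0, 1)$. The names `log_normal_pdf` and `iwae_bound` are hypothetical helpers.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def iwae_bound(x, k, num_repeats, rng):
    """Monte Carlo estimate of L_k(x), averaged over num_repeats draws of (h_1..h_k)."""
    h = rng.standard_normal((num_repeats, k))                            # h_i ~ q(h|x) = N(0, 1)
    log_joint = log_normal_pdf(h, 0.0, 1.0) + log_normal_pdf(x, h, 1.0)  # log p(x, h_i)
    log_q = log_normal_pdf(h, 0.0, 1.0)                                  # log q(h_i | x)
    log_w = log_joint - log_q                                            # log importance weights
    m = log_w.max(axis=1, keepdims=True)                                 # stabilise log-mean-exp
    log_mean_w = m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))
    return float(np.mean(log_mean_w))

rng = np.random.default_rng(0)
x = 2.0
log_px = float(log_normal_pdf(x, 0.0, 2.0))   # exact: x ~ N(0, 2) after integrating out h
bounds = {k: iwae_bound(x, k, num_repeats=20_000, rng=rng) for k in (1, 5, 50)}
print(bounds, log_px)
```

With $k = 1$ the estimate is the ordinary VAE lower bound; larger $k$ moves it toward `log_px`, consistent with Theorem 1.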
10. Rényi α-divergence; $\theta$, $p$, $q$; $\alpha > 0$, $\alpha \neq 1$; $\alpha \to 1$: KL

[…] two distributions $p$ and $q$ on a random variable $\theta \in \Theta$:

$$D_\alpha[p \,\|\, q] = \frac{1}{\alpha - 1} \log \int p(\theta)^\alpha q(\theta)^{1 - \alpha}\, d\theta. \quad (1)$$

For $\alpha > 1$ the definition is valid when it is finite, and for discrete random variables the integration is replaced by summation. When $\alpha \to 1$ it recovers the Kullback-Leibler (KL) divergence that plays a crucial role in machine learning and information theory:

$$D_1[p \,\|\, q] = \lim_{\alpha \to 1} D_\alpha[p \,\|\, q] = \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, d\theta = \mathrm{KL}[p \,\|\, q].$$

Similar to $\alpha = 1$, for values $\alpha = 0, +\infty$ the Rényi divergence is defined by continuity in $\alpha$:

$$D_0[p \,\|\, q] = -\log \int_{p(\theta) > 0} q(\theta)\, d\theta, \qquad D_{+\infty}[p \,\|\, q] = \log \max_{\theta \in \Theta} \frac{p(\theta)}{q(\theta)}.$$

Another special case is $\alpha = \frac{1}{2}$, where the corresponding Rényi divergence is a function of the square Hellinger distance $\mathrm{Hel}^2[p \,\|\, q] = \frac{1}{2} \int \left(\sqrt{p(\theta)} - \sqrt{q(\theta)}\right)^2 d\theta$:

$$D_{1/2}[p \,\|\, q] = -2 \log\left(1 - \mathrm{Hel}^2[p \,\|\, q]\right).$$

In [van Erven and Harremoës, 2014] the definition (1) is also extended to negative $\alpha$ values, […] it is non-positive and is thus no longer a valid divergence measure. The proposed m…
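The special cases of the Rényi divergence listed above can be checked numerically for discrete distributions, where the integral becomes a sum. This is a small sketch that assumes strictly positive probability vectors; the function name `renyi_divergence` and the example vectors `p` and `q` are illustrative and not taken from the paper.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha[p || q] for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))            # KL[p || q], the alpha -> 1 limit
    return float(np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

kl = renyi_divergence(p, q, 1.0)
near_kl = renyi_divergence(p, q, 1.001)                    # close to, but not at, alpha = 1
d_half = renyi_divergence(p, q, 0.5)
d_two = renyi_divergence(p, q, 2.0)
hellinger_sq = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
print(kl, near_kl, d_half, -2.0 * np.log(1.0 - hellinger_sq))
```

The printed values show the divergence at $\alpha = 1.001$ sitting next to the KL divergence and the $\alpha = \frac{1}{2}$ value agreeing with the Hellinger expression.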
11. Rényi; $p(\theta \mid \mathcal{D})$, $q(\theta)$, KL; Rényi α; $\alpha \neq 1$; VR

$$\mathcal{L}_{VI}(q; \mathcal{D}) = \log p(\mathcal{D}) - \mathrm{KL}[q(\theta) \,\|\, p(\theta \mid \mathcal{D})] = \mathbb{E}_q\left[\log \frac{p(\theta, \mathcal{D})}{q(\theta)}\right].$$

Variational Rényi Bound

Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Perhaps variational free-energy approaches can be generalised to the Rényi case? Consider approximating the true posterior $p(\theta \mid \mathcal{D})$ by minimizing Rényi's α-divergence (for some selected $\alpha \ge 0$):

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta) \,\|\, p(\theta \mid \mathcal{D})].$$

We verify the alternative optimization problem

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \log p(\mathcal{D}) - D_\alpha[q(\theta) \,\|\, p(\theta \mid \mathcal{D})].$$

When $\alpha \neq 1$, the objective can be rewritten as

$$\log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \int q(\theta)^\alpha p(\theta \mid \mathcal{D})^{1 - \alpha}\, d\theta = \log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \mathbb{E}_q\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)\, p(\mathcal{D})}\right)^{1 - \alpha}\right] = \frac{1}{1 - \alpha} \log \mathbb{E}_q\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)}\right)^{1 - \alpha}\right] := \mathcal{L}_\alpha(q; \mathcal{D}).$$

We name this new objective the variational Rényi bound (VR). Importantly the following theorem …
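As a worked check on the VR bound, the two boundary cases $\alpha \to 1$ and $\alpha = 0$ can be derived directly from its definition. The derivation below is a sketch that writes $w(\theta) = p(\theta, \mathcal{D}) / q(\theta)$ (a shorthand introduced here for convenience) and assumes the expectations are finite, that differentiation and expectation can be exchanged, and that $q$ covers the support of the posterior.

```latex
% Shorthand (assumption of this sketch): w(\theta) = p(\theta, D) / q(\theta).
\begin{align*}
\mathcal{L}_\alpha(q; \mathcal{D})
  &= \frac{\log \mathbb{E}_q\!\left[ w^{1-\alpha} \right]}{1 - \alpha}
  && \text{(numerator and denominator both vanish as } \alpha \to 1\text{)} \\
\lim_{\alpha \to 1} \mathcal{L}_\alpha(q; \mathcal{D})
  &= \lim_{\alpha \to 1}
     \frac{ -\,\mathbb{E}_q\!\left[ w^{1-\alpha} \log w \right] / \mathbb{E}_q\!\left[ w^{1-\alpha} \right] }{ -1 }
  && \text{(L'H\^opital's rule in } \alpha\text{)} \\
  &= \mathbb{E}_q\!\left[ \log w \right]
   = \mathbb{E}_q\!\left[ \log \frac{p(\theta, \mathcal{D})}{q(\theta)} \right]
   = \mathcal{L}_{VI}(q; \mathcal{D}), \\
\mathcal{L}_0(q; \mathcal{D})
  &= \log \mathbb{E}_q\!\left[ \frac{p(\theta, \mathcal{D})}{q(\theta)} \right]
   = \log \int p(\theta, \mathcal{D}) \, d\theta
   = \log p(\mathcal{D}).
\end{align*}
```

So the ELBO sits at $\alpha \to 1$ and the exact log evidence at $\alpha = 0$, with intermediate $\alpha$ interpolating between them.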
12. VR

VR Bound Optimisation Framework

[…] energy methods sidestep intractabilities in a class of intractable models. Recent work […] approximations based on Monte Carlo to expand the set of models that can be handled. […] be deployed on the same model class as Monte Carlo variational methods, but which […] scope if Monte Carlo methods are not resorted to. This section develops a scalable optimisation method for the VR bound by extending the recent advances of traditional VI. Black-box methods are also discussed to enable its application to arbitrary finite $\alpha$ settings.

Monte Carlo Estimation of the VR Bound

[…] a simple Monte Carlo method that uses finite samples $\theta_k \sim q(\theta)$, $k = 1, \ldots, K$ to approximate $\mathcal{L}_\alpha \approx \hat{\mathcal{L}}_{\alpha, K}$:

$$\hat{\mathcal{L}}_{\alpha, K}(q; \mathcal{D}) = \frac{1}{1 - \alpha} \log \frac{1}{K} \sum_{k=1}^{K} \left[\left(\frac{p(\theta_k, \mathcal{D})}{q(\theta_k)}\right)^{1 - \alpha}\right].$$

Unlike traditional VI, here the Monte Carlo estimate is biased, since the expectation over $q(\theta)$ is inside the logarithm. However we can bound the bias by the following theorems proved in the supplementary.

Theorem 2. $\mathbb{E}_{\{\theta_k\}_{k=1}^K}[\hat{\mathcal{L}}_{\alpha, K}(q; \mathcal{D})]$ as a function of $\alpha$ and $K$ is: 1) non-decreasing in $K$ for fixed $\alpha$ […] limiting result is $\mathcal{L}_\alpha$ for $K \to +\infty$ if $|p/q|$ is bounded; 2) continuous and non-increasing in $\alpha$ […] $\cup\ \{|\mathcal{L}_\alpha| < +\infty\}$.

Figure 2: (a) Sampling approximated VR bounds: an illustration for the bounding properties of sampling approximations to the VR bounds. Here $\alpha_2 < 0 < \alpha_1 < 1$ and $1 < K_1 < K_2 < +\infty$. (b) Simulated values of divergences: the bias of sampling estimate of (negative) alpha divergence. In this example $p$, $q$ are 2-D Gaussian distributions with identity covariance matrix, where the only difference is $\mu_p = [0, 0]$ and $\mu_q = [1, 1]$. Best viewed in colour.

Corollary 1. For $K < +\infty$, there exists $\alpha_K < 0$ such that for all $\alpha \ge \alpha_K$, $\mathbb{E}_{\{\theta_k\}_{k=1}^K}[\hat{\mathcal{L}}_{\alpha, K}(q; \mathcal{D})] \le \log p(\mathcal{D})$. Furthermore $\alpha_K$ is non-decreasing in $K$, with $\lim_{K \to 1} \alpha_K = -\infty$ and $\lim_{K \to +\infty} \alpha_K = 0$.

To better understand the above theorems we plot in Figure 2(a) an illustration of the bounding properties. By definition, the exact VR bound is a lower-bound or upper-bound of the log-likelihood $\log p(\mathcal{D})$ when $\alpha > 0$ or $\alpha < 0$, respectively (red lines). However for $\alpha \le 1$ the sampling approximation $\hat{\mathcal{L}}_{\alpha, K}$ in expectation under-estimates the exact VR bound $\mathcal{L}_\alpha$ (blue dashed lines), where the approximation quality can be improved by using more samples (the blue dashed arrow). Thus for finite samples, negative alpha values ($\alpha_2 < 0$) can be used to improve the accuracy of the approximation (see the red arrow between the two blue dashed lines visualising $\hat{\mathcal{L}}_{\alpha_1, K_1}$ and $\hat{\mathcal{L}}_{\alpha_2, K_1}$, respectively).

We empirically evaluate the theoretical results in Figure 2(b), by computing the exact and Monte …
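The Monte Carlo estimator $\hat{\mathcal{L}}_{\alpha, K}$ is a one-line computation once the log importance weights are available. The sketch below assumes the caller supplies `log_w[k] = log p(theta_k, D) - log q(theta_k)` for samples drawn from $q$; the function name `vr_bound_estimate` and the example weights are hypothetical, and the $\alpha = 1$ branch uses the ELBO-style average as the limiting case.

```python
import numpy as np

def vr_bound_estimate(log_w, alpha):
    """K-sample estimate of the VR bound from log importance weights.

    log_w[k] = log p(theta_k, D) - log q(theta_k), with theta_k ~ q.
    """
    log_w = np.asarray(log_w, dtype=float)
    if np.isclose(alpha, 1.0):
        return float(np.mean(log_w))                 # alpha -> 1 limit: ELBO-style average
    scaled = (1.0 - alpha) * log_w                   # log of w_k^(1 - alpha)
    m = np.max(scaled)                               # stabilise the log-mean-exp
    return float((m + np.log(np.mean(np.exp(scaled - m)))) / (1.0 - alpha))

# Hypothetical log weights for K = 4 samples.
log_w = np.array([-3.0, -2.5, -4.0, -2.0])
estimates = {a: vr_bound_estimate(log_w, a) for a in (-1.0, 0.0, 0.5, 0.999, 1.0, 2.0)}
iwae_style = float(np.log(np.mean(np.exp(log_w))))   # log of the average importance weight
print(estimates, iwae_style)
```

For a fixed set of samples the estimate is non-increasing in $\alpha$, and at $\alpha = 0$ it coincides with the log of the averaged importance weights, which is the form of the IWAE objective.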
13. VR: exact, approx.; $\alpha \le 1$; $\alpha$, $k$

[Figure 2: (a) Sampling approximated VR bounds. (b) Simulated values of divergences.]
14. VR; IWAE; $\alpha \le 1$; $\alpha = 0$: IWAE

[Figure 2: (a) Sampling approximated VR bounds. (b) Simulated values of divergences.]