
(DL Hacks reading group) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Reading date: 2016/05/13



  1. How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks (presenter: D2)
  2. Paper information
     ¤ ICML 2016
     ¤ Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, Ole Winther (Technical University of Denmark)
     ¤ Proposes techniques for training deep VAEs: a Ladder-network-style inference model, a warm-up period, and batch normalization
     ¤ Reports state-of-the-art generative log-likelihoods
  3. Outline
  4. Outline (section divider)
  5. Variational Autoencoder
     ¤ Variational Autoencoder (VAE) [Kingma+ 13][Rezende+ 14]
     ¤ Generative model: z ~ p_θ(z), x ~ p_θ(x|z)
     ¤ Inference (recognition) model: q_φ(z|x)
  6. Model specification
     ¤ The generative model p_θ is specified as:
       p_θ(x|z_1) = N(x | μ_θ(z_1), σ²_θ(z_1))   (1)
       P_θ(x|z_1) = ℬ(x | μ_θ(z_1))              (2)
       for continuous-valued (Gaussian N) or binary-valued (Bernoulli ℬ) data, respectively
     ¤ The latent variables z are split into L layers z_i, i = 1…L:
       p_θ(z_i|z_{i+1}) = N(z_i | μ_{θ,i}(z_{i+1}), σ²_{θ,i}(z_{i+1}))   (3)
       p_θ(z_L) = N(z_L | 0, I)                                          (4)
     ¤ The hierarchical specification allows the lower layers of latent variables to be highly correlated while maintaining the computational efficiency of fully factorized models
     ¤ Each layer of the inference model q_φ(z|x) is a fully factorized Gaussian:
       q_φ(z_1|x) = N(z_1 | μ_{φ,1}(x), σ²_{φ,1}(x))                               (5)
       q_φ(z_i|z_{i−1}) = N(z_i | μ_{φ,i}(z_{i−1}), σ²_{φ,i}(z_{i−1})),  i = 2…L   (6)
     ¤ Figure 2 (from the paper): flow of information in the inference and generative models of a) the probabilistic ladder network and b) the VAE. The probabilistic ladder network directly integrates (+ in the figure, Eq. (21)) bottom-up and top-down information in the inference model; in the VAE, top-down information enters only indirectly through the conditional priors of the generative model
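Ancestral sampling through the hierarchy of Eqs. (3)–(4) can be sketched in a few lines. This is a minimal pure-Python illustration, not code from the paper: the layers are one-dimensional, and the `mu_fn`/`sigma2_fn` callables stand in for the MLP parameterization introduced on the next slide.

```python
import math
import random

def sample_generative(layers, rng):
    """Ancestral sampling through Eqs. (3)-(4): draw z_L ~ N(0, 1), then
    z_i ~ N(mu_i(z_{i+1}), sigma2_i(z_{i+1})) down to z_1; finally x would
    be drawn from p(x|z_1) (Eq. (1) or (2)).  `layers` is a list of
    (mu_fn, sigma2_fn) pairs ordered top-down; each function stands in for
    the MLP parameterization.  One-dimensional latents for clarity."""
    z = rng.gauss(0.0, 1.0)              # z_L ~ N(0, 1)
    for mu_fn, sigma2_fn in layers:      # top-down conditional draws
        z = rng.gauss(mu_fn(z), math.sqrt(sigma2_fn(z)))
    return z
```

With an empty `layers` list this returns the top-level prior sample itself, which makes the conditioning easy to check with a fixed seed.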
  7. Parameterization of μ(·) and σ²(·)
     ¤ Gaussian N(x|μ, σ²) for continuous data, Bernoulli ℬ(x|μ) for binary data
     ¤ In both the generative and inference models, the functions μ(·) and σ²(·) are implemented as:
       d(y) = MLP(y)                         (7)
       μ(y) = Linear(d(y))                   (8)
       σ²(y) = Softplus(Linear(d(y)))        (9)
     ¤ MLP is a two-layer multilayer perceptron, Linear a single linear layer, and Softplus applies log(1 + exp(·)) to each component of its argument
     ¤ Each MLP(·) or Linear(·) is a new mapping with its own parameters; the deterministic variable d marks that the MLP part is shared between μ and σ², whereas the last Linear layer is not shared
     ¤ (Paper abstract, pasted on this slide:) previous work was restricted to shallow models with one or two layers of fully factorized stochastic latent variables; the paper proposes three advances that for the first time allow training deep models of up to five stochastic layers: (1) a structure similar to the Ladder network as the inference model, (2) a warm-up period keeping stochastic units active in early training, and (3) batch normalization, yielding state-of-the-art log-likelihoods on several benchmark datasets
     ¤ Figure 1 (from the paper): inference (encoder/recognition) and generative (decoder) models; a) VAE inference model, b) probabilistic ladder inference model
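Eqs. (7)–(9) can be sketched in pure Python. This is an illustrative toy, not the paper's implementation: the shared trunk `mlp` is passed in as a stand-in for the two-layer perceptron, and the scalar weight vectors are hypothetical; the point is that μ and σ² share d(y) but each has its own final Linear layer, and that Softplus keeps σ² strictly positive.

```python
import math

def softplus(x):
    # log(1 + exp(x)), the positivity-enforcing non-linearity of Eq. (9)
    return math.log1p(math.exp(x))

def linear(w, b, y):
    # single linear layer: w . y + b (scalar output for clarity)
    return sum(wi * yi for wi, yi in zip(w, y)) + b

def gaussian_params(y, mlp, w_mu, b_mu, w_var, b_var):
    """Shared-trunk parameterization of Eqs. (7)-(9): d = MLP(y) is shared
    between mu and sigma^2, while each gets its own unshared final Linear."""
    d = mlp(y)                                 # d(y) = MLP(y)
    mu = linear(w_mu, b_mu, d)                 # mu(y) = Linear(d(y))
    var = softplus(linear(w_var, b_var, d))    # sigma^2(y) = Softplus(Linear(d(y)))
    return mu, var
```

Whatever the linear layer outputs, the Softplus guarantees a valid (positive) variance.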
  8. Training: the variational lower bound
     ¤ The bound on the log-likelihood is used as the training criterion:
       log p_θ(x) ≥ E_{q_φ(z|x)}[log (p_θ(x, z) / q_φ(z|x))] = ℒ(θ, φ; x)
     ¤ Gradients are taken through the Gaussian latents with the reparameterization trick and estimated by Monte Carlo:
       ∇_{θ,φ} ℒ(θ, φ; x) = ∇_{θ,φ} E_{q_φ(z|x)}[log (p_θ(x, z) / q_φ(z|x))]
                          = E_{N(ε|0,I)}[∇_{θ,φ} log (p_θ(x, z) / q_φ(z|x))]
                          ≈ (1/K) Σ_{k=1}^{K} ∇_{θ,φ} log (p_θ(x, z^{(k)}) / q_φ(z^{(k)}|x))
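The reparameterized Monte Carlo estimate above can be checked on a toy case with a known answer. This sketch (not from the slides) estimates the KL part of the bound for a one-dimensional Gaussian q = N(μ, σ²) against the prior p = N(0, 1), where the analytic value is 0.5(σ² + μ² − 1 − ln σ²).

```python
import math
import random

def reparam_sample(mu, sigma, eps):
    # z = mu + sigma * eps with eps ~ N(0, 1): gradients could flow through mu, sigma
    return mu + sigma * eps

def log_normal(z, mu, sigma):
    # log density of N(z | mu, sigma^2)
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2

def mc_kl(mu, sigma, n, rng):
    """Monte Carlo estimate of KL(q || p) for q = N(mu, sigma^2), p = N(0, 1),
    using reparameterized samples, mirroring the (1/K)-sum on the slide."""
    total = 0.0
    for _ in range(n):
        z = reparam_sample(mu, sigma, rng.gauss(0.0, 1.0))
        total += log_normal(z, mu, sigma) - log_normal(z, 0.0, 1.0)
    return total / n
```

With a few thousand samples the estimate lands close to the closed-form KL, which is exactly the averaging the slide's last line describes.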
  9. Two equivalent forms of the bound
     ¤ Form A (joint form):
       ℒ(θ, φ; x) = E_{q_φ(z|x)}[log (p_θ(x, z) / q_φ(z|x))]
     ¤ Form B (KL form): an explicit KL regularizer plus a reconstruction term:
       ℒ(θ, φ; x) = −KL[q_φ(z|x) ∥ p_θ(z)] + E_{q_φ(z|x)}[log p_θ(x|z)]
     ¤ IWAE tightens form A (slide 15)
  10. Hierarchical (multi-layer) form
      ¤ With L layers the generative and inference models factorize as:
        p_θ(x, z) = p_θ(z_L) p_θ(z_{L−1}|z_L) ⋯ p_θ(x|z_1)
        q_φ(z|x) = q_φ(z_1|x) q_φ(z_2|z_1) ⋯ q_φ(z_L|z_{L−1})
      ¤ Form A becomes:
        E_{q_φ(z|x)}[log (p_θ(z_L) p_θ(z_{L−1}|z_L) ⋯ p_θ(x|z_1)) / (q_φ(z_1|x) q_φ(z_2|z_1) ⋯ q_φ(z_L|z_{L−1}))]
      ¤ Form B becomes:
        −KL[q_φ(z|x) ∥ p_θ(z_L)] + E_{q_φ(z|x)}[log p_θ(z_{L−1}|z_L) ⋯ p_θ(x|z_1)]
  11. Outline (section divider)
  12. Conditional and extended VAEs
      ¤ CVAE [Kingma+ 2014]: semi-supervised VAE conditioning on labels y
      ¤ Variational Fair Autoencoder [Louizos+ 15]: learns a representation p(z|x) purged of a sensitive variable s, using an MMD penalty
      ¤ (Excerpt from Kingma+ 2014, pasted on this slide:) with the label y observed, the bound is a simple extension of the VAE bound:
        log p_θ(x, y) ≥ E_{q_φ(z|x,y)}[log p_θ(x|y, z) + log p_θ(y) + log p(z) − log q_φ(z|x, y)] = −ℒ(x, y)   (6)
        with y unobserved it is treated as a latent variable and marginalized:
        log p_θ(x) ≥ Σ_y q_φ(y|x)(−ℒ(x, y)) + H(q_φ(y|x)) = −𝒰(x)   (7)
        giving the dataset objective J = Σ_{(x,y)∼p̃_l} ℒ(x, y) + Σ_{x∼p̃_u} 𝒰(x)   (8)
      ¤ q_φ(y|x) has the form of a discriminative classifier and is used at test time for prediction; in (8) it contributes only through the unlabelled term, which is undesirable if it is to be used as a classifier
  13. Auxiliary Deep Generative Models [Maaløe+ 16]
      ¤ Extends the variational distribution with auxiliary variables a [Agakov+ 2004]:
        q(a, z|x) = q(z|a, x) q(a|x)
        so that the marginal q(z|x) = ∫ q(z|a, x) q(a|x) da is a general non-Gaussian distribution that can fit more complicated posteriors p(z|x)
      ¤ The generative model p(x|z) is kept unchanged by requiring that p(x, z, a) = p(a|x, z) p(x, z), i.e. marginalizing over a gives back the original p(x, z)
      ¤ This hierarchical specification lets the latent variables be correlated through a while maintaining the computational efficiency of fully factorized models
      ¤ Achieves state-of-the-art semi-supervised results
      ¤ Figure 1 (from Maaløe+ 16): probabilistic graphical model of the ADGM for semi-supervised learning; the incoming joint connections to each variable are deep neural networks with parameters θ and φ
  14. Outline (section divider)
  15. Importance Weighted AE
      ¤ Importance Weighted Autoencoder (IWAE) [Burda+ 15]: a tighter bound from K importance-weighted samples:
        log p_θ(x) ≥ E_{z^{(1)},…,z^{(K)} ∼ q_φ(z|x)}[log (1/K) Σ_{k=1}^{K} p_θ(x, z^{(k)}) / q_φ(z^{(k)}|x)] ≥ ℒ(θ, φ; x)
      ¤ The bound tightens as k grows
      ¤ Generalized by the variational Rényi bound [Li+ 16]: approximating the posterior by minimizing Rényi's α-divergence, q(θ) = argmin_{q∈Q} D_α[q(θ) ∥ p(θ|D)], is equivalent to maximizing log p(D) − D_α[q(θ) ∥ p(θ|D)], which for α ≠ 1 can be rewritten as
        ℒ_α(q; D) = (1/(1−α)) log E_q[(p(θ, D)/q(θ))^{1−α}]
        ℒ_α(q; D) is continuous and non-increasing in α (the family includes the KL divergence, and hence the standard variational free energy)
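The IWAE bound above is a log-mean-exp of the log importance weights, which by Jensen's inequality is never below the plain Monte Carlo ELBO computed from the same samples. A minimal numerically stable sketch (not from the paper):

```python
import math

def log_mean_exp(logs):
    # numerically stable log( (1/K) * sum_k exp(logs[k]) )
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs) / len(logs))

def iwae_bound(log_weights):
    """K-sample importance-weighted bound, with
    log w_k = log p(x, z^(k)) - log q(z^(k) | x).
    For K = 1 this reduces to the single-sample ELBO estimate."""
    return log_mean_exp(log_weights)

def elbo_estimate(log_weights):
    # plain Monte Carlo ELBO: average of the same log-weights
    return sum(log_weights) / len(log_weights)
```

Comparing the two functions on any set of log-weights shows the ordering ELBO ≤ IWAE bound that the slide's chain of inequalities states.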
  16. Normalizing flows and variational Gaussian processes
      ¤ Normalizing flows [Rezende+ 15]: build a flexible posterior by passing z_0 ∼ q_0 through a chain of K invertible maps:
        z_K = f_K ∘ ⋯ ∘ f_2 ∘ f_1(z_0)   (6)
        ln q_K(z_K) = ln q_0(z_0) − Σ_{k=1}^{K} ln |det ∂f_k/∂z_{k−1}|   (7)
      ¤ General transformations involve matrix inverses that can be numerically unstable, so flows are needed whose determinant is cheap. Planar flows allow O(D) computation of the log-det-Jacobian via the matrix determinant lemma:
        f(z) = z + u h(wᵀz + b)   (10)
        ψ(z) = h′(wᵀz + b) w   (11)
        |det ∂f/∂z| = |det(I + u ψ(z)ᵀ)| = |1 + uᵀψ(z)|   (12)
        ln q_K(z_K) = ln q_0(z_0) − Σ_{k=1}^{K} ln |1 + u_kᵀψ_k(z_k)|   (13)
      ¤ Each planar map contracts/expands the density perpendicular to the hyperplane wᵀz + b = 0; an alternative radial family f(z) = z + β h(α, r)(z − z_0) modifies the density around a reference point z_0
      ¤ Figure 1 (from Rezende+ 15): effect of planar and radial flows on unit-Gaussian and uniform initial densities for K = 1, 2, 10; Figure 2: the inference network maps observations to flow parameters, and the generative model receives the posterior samples during training
      ¤ Variational Gaussian Process [Tran+ 15]: the VGP generates latent variables z by evaluating random non-linear mappings of latent inputs ξ and drawing mean-field samples parameterized by the mapping
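One planar-flow step and its log-det-Jacobian from Eqs. (10)–(12) fit in a few lines. A pure-Python sketch with h = tanh (so h′ = 1 − tanh²), operating on plain lists rather than a tensor library:

```python
import math

def planar_flow(z, u, w, b):
    """One planar-flow step f(z) = z + u * tanh(w.z + b) together with
    log |det df/dz| = log |1 + u.psi(z)|, where psi(z) = tanh'(w.z + b) * w."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    h_prime = 1.0 - h * h                       # derivative of tanh
    z_out = [zi + ui * h for zi, ui in zip(z, u)]
    u_dot_psi = sum(ui * h_prime * wi for ui, wi in zip(u, w))
    log_det = math.log(abs(1.0 + u_dot_psi))    # matrix determinant lemma result
    return z_out, log_det
```

Composing K such steps and summing the returned `log_det` values gives exactly the density correction in Eq. (13).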
  17. Outline (section divider)
  18. Outline (section divider)
  19. Proposed methods
      ¤ 1. Probabilistic ladder network as the inference model
      ¤ 2. Warm-up of the variational regularization term
      ¤ 3. Batch normalization
  20. Probabilistic ladder network
      ¤ The probabilistic ladder network is inspired by the Ladder network [Valpola, 14][Rasmus+ 15]
      ¤ Unlike the original ladder network, it is a fully probabilistic inference model for a VAE
      ¤ Figure 2 (from the paper): a) probabilistic ladder network: a deterministic bottom-up "likelihood" pathway and a stochastic top-down "prior" pathway are combined (+) into the "posterior"; b) VAE: bottom-up pathway in the inference model only, with indirect top-down information entering through the KL divergences against the conditional priors of the generative model
  21. Top-down information
      ¤ In the VAE, inference is purely bottom-up: z_1 ∼ q(z_1|x), then z_2 ∼ q(z_2|z_1), …
      ¤ Top-down information from the generative terms p(z_L), p(z_{L−1}|z_L), …, p(x|z_1) enters only through the bound:
        E_{q_φ(z|x)}[log (p(z_L) p(z_{L−1}|z_L) ⋯ p(x|z_1)) / (q(z_1|x) q(z_2|z_1) ⋯ q(z_L|z_{L−1}))]
      ¤ Example with L = 2: samples z_1 ∼ q(z_1|x), z_2 ∼ q(z_2|z_1) are scored against p(z_1|z_2), p(z_2) and p(x|z_1)
  22. Inference in the probabilistic ladder network
      ¤ Used as the inference model of a VAE (Figure 1); the generative model is the same as before
      ¤ First a deterministic upward pass:
        d_1 = MLP(x)                                   (13)
        μ_{d,i} = Linear(d_i),  i = 1…L                (14)
        σ²_{d,i} = Softplus(Linear(d_i)),  i = 1…L     (15)
        d_i = MLP(μ_{d,i−1}),  i = 2…L                 (16)
      ¤ Followed by a stochastic downward pass:
        q_φ(z_L|x) = N(μ_{d,L}, σ²_{d,L})              (17)
        t_i = MLP(z_{i+1}),  i = 1…L−1                 (18)
        μ_{t,i} = Linear(t_i)                          (19)
        σ²_{t,i} = Softplus(Linear(t_i))               (20)
        q_θ(z_i|z_{i+1}, x) = N( (μ_{t,i} σ_{t,i}^{−2} + μ_{d,i} σ_{d,i}^{−2}) / (σ_{t,i}^{−2} + σ_{d,i}^{−2}),  1 / (σ_{t,i}^{−2} + σ_{d,i}^{−2}) )   (21)
      ¤ Eq. (21) is the precision-weighted combination of the bottom-up and top-down Gaussian estimates (cf. the product of Gaussians in PRML)
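The precision-weighted merge of Eq. (21) is the standard product-of-Gaussians rule: precisions (inverse variances) add, and the mean is the precision-weighted average of the two means. A minimal scalar sketch (one latent unit; not the paper's code):

```python
def precision_weighted(mu_t, var_t, mu_d, var_d):
    """Combine the top-down estimate (mu_t, var_t) and the bottom-up
    estimate (mu_d, var_d) as in Eq. (21): precisions 1/var add, and the
    posterior mean is the precision-weighted average of the two means."""
    prec = 1.0 / var_t + 1.0 / var_d          # combined precision
    mu = (mu_t / var_t + mu_d / var_d) / prec # precision-weighted mean
    return mu, 1.0 / prec                     # (posterior mean, posterior variance)
```

Two useful sanity checks: with equal variances the result is the plain average of the means, and when one pathway is much more confident (tiny variance) its mean dominates.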
  23. Outline (section divider)
  24. Inactive latent units
      ¤ Form B of the bound contains the variational regularization (KL) term:
        ℒ(θ, φ; x) = −KL[q_φ(z|x) ∥ p_θ(z)] + E_{q_φ(z|x)}[log p_θ(x|z)]
      ¤ This term causes some latent units to become inactive during training [MacKay 01]: the approximate posterior for unit k, q(z_{i,k}|…), is regularized towards its own prior p(z_{i,k}|…), a phenomenon also recognized in the VAE setting [Burda+ 15]
      ¤ This can be seen as a virtue of automatic relevance determination, but it is a problem when many units are pruned away early in training, before they have learned a useful representation
      ¤ Such units remain inactive for the rest of training, presumably trapped in a local minimum or saddle point at KL(q_{i,k}|p_{i,k}) ≈ 0, with the optimizer unable to re-activate them
  25. Warm-up and batch normalization
      ¤ Proposal: initialize training with the reconstruction error only (corresponding to a standard deterministic autoencoder), then gradually introduce the variational regularization term:
        ℒ(θ, φ; x)_T = −β KL(q_φ(z|x) ∥ p_θ(z)) + E_{q_φ(z|x)}[log p_θ(x|z)]   (22)
        where β is increased linearly from 0 to 1 during the first N_t epochs
      ¤ This scheme is denoted warm-up (WU): the objective goes from a delta-function (zero-temperature) solution towards the fully stochastic variational objective. A similar idea was used for Bayesian models trained with coordinate descent in Raiko et al. (2007, Sec. 6.2)
      ¤ Batch Normalization [Ioffe+ 15] (BN), applied to all layers except the output layers, is essential for learning deep hierarchies of latent variables with more than 2 stochastic layers
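The warm-up schedule of Eq. (22) is just a linear ramp on the KL coefficient. A minimal sketch (function names are ours, not the paper's):

```python
def warmup_beta(epoch, n_t):
    """KL warm-up weight beta: rises linearly from 0 to 1 over the first
    n_t epochs, then stays at 1.  beta = 0 means pure reconstruction,
    i.e. a deterministic autoencoder."""
    return min(1.0, epoch / float(n_t))

def warmed_elbo(recon_term, kl_term, epoch, n_t):
    # Eq. (22): L_T = -beta * KL(q || p) + E_q[log p(x|z)], with beta annealed
    return recon_term - warmup_beta(epoch, n_t) * kl_term
```

At epoch 0 the objective is reconstruction only; from epoch N_t onward it is the ordinary variational bound of Eq. (11).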
  26. Experimental setup
      ¤ Three datasets: MNIST, OMNIGLOT [Lake+ 13], NORB [LeCun+ 04]
      ¤ Latent layer sizes 64-32-16-8-4, i.e. up to 5 stochastic layers
      ¤ Two-layer MLPs between variables: 512 units between x and z_1, then 256, 128, 64 and 32 for subsequent layers
      ¤ Leaky rectifiers (max(x, 0.1x)) as non-linearity instead of tanh
      ¤ Trained with the Adam optimizer [Kingma & Ba, 2014]
      ¤ Reported test log-likelihoods use k = 5000 importance-weighted samples, following [Burda+ 15]
  27. MNIST results
      ¤ Plain VAEs do not benefit from more than 2 latent layers
      ¤ Batch normalization and warm-up both clearly increase the generative performance
      ¤ The probabilistic ladder network performs as well as or better than the VAE
      ¤ Figure 3 (from the paper): MNIST test-set log-likelihoods for VAEs and probabilistic ladder networks with different numbers of latent layers, with batch normalization (BN) and warm-up (WU)
      ¤ Figure 4 (from the paper): MNIST train (full lines) and test performance during training; the test-set performance was estimated with 5000 importance-weighted samples, a tighter bound than the training bound, explaining the better test performance
28. 28. Fine-tuning
¤ Fine-tuning with an annealed learning rate, more Monte Carlo (MC) samples of the expectation, and more importance weighted (IW) samples improves performance further
¤ Previous results on permutation invariant MNIST:
  ¤ -82.90 [Burda+ 15]
  ¤ -81.90 [Tran+ 15]
¤ This work: -81.30 (VAE), -81.20 (probabilistic ladder network)

Table 1. Fine-tuned test log-likelihood values for 5-layered VAE and probabilistic ladder networks trained on MNIST (Ann. LR: annealed learning rate; MC: Monte Carlo samples to approximate E_q[·]; IW: importance weighted samples):

  Finetuning    None    Ann.LR        Ann.LR        Ann.LR        Ann.LR
                        MC=1, IW=1    MC=10, IW=1   MC=1, IW=10   MC=10, IW=10
  VAE           -82.14  -81.97        -81.84        -81.41        -81.30
  Prob. ladder  -81.87  -81.54        -81.46        -81.35        -81.20
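The importance weighted estimate averages the k weights inside the log, which must be done in log-space for numerical stability. A minimal NumPy sketch (not the paper's code), assuming the log-weights log w_i = log p(x, z_i) − log q(z_i|x) have already been computed:

```python
import numpy as np

def log_mean_exp(log_w):
    """Numerically stable log( (1/k) * sum_i exp(log_w[i]) )."""
    log_w = np.asarray(log_w, dtype=float)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

# with identical weights, the k-sample estimate reduces to the 1-sample bound
print(log_mean_exp([-85.0, -85.0, -85.0]))  # -85.0
```

The estimate is a lower bound on log p(x) that tightens as k grows, which is why k = 5000 samples are used for evaluation.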
29. 29. Active latent units
¤ A latent unit is defined as active if KL(q_{i,k}||p_{i,k}) > 0.01
¤ The vanilla VAE only uses the lowest layers
¤ Batch normalization (BN) and warm-up (WU) activate many more units, especially in the higher layers
¤ The probabilistic ladder network has the most active units

Table 2. Number of active latent units in five-layer VAE and probabilistic ladder networks trained on MNIST:

            VAE   VAE+BN   VAE+BN+WU   Prob. ladder+BN+WU
  Layer 1   20    20       34          46
  Layer 2   1     9        18          22
  Layer 3   0     3        6           8
  Layer 4   0     3        2           3
  Layer 5   0     2        1           2
  Total     21    37       61          81
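The active-unit count can be computed from the per-unit KL divergence. A sketch under the simplifying assumption of a diagonal Gaussian q and a standard normal prior (in the paper the prior for lower layers is conditioned on the layer above; the standard normal case is shown here for brevity, and the data are illustrative):

```python
import numpy as np

def unit_kl(mu, sigma):
    """Per-unit KL(N(mu, sigma^2) || N(0, 1)), averaged over the batch (rows)."""
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    return (0.5 * (mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0)).mean(axis=0)

def count_active(mu, sigma, threshold=0.01):
    """A unit counts as active if its average KL to the prior exceeds the threshold."""
    return int(np.sum(unit_kl(mu, sigma) > threshold))

rng = np.random.default_rng(0)
mu = np.zeros((100, 4))
sigma = np.ones((100, 4))
mu[:, 0] = rng.normal(0.0, 2.0, size=100)  # only unit 0 encodes information
print(count_active(mu, sigma))  # 1
```

A collapsed unit has q(z_k|x) = N(0, 1) for every x, so its KL term is exactly zero and it falls below the 0.01 threshold.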
30. 30. Results: OMNIGLOT and NORB
¤ OMNIGLOT: trained as for MNIST, but for 1500 epochs
¤ NORB: 32x32 gray-scale images rescaled to [0, 1]; Gaussian observation model with linear mean and softplus variance output layers, tanh nonlinearities, learning rate 0.002, N_t = 1000, 4000 epochs
¤ BN and WU consistently improve performance for each additional latent layer
¤ On NORB the probabilistic ladder network does not beat the best VAE+BN+WU model

Table 3. Test set log-likelihood values for models trained on the OMNIGLOT and NORB datasets (left column: latent layer sizes):

  OMNIGLOT       VAE       VAE+BN    VAE+BN+WU   Prob. ladder+BN+WU
  64             -114.45   -108.79   -104.63     -
  64-32          -112.60   -106.86   -102.03     -102.12
  64-32-16       -112.13   -107.09   -101.60     -101.26
  64-32-16-8     -112.49   -107.66   -101.68     -101.27
  64-32-16-8-4   -112.10   -107.94   -101.86     -101.59

  NORB           VAE       VAE+BN    VAE+BN+WU   Prob. ladder+BN+WU
  64             2630.8    3263.7    3481.5      -
  64-32          2830.8    3140.1    3532.9      3522.7
  64-32-16       2757.5    3247.3    3346.7      3458.7
  64-32-16-8     2832.0    3302.3    3393.6      3499.4
  64-32-16-8-4   3064.1    3258.7    3393.6      3430.3
31. 31. When do units collapse?
¤ To study this effect, the KL divergence between q(z_{i,k}|z_{i-1}) and p(z_{i,k}|z_{i+1}) was calculated for each stochastic latent unit k during training
¤ The term is zero if the inference model is independent of the data, i.e. q(z_{i,k}|z_{i-1}) = q(z_{i,k}): the unit has collapsed onto the prior and carries no information about the data
¤ In the vanilla VAE the higher layers collapse early in training and never recover
¤ With BN, WU and the probabilistic ladder network, far more units stay active throughout training
32. 32. Latent representations (PCA)
¤ PCA visualizations of samples from each latent layer
¤ In the vanilla VAE the upper layers collapse to the prior
¤ With BN (and WU) the higher layers retain structure
¤ The probabilistic ladder network shows structured representations in all layers
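Visualizations like these can be reproduced with plain PCA on samples from each q(z_i|·). A minimal NumPy sketch using SVD (the data here are random stand-ins, not the paper's):

```python
import numpy as np

def pca_2d(z):
    """Project latent samples z (n_samples x n_dims) onto the top-2 principal components."""
    zc = z - z.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions
    _, _, vt = np.linalg.svd(zc, full_matrices=False)
    return zc @ vt[:2].T

rng = np.random.default_rng(1)
z = rng.normal(size=(500, 16))  # stand-in for samples from a 16-unit latent layer
proj = pca_2d(z)
print(proj.shape)  # (500, 2)
```

The projected points can then be scatter-plotted, colored by class label, to inspect how much class structure each layer captures.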
33. 33. Hierarchical latent structure
¤ The hierarchical models allow highly flexible distributions in the lower layers conditioned on the layers above
¤ The divergence from the restrictive mean-field approximation is measured as the KL divergence between q(z_i|z_{i-1}) and a standard normal distribution for models trained on MNIST (Figure 6a)
¤ As expected, the lower layers have highly non-(standard-)Gaussian distributions when conditioned on the layers above
¤ In the vanilla VAE the KL divergence is approximately zero above the second layer, confirming those layers are inactive
¤ The probabilistic ladder network has more active intermediate layers than the VAE with BN and WU, possibly because the deterministic upward pass eases the flow of information to the intermediate and upper layers
¤ The structured high-level latent representations are likely useful for semi-supervised learning
¤ Figure 6b shows generative samples from the probabilistic ladder network
34. 34. ¤ Probabilistic ladder network
35. 35. Outline
36. 36. Conclusion
¤ Three techniques for training deep VAEs with multiple layers of stochastic latent variables:
  ¤ the probabilistic ladder network
  ¤ batch normalization
  ¤ warm-up
¤ State-of-the-art log-likelihood results on permutation invariant MNIST and OMNIGLOT
¤ The probabilistic ladder network performs as well as or better than the VAE and is an interesting model for further study
37. 37. Outline
38. 38. Implementing VAEs
¤ Implementing each VAE variant from scratch is tedious
¤ → build models by composing neural network blocks instead
¤ Existing libraries for this: Lasagne and Parmesan
39. 39. VAE implementation example
¤ A VAE defined by composing layers
¤ The same interface can extend to VAE variants such as ladder and RNN models

# recognition model (encoder) q(z|x)
q_h = [Dense(rng, 28*28, 200, activation=T.nnet.relu),
       Dense(rng, 200, 200, activation=T.nnet.relu)]
q_mean = [Dense(rng, 200, 50, activation=None)]
q_sigma = [Dense(rng, 200, 50, activation=T.nnet.softplus)]
q = [Gaussian(q_h, q_mean, q_sigma)]

# generative model (decoder) p(x|z)
p_h = [Dense(rng, 50, 200, activation=T.nnet.relu),
       Dense(rng, 200, 200, activation=T.nnet.relu)]
p_mean = [Dense(rng, 200, 28*28, activation=T.nnet.sigmoid)]
p = [Bernoulli(p_h, p_mean)]

# VAE
model = VAE_z_x(q, p, k=1, alpha=1, random=rseed)
# train
model.train(train_x)
# test log-likelihood (importance weighted)
log_likelihood_test = model.log_likelihood_test(test_x, k=1000, mode='iw')
# generate samples
sample_x = model.p_sample_mean_x(sample_z)

¤ Future work: Rényi divergence, CNN layers, Lasagne
