How to Train Deep Variational
Autoencoders and Probabilistic
Ladder Networks
¤ ICML 2016
¤ Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren
Kaae Sønderby, Ole Winther (Technical University of Denmark)
¤ Maaløe VAE
¤ state-of-the-art
¤ Ladder network VAE
¤ warm-up batch normalization
Variational Autoencoder
¤ Variational Autoencoder [Kingma+ 13][Rezende+ 14]
¤ Inference model (encoder): $q_\phi(z \mid x)$
¤ Generative model (decoder): $x \sim p_\theta(x \mid z)$, $z \sim p_\theta(z)$
[…] generative performance, measured in terms of […] likelihood, when compared to equally or more […] methods for creating flexible variational distributions such as the Variational Gaussian Processes (Tran et al., 2015), Normalizing Flows (Rezende & Mohamed, 2015) […] Importance Weighted Autoencoders (Burda et al., 2015). We find that batch normalization and warm-up […] increase the generative performance, suggesting that these methods are broadly useful. We also show that the probabilistic ladder network performs as well as or better than […] VAEs, making it an interesting model for further studies. Secondly, we study the learned latent representations. We find that the methods proposed here are […] for learning rich latent representations utilizing […] layers of latent variables. A qualitative assessment of the latent representations further indicates that the multi-[…] DGMs capture high level structure in the datasets […] likely to be useful for semi-supervised learning.
In summary our contributions are:
A new parametrization of the VAE inspired by the Ladder network, performing as well or better than the current best models.
A novel warm-up period in training increasing both […]
The generative model $p_\theta$ is specified as follows:
$$p_\theta(x \mid z_1) = \mathcal{N}\left(x \mid \mu_\theta(z_1), \sigma^2_\theta(z_1)\right) \quad \text{or} \tag{1}$$
$$p_\theta(x \mid z_1) = \mathcal{B}\left(x \mid \mu_\theta(z_1)\right) \tag{2}$$
for continuous-valued (Gaussian $\mathcal{N}$) or binary-valued (Bernoulli $\mathcal{B}$) data, respectively. The latent variables $z$ are split into $L$ layers $z_i$, $i = 1 \ldots L$:
$$p_\theta(z_i \mid z_{i+1}) = \mathcal{N}\left(z_i \mid \mu_{\theta,i}(z_{i+1}), \sigma^2_{\theta,i}(z_{i+1})\right) \tag{3}$$
$$p_\theta(z_L) = \mathcal{N}(z_L \mid 0, I). \tag{4}$$
The hierarchical specification allows the lower layers of the latent variables to be highly correlated but still maintain the computational efficiency of fully factorized models.
Each layer in the inference model $q_\phi(z \mid x)$ is specified using a fully factorized Gaussian distribution:
$$q_\phi(z_1 \mid x) = \mathcal{N}\left(z_1 \mid \mu_{\phi,1}(x), \sigma^2_{\phi,1}(x)\right) \tag{5}$$
$$q_\phi(z_i \mid z_{i-1}) = \mathcal{N}\left(z_i \mid \mu_{\phi,i}(z_{i-1}), \sigma^2_{\phi,i}(z_{i-1})\right) \tag{6}$$
for $i = 2 \ldots L$.
¤ $\mathcal{N}(x \mid \mu, \sigma^2)$
¤ $\mathcal{B}(x \mid \mu)$
The functions $\mu(\cdot)$ and $\sigma^2(\cdot)$ in both the generative and the inference models are implemented as:
$$d(y) = \mathrm{MLP}(y) \tag{7}$$
$$\mu(y) = \mathrm{Linear}(d(y)) \tag{8}$$
$$\sigma^2(y) = \mathrm{Softplus}(\mathrm{Linear}(d(y))), \tag{9}$$
where MLP is a two-layer multilayer perceptron network, Linear is a single linear layer, and Softplus applies the $\log(1 + \exp(\cdot))$ nonlinearity to each component of its argument vector. In our notation, each $\mathrm{MLP}(\cdot)$ or $\mathrm{Linear}(\cdot)$ gives a new mapping with its own parameters, so the deterministic variable $d$ is used to mark that the MLP part is shared between $\mu$ and $\sigma^2$ whereas the last Linear layer is not shared.
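A minimal sketch of the two output heads in Eqs. (8) and (9), assuming the shared MLP features `d` have already been computed. The helper names (`softplus`, `linear`, `gaussian_params`) and the list-based weight layout are illustrative choices, not taken from the paper.

```python
import math


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def linear(weights, bias, v):
    """Single linear layer; `weights` is a list of rows (hypothetical layout)."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def gaussian_params(d, mu_layer, var_layer):
    """Two separate Linear heads on the shared features d.

    mu_layer and var_layer are (weights, bias) pairs with their own parameters;
    only d is shared between the mean and the variance.
    """
    mu = linear(*mu_layer, d)
    var = [softplus(a) for a in linear(*var_layer, d)]  # always non-negative
    return mu, var
```

The Softplus on the variance head keeps the variance non-negative without clipping, which is why only that head has a nonlinearity.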
Variational autoencoders are a powerful framework for unsupervised learning. However, previous work has been restricted to shallow models with one or two layers of fully factorized stochastic latent variables, limiting the flexibility of the latent representation. We propose three advances in training algorithms of variational autoencoders, for the first time allowing to train deep models of up to five stochastic layers, (1) using a structure similar to the Ladder network as the inference model, (2) warm-up period to support stochastic units staying active in early training, and (3) use of batch normalization. Using these improvements we show state-of-the-art log-likelihood results for generative modeling on several benchmark datasets.
1. Introduction
The recently introduced variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014) provides a framework for deep generative models (DGM). DGMs have later been shown to be a powerful framework for semi-supervised learning (Kingma et al., 2014; Maaløe […]
[Figure 1. Inference (or encoder/rec[…]) […] decoder) models. a) VAE inference […] [lad]der inference model and c) generati[ve …] variables sampled from the approxi[…] with mean and variances parameteri[…]]
[…] [vari]ables, each conditioned on the l[…] highly flexible latent distribution[…] model parameterizations: the fir[st …] the VAE to multiple layers of la[tent …] [sec]ond is parameterized in such a w[ay …] as a probabilistic variational vari[…]
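$$\log p(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathcal{L}(\theta, \phi; x)$$
$$\nabla_{\theta,\phi}\mathcal{L}(\theta, \phi; x) = \nabla_{\theta,\phi}\mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{\mathcal{N}(0, 1)}\left[\nabla_{\theta,\phi}\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \approx \frac{1}{M}\sum_{m=1}^{M}\nabla_{\theta,\phi}\log \frac{p_\theta(x, z^{(m)})}{q_\phi(z^{(m)} \mid x)}$$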
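The step that moves the gradient inside an expectation over standard normal noise relies on the reparametrization trick: a sample from a diagonal Gaussian is written as a deterministic function of its mean, its variance and noise drawn from N(0, 1). A minimal sketch, with the function name `reparameterize` and the explicit `rng` argument chosen here for illustration:

```python
import math
import random


def reparameterize(mu, var, rng):
    """Draw z = mu + sqrt(var) * eps with eps ~ N(0, 1), one eps per dimension.

    Because the noise is drawn independently of mu and var, z is a
    differentiable function of both, which is what lets the gradient
    move inside the expectation.
    """
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


# Usage: one sample from a 2-dimensional Gaussian posterior.
sample = reparameterize([0.0, 1.0], [1.0, 0.25], random.Random(0))
```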
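$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
$$\mathcal{L}(\theta, \phi; x) = -KL\left[q_\phi(z \mid x) \,\|\, p_\theta(z)\right] + \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]$$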
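For a single latent layer, the KL term in this decomposition has a closed form. The sketch below assumes a diagonal Gaussian posterior and a standard normal prior (as in Eq. (4) for the top layer); the function name is illustrative.

```python
import math


def kl_diag_gaussian_to_standard_normal(mu, var):
    """KL[N(mu, diag(var)) || N(0, I)] = 0.5 * sum(mu^2 + var - log(var) - 1)."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))


# The KL is zero exactly when the posterior equals the prior.
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0])
```

With this term available analytically, only the reconstruction expectation needs Monte Carlo sampling.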
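$$p_\theta(x, z) = p_\theta(z_L)\, p_\theta(z_{L-1} \mid z_L) \cdots p_\theta(x \mid z_1)$$
$$q_\phi(z \mid x) = q_\phi(z_1 \mid x)\, q_\phi(z_2 \mid z_1) \cdots q_\phi(z_L \mid z_{L-1})$$
$$\mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(z_L)\, p(z_{L-1} \mid z_L) \cdots p(x \mid z_1)}{q(z_1 \mid x)\, q(z_2 \mid z_1) \cdots q(z_L \mid z_{L-1})}\right]$$
$$-KL\left[q_\phi(z \mid x) \,\|\, p_\theta(z)\right] + \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] = -KL\left[q_\phi(z \mid x) \,\|\, p_\theta(z_L)\, p_\theta(z_{L-1} \mid z_L) \cdots p_\theta(z_1 \mid z_2)\right] + \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z_1)\right]$$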
¤ CVAE [Kingma+ 2014]
¤ Variational Fair Auto Encoder [Louizos+ 15]
¤ X sensitive '(%|X) MMD
The inference network $q_\phi(z \mid x)$ (3) is used during training of the model using both the labelled and unlabelled data sets. This approximate posterior is then used as a feature extractor for the labelled data set, and the features used for training the classifier.
3.1.2 Generative Semi-supervised Model Objective
For this model, we have two cases to consider. In the first case, the label corresponding to a data point is observed and the variational bound is a simple extension of equation (5):
$$\log p_\theta(x, y) \ge \mathbb{E}_{q_\phi(z \mid x, y)}\left[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z \mid x, y)\right] = -\mathcal{L}(x, y), \tag{6}$$
For the case where the label is missing, it is treated as a latent variable over which we perform posterior inference and the resulting bound for handling data points with an unobserved label $y$ is:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(y, z \mid x)}\left[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(y, z \mid x)\right] = \sum_y q_\phi(y \mid x)\left(-\mathcal{L}(x, y)\right) + \mathcal{H}(q_\phi(y \mid x)) = -\mathcal{U}(x). \tag{7}$$
The bound on the marginal likelihood for the entire dataset is now:
$$\mathcal{J} = \sum_{(x, y)\ \text{labelled}} \mathcal{L}(x, y) + \sum_{x\ \text{unlabelled}} \mathcal{U}(x) \tag{8}$$
The distribution $q_\phi(y \mid x)$ (4) for the missing labels has the form of a discriminative classifier, and we can use this knowledge to construct the best classifier possible as our inference model. This distribution is also used at test time for predictions of any unseen data.
In the objective function (8), the label predictive distribution $q_\phi(y \mid x)$ contributes only to the second term relating to the unlabelled data, which is an undesirable property if we wish to use this distribution as a classifier. Ideally, all model and variational parameters should learn in all cases. To remedy […]
¤ Auxiliary Deep Generative Models [Maaløe+ 16]
¤ Auxiliary variables [Agakov+ 2004]
¤ state-of-the-art
Auxiliary Deep Generative Models
[Auxilia]ry deep generative models
[Ki]ngma (2013); Rezende et al. (2014) have cou[pled the appr]oach of variational inference with deep learn[ing, giving ris]e to powerful probabilistic models constructed [by an infere]nce neural network $q(z|x)$ and a generative [neural netwo]rk $p(x|z)$. This approach can be perceived as [a variational] equivalent to the deep auto-encoder, in which $q(z|x)$ acts as the encoder and $p(x|z)$ the decoder. However, the difference is that these models ensure efficient inference over various continuous distributions in the latent space $z$ and complex input datasets $x$, where the posterior distribution $p(z|x)$ is intractable. Furthermore, the gradients of the variational upper bound are easily defined by backpropagation through the network(s). To keep the computational requirements low the variational distribution $q(z|x)$ is usually chosen to be a diagonal Gaussian, limiting the expressive power of the inference model.
In this paper we propose a variational auxiliary variable approach (Agakov and Barber, 2004) to improve the variational distribution: The generative model is extended with variables $a$ to $p(x, z, a)$ such that the original model is invariant to marginalization over $a$: $p(x, z, a) = p(a|x, z)p(x, z)$. In the variational distribution, on the other hand, $a$ is used such that the marginal $q(z|x) = \int q(z|a, x)q(a|x)\,da$ is a general non-Gaussian distribution. This hierarchical specification allows the latent variables to be correlated through $a$, while maintaining the computational efficiency of fully factorized models (cf. Fig[…]
[Figure 1. Probabilistic graphical model of the ADGM for semi-supervised learning. (a) Generative model P. (b) Inference model Q. The incoming joint connections to each variable are deep neural networks with parameters $\theta$ and $\phi$.]
2.2. Auxiliary variables
We propose to extend the variational distribution with auxiliary variables $a$: $q(a, z|x) = q(z|a, x)q(a|x)$ such that the marginal distribution $q(z|x)$ can fit more complicated posteriors $p(z|x)$. In order to have an unchanged generative model, $p(x|z)$, it is required that the joi[nt …] $p(x, z, a)$ gives back the original $p(x, z)$ under m[arginal]ization over $a$, thus $p(x, z, a) = p(a|x, z)p(x, z)$. [Aux]iliary variables are used in the EM algorithm an[d …] sampling and have previously been considered fo[r varia]tional learning by Agakov and Barber (2004). R[…] Ranganath et al. (2015) have proposed to make the […]
Importance Weighted AE
¤ Importance Weighted Autoencoder [Burda+ 15]
¤ k
¤ Rényi [Li+ 16]
¤ Y
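$$\log p(x) \ge \mathbb{E}_{q_\phi(z^{(1)} \mid x) \cdots q_\phi(z^{(k)} \mid x)}\left[\log \frac{1}{k}\sum_{i=1}^{k}\frac{p_\theta(x, z^{(i)})}{q_\phi(z^{(i)} \mid x)}\right] \ge \mathcal{L}(\theta, \phi; x)$$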
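In practice the importance weighted bound is estimated from k samples by averaging the importance weights in log space. The sketch below assumes the per-sample log joint and log posterior densities are already available as lists; `log_mean_exp` and `iw_bound_estimate` are illustrative names.

```python
import math


def log_mean_exp(log_w):
    """log of the mean of exp(log_w), shifted by the max for stability."""
    m = max(log_w)
    return m + math.log(sum(math.exp(lw - m) for lw in log_w) / len(log_w))


def iw_bound_estimate(log_p_joint, log_q):
    """Single Monte Carlo estimate of the k-sample importance weighted bound.

    log_p_joint[i] = log p(x, z_i) and log_q[i] = log q(z_i | x)
    for k samples z_i drawn from q(z | x).
    """
    log_w = [lp - lq for lp, lq in zip(log_p_joint, log_q)]
    return log_mean_exp(log_w)
```

With k = 1 the estimate reduces to a single log importance weight, which is the usual one-sample estimate of the standard bound; by Jensen's inequality the log-mean-exp is never smaller than the plain average of the log weights.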
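Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Perhaps c[…] variational free-energy approaches be generalised to the Rényi case? Consider approximating the tr[ue] posterior $p(\theta|D)$ by minimizing Rényi's $\alpha$-divergence (for some selected $\alpha \ge 0$):
$$q(\theta) = \arg\min_{q} D_\alpha[q(\theta)\,\|\,p(\theta|D)]. \quad (1[…]$$
Now we verify the alternative optimization problem
$$q(\theta) = \arg\max_{q} \; \log p(D) - D_\alpha[q(\theta)\,\|\,p(\theta|D)]. \quad (1[…]$$
When $\alpha \ne 1$, the objective can be rewritten as
$$\log p(D) - \frac{1}{\alpha - 1}\log\int q(\theta)^{\alpha}\, p(\theta|D)^{1-\alpha}\, d\theta = \log p(D) - \frac{1}{\alpha - 1}\log \mathbb{E}_q\left[\left(\frac{p(\theta, D)}{q(\theta)\,p(D)}\right)^{1-\alpha}\right] = \frac{1}{1 - \alpha}\log \mathbb{E}_q\left[\left(\frac{p(\theta, D)}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q; D).$$
We name this new objective the variational Rényi bound (VR). Importantly the following theorem is a direct result of Proposition 1.
Theorem 1. The objective $\mathcal{L}_\alpha(q; D)$ is continuous and non-increasing on $\alpha \in [0, 1] \cup \{|\mathcal{L}_\alpha| < +\infty\}$. Especially for all $0 < \alpha < 1$, […]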
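The variational Rényi bound can be estimated from samples in the same way as an importance weighted bound, by taking a log-mean-exp of the scaled log importance weights. This is a hedged sketch with an illustrative function name; the finite-sample estimate is biased and is not claimed to be the estimator used in the quoted paper.

```python
import math


def vr_bound_estimate(log_w, alpha):
    """Monte Carlo estimate of (1 / (1 - alpha)) * log E_q[w^(1 - alpha)].

    log_w[i] = log p(theta_i, D) - log q(theta_i) for samples theta_i ~ q.
    alpha = 1 is excluded because the expression is undefined there.
    """
    if alpha == 1.0:
        raise ValueError("alpha must differ from 1")
    scaled = [(1.0 - alpha) * lw for lw in log_w]
    m = max(scaled)
    log_mean = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean / (1.0 - alpha)
```

On a fixed set of weights the estimate does not increase as alpha grows from 0 towards 1, mirroring the monotonicity stated in Theorem 1.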
¤ Normalizing flows [Rezende+ 15]
¤ Variational Gaussian Process [Tran+ 15]
Variational Inference with Normalizing Flows
[…] and involve matrix inverses that can be numerically unstable. We therefore require normalizing flows that allow for low-cost computation of the determinant, or where the Jacobian is not needed at all.
4.1. Invertible Linear-time Transformations
We consider a family of transformations of the form:
$$f(z) = z + u\,h(w^\top z + b), \tag{10}$$
where $\lambda = \{w \in \mathbb{R}^D, u \in \mathbb{R}^D, b \in \mathbb{R}\}$ are free parameters and $h(\cdot)$ is a smooth element-wise non-linearity, with derivative $h'(\cdot)$. For this mapping we can compute the logdet-Jacobian term in $O(D)$ time (using the matrix determinant lemma):
$$\psi(z) = h'(w^\top z + b)\,w \tag{11}$$
$$\left|\det\frac{\partial f}{\partial z}\right| = \left|\det\left(I + u\,\psi(z)^\top\right)\right| = \left|1 + u^\top\psi(z)\right|. \tag{12}$$
From (7) we conclude that the density $q_K(z)$ obtained by transforming an arbitrary initial density $q_0(z)$ through the sequence of maps $f_k$ of the form (10) is implicitly given by:
$$z_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(z)$$
$$\ln q_K(z_K) = \ln q_0(z) - \sum_{k=1}^{K}\ln\left|1 + u_k^\top\psi_k(z_{k-1})\right|. \tag{13}$$
The flow defined by the transformation (13) modifies the initial density $q_0$ by applying a series of contractions and expansions in the direction perpendicular to the hyperplane $w^\top z + b = 0$, hence we refer to these maps as planar flows.
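A single planar flow step and its log-determinant can be sketched as follows, assuming $h = \tanh$ (the excerpt only requires a smooth element-wise nonlinearity). The function name and the list-based vectors are illustrative.

```python
import math


def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b) and return (f(z), ln|det df/dz|).

    The log-determinant costs O(D): psi(z) = h'(w.z + b) * w, so
    u.psi(z) = h'(w.z + b) * (u.w), with no matrix determinant needed.
    """
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    h_prime = 1.0 - h * h  # derivative of tanh
    f = [zi + ui * h for zi, ui in zip(z, u)]
    u_dot_psi = h_prime * sum(ui * wi for ui, wi in zip(u, w))
    log_det = math.log(abs(1.0 + u_dot_psi))
    return f, log_det


# Usage: the log density after one step is ln q1(f(z)) = ln q0(z) - log_det.
z1, log_det = planar_flow([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], 0.0)
```

The sketch does not enforce invertibility; with tanh this needs a constraint such as $w^\top u \ge -1$, which is an assumption to check separately.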
As an alternative, we can consider a family of transformations that modify an initial density $q_0$ around a reference point $z_0$. The transformation family is:
$$f(z) = z + \beta\,h(\alpha, r)(z - z_0), \tag{14}$$
$$\left|\det\frac{\partial f}{\partial z}\right| = […]$$
[Figure 1. Effect of normalizing flow on two distributions. Panels: $q_0$, K=1, K=2, K=10, for planar and radial flows.]
[Figure 2. Inference and generative models. Left: Inference network maps the observations to the parameters of the flow; Right: generative model which receives the posterior samples from the inference network during training time. Round containers represent layers of stochastic variables whereas square containers represent deterministic layers.]
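4.2. Flow-Based Free Energy Bound
If we parameterize the approximate posterior distribution with a flow of length $K$, $q_\phi(z|x) := q_K(z_K)$, the free energy (3) can be written as an expectation over the initial distribution $q_0(z)$:
$$\mathcal{F}(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log q_\phi(z|x) - \log p(x, z)\right]$$
$$= \mathbb{E}_{q_0(z_0)}\left[\ln q_K(z_K) - \log p(x, z_K)\right]$$
$$= \mathbb{E}_{q_0(z_0)}\left[\ln q_0(z_0)\right] - \mathbb{E}_{q_0(z_0)}\left[\log p(x, z_K)\right] - \mathbb{E}_{q_0(z_0)}\left[\sum_{k=1}^{K}\ln\left|1 + u_k^\top\psi_k(z_{k-1})\right|\right]$$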
[…] distribution:
$$q(z') = q(z)\left|\det\frac{\partial f^{-1}}{\partial z'}\right| = q(z)\left|\det\frac{\partial f}{\partial z}\right|^{-1}, \tag{5}$$
where the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying (5). The density $q_K(z)$ obtained by successively transforming a random variable $z_0$ with distribution $q_0$ through a chain of $K$ transformations $f_k$ is:
$$z_K = f_K \circ \cdots \circ f_2 \circ f_1(z_0) \tag{6}$$
$$\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K}\ln\left|\det\frac{\partial f_k}{\partial z_{k-1}}\right|, \tag{7}$$
where equation (6) will be used throughout the paper as a shorthand for the composition $f_K(f_{K-1}(\ldots f_1(x)))$. The path traversed by the random variables $z_k = f_k(z_{k-1})$ with initial distribution $q_0(z_0)$ is called the flow and the path formed by the successive distributions $q_k$ is a normalizing flow. A property of such transformations, often referred to as the law of the unconscious statistician (LOTUS), is that expectations w.r.t. the transformed density $q_K$ can be computed without explicitly knowing $q_K$. Any expectation $\mathbb{E}_{q_K}[h(z)]$ can be written as an expectation under $q_0$ as:
$$\mathbb{E}_{q_K}[h(z)] = \mathbb{E}_{q_0}\left[h(f_K \circ f_{K-1} \circ \cdots \circ f_1(z_0))\right], \tag{8}$$
which does not require computation of the logdet-Jacobian terms when $h(z)$ does not depend on $q_K$.
We can understand the effect of invertible flows as a sequence of expansions or contractions on the initial density. For an expansion, the map $z' = f(z)$ pulls the points $z$ away from a region in $\mathbb{R}^d$, reducing the density in that region while increasing the density outside the region. Con[…]
Under review as a conference paper at ICLR 2016
Figure 1: (a) Graphical model of the variational Gaussian process. The VGP generates samples of latent variables $z$ by evaluating random non-linear mappings of latent inputs $\xi$, and then drawing mean-field samples parameterized by the mapping. These latent variables aim to follow the posterior distribution for a generative model (b), conditioned on data $x$.
¤ 1 2
¤ 3
¤ Ladder network [Valpola, 14][Rasmus+ 15]
probabilistic ladder network
¤ ladder network
[Figure 2 panels: a) probabilistic ladder network, b) VAE. Labels: bottom-up pathway in inference model; top-down pathway through KL-divergences in generative model; indirect top[-down …] through prior; direct flow of […]]
Figure 2. Flow of information in the inference and generative models of a) probabilistic ladder network and b) VAE. The probabilistic ladder network allows direct integration (+ in figure, see Eq. (21)) of bottom-up and top-down information in the inference model. In the VAE the top-down information is incorporated indirectly through the conditional priors in the generative model.
Probabilistic ladder network VAE
bottom-up top-down
¤ bottom-up
¤ top-down
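$$\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p(z_L)\, p(z_{L-1} \mid z_L) \cdots p(x \mid z_1)}{q(z_1 \mid x)\, q(z_2 \mid z_1) \cdots q(z_L \mid z_{L-1})}\right]$$
$z_1 \sim q(z_1 \mid x)$, $z_2 \sim q(z_2 \mid z_1)$
[Diagram labels: $p(z_2)$, $p(z_1)$, $p(x \mid z_1)$; samples $z_1 \sim q(z_1)$, $z_2 \sim q(z_2)$]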
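The variational principle provides a tractable lower bound on the log likelihood which can be used as a training criterion $\mathcal{L}$.
$$\log p(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathcal{L}(\theta, \phi; x) \tag{10}$$
$$= -KL\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right], \tag{11}$$
where KL is the Kullback-Leibler divergence.
[…] the likelihood may be obtained […] increase of samples by using the […] (Burda et al., 2015):
$$\log p(x) \ge \mathbb{E}_{q_\phi(z^{(1)}|x)} \cdots \mathbb{E}_{q_\phi(z^{(K)}|x)}\left[\log \frac{1}{K}\sum_{k=1}^{K}\frac{p_\theta(x, z^{(k)})}{q_\phi(z^{(k)}|x)}\right] \ge \mathcal{L}(\theta, \phi; x). \tag{12}$$
The […] parameters, $\theta$ and $\phi$, are […] Eq. (11) using stochastic […] use the reparametrization trick […] through the Gaussian latent […] (Kingma & Welling, 2013; Rezende et al., 2014). […] using Monte Carlo sam[pling …] corresponding $q$ distribution.
[…] 2015) as the inference model of a VAE, as shown in Figure 1. The generative model is the same as before.
The inference is constructed to first make a deterministic upward pass:
$$d_1 = \mathrm{MLP}(x) \tag{13}$$
$$\mu_{d,i} = \mathrm{Linear}(d_i), \quad i = 1 \ldots L \tag{14}$$
$$\sigma^2_{d,i} = \mathrm{Softplus}(\mathrm{Linear}(d_i)), \quad i = 1 \ldots L \tag{15}$$
$$d_i = \mathrm{MLP}(\mu_{d,i-1}), \quad i = 2 \ldots L \tag{16}$$
followed by a stochastic downward pass:
$$q_\phi(z_L|x) = \mathcal{N}\left(\mu_{d,L}, \sigma^2_{d,L}\right) \tag{17}$$
$$t_i = \mathrm{MLP}(z_{i+1}), \quad i = 1 \ldots L-1 \tag{18}$$
$$\mu_{t,i} = \mathrm{Linear}(t_i) \tag{19}$$
$$\sigma^2_{t,i} = \mathrm{Softplus}(\mathrm{Linear}(t_i)) \tag{20}$$
$$q_\theta(z_i|z_{i+1}, x) = \mathcal{N}\left(\frac{\mu_{t,i}\,\sigma_{t,i}^{-2} + \mu_{d,i}\,\sigma_{d,i}^{-2}}{\sigma_{t,i}^{-2} + \sigma_{d,i}^{-2}},\; \frac{1}{\sigma_{t,i}^{-2} + \sigma_{d,i}^{-2}}\right). \tag{21}$$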
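The last step of the downward pass combines the top-down and bottom-up Gaussian estimates by weighting each mean with its precision (inverse variance). A minimal per-dimension sketch, assuming both variances are strictly positive; the function name is illustrative.

```python
def precision_weighted_merge(mu_t, var_t, mu_d, var_d):
    """Combine top-down (t) and bottom-up (d) Gaussian parameters.

    The merged mean is the precision-weighted average of the two means and
    the merged variance is the inverse of the summed precisions.
    """
    prec_t = [1.0 / v for v in var_t]
    prec_d = [1.0 / v for v in var_d]
    var = [1.0 / (pt + pd) for pt, pd in zip(prec_t, prec_d)]
    mu = [(mt * pt + md * pd) * v
          for mt, pt, md, pd, v in zip(mu_t, prec_t, mu_d, prec_d, var)]
    return mu, var


# Usage: two equally confident estimates meet halfway and the variance halves.
mu, var = precision_weighted_merge([0.0], [1.0], [2.0], [1.0])
```

The merged variance is always smaller than either input variance, so the more confident pathway dominates the merged mean.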
¤ B
[MacKay 01]
¤ ^
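$$\mathcal{L}(\theta, \phi; x) = -KL\left[q_\phi(z \mid x) \,\|\, p_\theta(z)\right] + \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]$$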
[…] phenomenon. The probabilistic ladder network provides a framework with the wanted interaction, while keeping complications manageable. A further extension could be to make the inference in k steps over an iterative inference procedure (Raiko et al., 2014).
2.2. Warm-up from deterministic to variational
The variational training criterion in Eq. (11) contains the reconstruction term $p_\theta(x|z)$ and the variational regularization term. The variational regularization term causes some of the latent units to become inactive during training (MacKay, 2001) because the approximate posterior for unit $k$, $q(z_{i,k}|\ldots)$, is regularized towards its own prior $p(z_{i,k}|\ldots)$, a phenomenon also recognized in the VAE setting (Burda et al., 2015). This can be seen as a virtue of automatic relevance determination, but also as a problem when many units are pruned away early in training before they have learned a useful representation. We observed that such units remain inactive for the rest of the training, presumably trapped in a local minimum or saddle point at $KL(q_{i,k}|p_{i,k}) \approx 0$, with the optimization algorithm unable to re-activate them.
¤ Batch Normalization[Ioffe+ 15]
¤ 2
We propose to alleviate the problem by initializing training using the reconstruction error only (corresponding to training a standard deterministic auto-encoder), and then gradually introducing the variational regularization term:
$$\mathcal{L}(\theta, \phi; x)_T = -\beta\,KL\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right], \tag{22}$$
where $\beta$ is increased linearly from 0 to 1 during the first $N_t$ epochs of training. We denote this scheme warm-up (abbreviated WU in tables and graphs) because the objective goes from having a delta-function solution (corresponding to zero temperature) and then moves towards the fully stochastic variational objective. A similar idea has previously been considered in Raiko et al. (2007, Section 6.2), however here used for Bayesian models trained with a coordinate descent algorithm.
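The warm-up schedule can be sketched as a linear ramp on the weight of the KL term (called `beta` in this sketch). The zero-based epoch counter, the function names, and the choice that the weight reaches 1 exactly at epoch `n_warmup` are assumptions made here for illustration.

```python
def warmup_beta(epoch, n_warmup):
    """KL weight: 0 at epoch 0, rising linearly to 1 at epoch n_warmup."""
    if n_warmup <= 0:
        return 1.0  # no warm-up: use the full variational objective
    return min(1.0, max(0.0, epoch / n_warmup))


def warmup_objective(recon_log_lik, kl, epoch, n_warmup):
    """Warm-up objective: reconstruction term minus the weighted KL term."""
    return recon_log_lik - warmup_beta(epoch, n_warmup) * kl


# Usage: with n_warmup = 200 the KL term is ignored at the start
# and fully counted from epoch 200 onwards.
start = warmup_objective(-90.0, 10.0, 0, 200)
end = warmup_objective(-90.0, 10.0, 200, 200)
```

At weight 0 the objective is the deterministic auto-encoder reconstruction term alone, which matches the description of starting from the reconstruction error only.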
[…] 32, 16, 8 and 4, goin[g …] all mappings using […] MLP's between x an[d …] [subse]quent layers were c[…] 64 and 32 for all co[…] [proba]bilistic ladder netwo[rk …] removing latent vari[ables …] sometimes refer to t[…] the four layer mode[l …] models were trained […] & Ba, 2014) optimiz[er …] reported test log-like[lihood …] (12) with 5000 impo[rtance …] et al. (2015). The mo[dels …] (Bastien et al., 2012[…] For MNIST we used […] mean of a Bernoulli […] $\max(x, 0.1x)$ as n[onlinearit…] were trained for 200[…] on the complete trai[ning …] $N_t = 200$. Simila[r …] [sam]ple the binarized tra[ining …] [im]ages using a Bernou[lli …]
¤ 3
¤ OMNIGLOT [Lake+ 13]
¤ NORB [LeCun+ 04]
¤ 64-32-16-8-4 5
¤ NN 2
¤ 1 512 2 256,128,64,32
¤ tanh leaky rectifiers
¤ k=5000 importance weighted
¤ VAE 2
¤ Batch normalization & warm-up
¤ probabilistic ladder network
Figure 3. MNIST test-set log-likelihood values for VAEs and the probabilistic ladder networks with different number of latent layers, Batch normalization (BN) and Warm-up (WU).
[Figure 4. MNIST train (full lines) an[d …] performance during training. The test set performance was estimated using 5000 importance weighted samples, providing a tighter bound than the training bound, explaining the better performance […]]
¤ MC importance weighted
¤ permutation invariant MNIST
¤ -82.90 [Burda+ 15]
¤ -81.90 [Tran+ 15]
Table 1. Fine-tuned test log-likelihood values for 5 layered VAE and probabilistic ladder networks trained on MNIST. ANN. LR: Annealed Learning rate, MC: Monte Carlo samples to approximate $\mathbb{E}_{q(\cdot)}[\cdot]$, IW: Importance weighted samples.
[column headers not preserved]
VAE            -82.14   -81.97   -81.84   -81.41   -81.30
PROB. LADDER   -81.87   -81.54   -81.46   -81.35   -81.20
[…] training in deep neural networks by normalizing the outputs from each layer. We show that batch normalization (abbreviated BN in tables and graphs), applied to all layers except the output layers, is essential for learning deep hierarchies of latent variables for $L > 2$.
¤ KL 0.01
¤ Batch normalization (BN)
¤ Warm-up (WU)
¤ Probabilistic ladder network
Table 2. Number of active latent units in five layer VAE and probabilistic ladder networks trained on MNIST. A unit was defined as active if $KL(q_{i,k}\,\|\,p_{i,k}) > 0.01$.
[column headers not preserved]
LAYER 1   20   20   34   46
LAYER 2    1    9   18   22
LAYER 3    0    3    6    8
LAYER 4    0    3    2    3
LAYER 5    0    2    1    2
TOTAL     21   37   61   81
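The activity criterion in Table 2 can be sketched as a threshold on a per-unit KL divergence. The sketch assumes scalar Gaussian posteriors and priors per unit and that each KL value has already been averaged over the data set; the excerpt does not spell out the averaging, and the function names are illustrative.

```python
import math


def unit_kl(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for one scalar latent unit."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p
                  - 1.0)


def count_active(kl_per_unit, threshold=0.01):
    """Number of units whose KL exceeds the activity threshold."""
    return sum(1 for kl in kl_per_unit if kl > threshold)


# Usage: a unit whose posterior equals its prior has KL 0 and counts as inactive.
n_active = count_active([unit_kl(0.0, 1.0, 0.0, 1.0), unit_kl(1.0, 1.0, 0.0, 1.0)])
```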
[…] importance weighted samples to 10 to reduce the variance in the approximation of the expectations in Eq. (10) and improve the inference model, respectively.
Models trained on the OMNIGLOT dataset, consisting of 28x28 binary images, were trained similar to above except that the number of training epochs was 1500.
Models trained on the NORB dataset, consisting of 32x32 gray-scale images with color-coding rescaled to [0, 1], […]
¤ NORB ladder
¤ tanh
… -tion model with mean and variance … linear and a softplus output layer … were similar to the models above … tangent was used as nonlinearities … rate was 0.002, Nt = 1000 and … epochs were 4000.
Table 3. Test set log-likelihood values for models trained on the OMNIGLOT and NORB datasets. The leftmost column shows the dataset and the number of latent variables in each model.
OMNIGLOT
64             -114.45  -108.79  -104.63
64-32          -112.60  -106.86  -102.03  -102.12
64-32-16       -112.13  -107.09  -101.60  -101.26
64-32-16-8     -112.49  -107.66  -101.68  -101.27
64-32-16-8-4   -112.10  -107.94  -101.86  -101.59
NORB
64              2630.8   3263.7   3481.5
64-32           2830.8   3140.1   3532.9   3522.7
64-32-16        2757.5   3247.3   3346.7   3458.7
64-32-16-8      2832.0   3302.3   3393.6   3499.4
64-32-16-8-4    3064.1   3258.7   3393.6   3430.3
tic layers. The performance of the vanilla VAE model did not improve with more than two layers of stochastic latent variables. Contrary to this, models trained with batch normalization and warm-up consistently increase the model performance for additional layers of stochastic latent variables. As expected the improvement in performance is decreasing for each additional layer, but we emphasize that the improvements are consistent even for the addition of
¤ KL
¤ KL 0
¤ BN WU Ladder
To study this effect we calculated the KL-divergence between q(z_{i,k}|z_{i-1,k}) and p(z_i|z_{i+1}) for each stochastic latent variable k during training as seen in Figure … term is zero if the inference model is independent of the data, i.e. q(z_{i,k}|z_{i-1,k}) = q(z_{i,k}), and hence collapsed
¤ BN
¤ Ladder
¤ KL
¤ VAE 2
¤ Ladder BN WU
structured high level latent representations that are likely useful for semi-supervised learning.
The hierarchical latent variable models used here allow highly flexible distributions of the lower layers conditioned on the layers above. We measure the divergence between these conditional distributions and the restrictive mean field approximation by calculating the KL-divergence between q(z_i|z_{i-1}) and a standard normal distribution for several models trained on MNIST, see Figure 6 a). As expected the lower layers have highly non (standard) Gaussian distributions when conditioned on the layers above. Interestingly the probabilistic ladder network seems to have more active intermediate layers than the VAE with batch normalization and warm-up. Again this might be explained by the deterministic upward pass easing flow of information to the intermediate and upper layers. We further note that the KL-divergence is approximately zero in the vanilla VAE model above the second layer, confirming the inactivity of these layers. Figure 6 b) shows generative samples from the probabilistic ladder network created by injecting
¤ Probabilistic ladder network
¤ VAE 3
¤ MNIST OMNIGLOT state-of-the-art
¤ Probabilistic ladder network
¤ Probabilistic ladder network
¤ …
¤ Lasagne Parmesan
¤ VAE Ladder RNN
¤ DL
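# Dense, Gaussian, Bernoulli and VAE_z_x come from the presenter's own library (not shown on the slide);
# rng, rseed, test_x and sample_z are assumed to be defined elsewhere.
import theano.tensor as T

# Recognition model (encoder) q(z|x): 784 -> 200 -> 50-dimensional Gaussian.
# The slide shows only the first entry of each hidden-layer list.
q_h = [Dense(rng, 28*28, 200, activation=T.nnet.relu)]
q_mean = [Dense(rng, 200, 50, activation=None)]
q_sigma = [Dense(rng, 200, 50, activation=T.nnet.softplus)]  # softplus keeps sigma positive
q = [Gaussian(q_h, q_mean, q_sigma)]

# Generative model (decoder) p(x|z): 50 -> 200 -> 784 Bernoulli means.
p_h = [Dense(rng, 50, 200, activation=T.nnet.relu)]
p_mean = [Dense(rng, 200, 28*28, activation=T.nnet.sigmoid)]
p = [Bernoulli(p_h, p_mean)]

model = VAE_z_x(q, p, k=1, alpha=1, random=rseed)
# Test log-likelihood estimated with 1000 importance weighted samples.
log_likelihood_test = model.log_likelihood_test(test_x, k=1000, mode='iw')
# Decode latent samples into mean pixel intensities.
sample_x = model.p_sample_mean_x(sample_z)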

(DL Hacks reading group) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

  • 1. How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks D2
  • 2. ¤ ICML 2016 ¤ Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, Ole Winther (Technical University of Denmark) ¤ Maaløe VAE ¤ VAE+GAN ¤ state-of-the-art ¤ VAE ¤ Ladder network VAE ¤ warm-up batch normalization ¤ VAE ¤
  • 5. Variational Autoencoder ¤ Variational Autoencoder [Kingma+ 13][Rezende+ 14] ¤ ¤ $q(z|x)$ ¤ $p(z|x)$ ¤ $x \sim p_\theta(x|z)$, $z \sim p_\theta(z)$, $q_\phi(z|x)$
  • 6. ¤ $p_\theta$ ¤ $L$ ¤ $q_\phi$
    The generative model $p_\theta$ is specified as follows:
    $p_\theta(x|z_1) = \mathcal{N}\left(x \mid \mu_\theta(z_1), \sigma^2_\theta(z_1)\right)$ (1) or
    $P_\theta(x|z_1) = \mathcal{B}\left(x \mid \mu_\theta(z_1)\right)$ (2)
    for continuous-valued (Gaussian $\mathcal{N}$) or binary-valued (Bernoulli $\mathcal{B}$) data, respectively. The latent variables $z$ are split into $L$ layers $z_i$, $i = 1 \ldots L$:
    $p_\theta(z_i|z_{i+1}) = \mathcal{N}\left(z_i \mid \mu_{\theta,i}(z_{i+1}), \sigma^2_{\theta,i}(z_{i+1})\right)$ (3)
    $p_\theta(z_L) = \mathcal{N}(z_L \mid 0, I)$ (4)
    The hierarchical specification allows the lower layers of the latent variables to be highly correlated but still maintain the computational efficiency of fully factorized models. Each layer in the inference model $q_\phi(z|x)$ is specified using a fully factorized Gaussian distribution:
    $q_\phi(z_1|x) = \mathcal{N}\left(z_1 \mid \mu_{\phi,1}(x), \sigma^2_{\phi,1}(x)\right)$ (5)
    $q_\phi(z_i|z_{i-1}) = \mathcal{N}\left(z_i \mid \mu_{\phi,i}(z_{i-1}), \sigma^2_{\phi,i}(z_{i-1})\right)$ (6)
    for $i = 2 \ldots L$.
  • 7. ¤ $\mathcal{N}(x \mid \mu, \sigma^2)$ ¤ $\mathcal{B}(x \mid \mu)$
    Functions $\mu(\cdot)$ and $\sigma^2(\cdot)$ in both the generative and the inference models are implemented as:
    $d(y) = \mathrm{MLP}(y)$ (7)
    $\mu(y) = \mathrm{Linear}(d(y))$ (8)
    $\sigma^2(y) = \mathrm{Softplus}(\mathrm{Linear}(d(y)))$ (9)
    where MLP is a two layered multilayer perceptron network, Linear is a single linear layer, and Softplus applies the $\log(1 + \exp(\cdot))$ nonlinearity to each component of its argument vector. In our notation, each MLP(·) or Linear(·) gives a new mapping with its own parameters, so the deterministic variable $d$ is used to mark that the MLP-part is shared between $\mu$ and $\sigma^2$ whereas the last Linear layer is not shared.
    Sigmoid
    Abstract: Variational autoencoders are a powerful framework for unsupervised learning. However, previous work has been restricted to shallow models with one or two layers of fully factorized stochastic latent variables, limiting the flexibility of the latent representation. We propose three advances in training algorithms of variational autoencoders, for the first time allowing to train deep models of up to five stochastic layers, (1) using a structure similar to the Ladder network as the inference model, (2) warm-up period to support stochastic units staying active in early training, and (3) use of batch normalization. Using these improvements we show state-of-the-art log-likelihood results for generative modeling on several benchmark datasets.
    1. Introduction: The recently introduced variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014) provides a framework for deep generative models (DGM). DGMs have later been shown to be a powerful framework for semi-supervised learning (Kingma et al., 2014; Maaloee ...
    Figure 1. Inference (or encoder/recognition) and generative (or decoder) models: a) VAE inference model, b) probabilistic ladder inference model and c) generative model.
    arXiv:1602.02282v1 [stat.ML]
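Equations (7) to (9) on slide 7 can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation: the helper names (`softplus`, `linear`, `gaussian_params`, `sample_gaussian`) are hypothetical, the shared features `d` are assumed to have been computed already by some MLP, and the weights are ordinary lists chosen by the caller.

```python
import math
import random


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def linear(weights, bias, inputs):
    """One linear layer: weights is a list of rows, bias a list of offsets."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def gaussian_params(d, w_mu, b_mu, w_var, b_var):
    """Eq. (8) and (9): two separate Linear heads on the shared features d."""
    mu = linear(w_mu, b_mu, d)
    var = [softplus(a) for a in linear(w_var, b_var, d)]
    return mu, var


def sample_gaussian(mu, var, rng):
    """Draw z = mu + sqrt(var) * eps with eps ~ N(0, 1) per component."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
```

Passing the variance head through Softplus keeps $\sigma^2$ non-negative without constraining the Linear layer itself, and sampling as `mu + sqrt(var) * eps` is the reparametrized form that lets gradients flow through the Gaussian latent variables.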
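  • 8. ¤ ¤ ¤
    $\log p(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathcal{L}(\theta, \phi; x)$
    $\nabla_{\theta,\phi} \mathcal{L}(\theta, \phi; x) = \nabla_{\theta,\phi} \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{\mathcal{N}(0,1)}\left[\nabla_{\theta,\phi} \log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_{\theta,\phi} \log \frac{p_\theta(x, z^{(m)})}{q_\phi(z^{(m)}|x)}$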
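  • 9. ¤ 2 ¤ A ¤ B KL ¤ B ¤ & ¤ IWAE A ¤ A
    (A) $\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$
    (B) $\mathcal{L}(\theta, \phi; x) = -KL[q_\phi(z|x) \,\|\, p_\theta(z)] + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$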
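The KL form of the bound on slide 9 is convenient because the KL term has a closed form when $q_\phi(z|x)$ is a diagonal Gaussian and the prior is $\mathcal{N}(0, I)$. The sketch below assumes exactly that single-layer setting; the function names are hypothetical, and the expected log-likelihood is taken as a number supplied by the caller (for example a Monte Carlo estimate).

```python
import math


def kl_diag_gaussian_to_standard_normal(mu, var):
    """KL[N(mu, diag(var)) || N(0, I)], summed over latent dimensions."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


def bound_kl_form(expected_log_likelihood, mu, var):
    """Lower bound in the KL form: -KL[q || p] + E_q[log p(x|z)]."""
    kl = kl_diag_gaussian_to_standard_normal(mu, var)
    return -kl + expected_log_likelihood
```

When the posterior equals the prior (`mu = 0`, `var = 1`) the KL term is zero, which is the state an inactive latent unit collapses to.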
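  • 10. ¤ ¤ A ¤ B KL ¤ B
    $p(x, z) = p(z_L)\, p(z_{L-1}|z_L) \cdots p(x|z_1)$
    $q(z|x) = q(z_1|x)\, q(z_2|z_1) \cdots q(z_L|z_{L-1})$
    $\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(z_L)\, p(z_{L-1}|z_L) \cdots p(x|z_1)}{q(z_1|x)\, q(z_2|z_1) \cdots q(z_L|z_{L-1})}\right]$
    $-KL[q_\phi(z|x) \,\|\, p_\theta(z)] + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] = -KL[q_\phi(z|x) \,\|\, p_\theta(z_L)] + \mathbb{E}_{q_\phi(z|x)}\left[\log p(z_{L-1}|z_L) \cdots p(x|z_1)\right]$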
  • 12. VAE ¤ VAE ¤ ¤ CVAE [Kingma+ 2014] ¤ Variational Fair Auto Encoder [Louizos+ 15] ¤ $s$ sensitive $p(x|s)$ MMD
    The inference network $q_\phi(z|x)$ (3) is used during training of the model using both the labelled and unlabelled data sets. This approximate posterior is then used as a feature extractor for the labelled data set, and the features used for training the classifier.
    3.1.2 Generative Semi-supervised Model Objective. For this model, we have two cases to consider. In the first case, the label corresponding to a data point is observed and the variational bound is a simple extension of equation (5):
    $\log p_\theta(x, y) \ge \mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(x|y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z|x, y)\right] = -\mathcal{L}(x, y)$ (6)
    For the case where the label is missing, it is treated as a latent variable over which we perform posterior inference and the resulting bound for handling data points with an unobserved label $y$ is:
    $\log p_\theta(x) \ge \mathbb{E}_{q_\phi(y,z|x)}\left[\log p_\theta(x|y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(y, z|x)\right] = \sum_y q_\phi(y|x)\left(-\mathcal{L}(x, y)\right) + \mathcal{H}(q_\phi(y|x)) = -\mathcal{U}(x)$ (7)
    The bound on the marginal likelihood for the entire dataset is now:
    $\mathcal{J} = \sum_{(x,y) \sim \tilde{p}_l} \mathcal{L}(x, y) + \sum_{x \sim \tilde{p}_u} \mathcal{U}(x)$ (8)
    The distribution $q_\phi(y|x)$ (4) for the missing labels has the form of a discriminative classifier, and we can use this knowledge to construct the best classifier possible as our inference model. This distribution is also used at test time for predictions of any unseen data. In the objective function (8), the label predictive distribution $q_\phi(y|x)$ contributes only to the second term relating to the unlabelled data, which is an undesirable property if we wish to use this distribution as a classifier. Ideally, all model and variational parameters should learn in all cases.
    Graph nodes: $x$, $z$, $s$
  • 13. ¤ Auxiliary Deep Generative Models [Maaløe+ 16] ¤ Auxiliary variables [Agakov+ 2004] ¤ ¤ state-of-the-art
    From "Auxiliary Deep Generative Models": Kingma (2013); Rezende et al. (2014) have coupled the approach of variational inference with deep learning giving rise to powerful probabilistic models constructed by an inference neural network $q(z|x)$ and a generative neural network $p(x|z)$. This approach can be perceived as a variational equivalent to the deep auto-encoder, in which $q(z|x)$ acts as the encoder and $p(x|z)$ the decoder. However, the difference is that these models ensure efficient inference over various continuous distributions in the latent space $z$ and complex input datasets $x$, where the posterior distribution $p(x|z)$ is intractable. Furthermore, the gradients of the variational upper bound are easily defined by backpropagation through the network(s). To keep the computational requirements low the variational distribution $q(z|x)$ is usually chosen to be a diagonal Gaussian, limiting the expressive power of the inference model.
    In this paper we propose a variational auxiliary variable approach (Agakov and Barber, 2004) to improve the variational distribution: The generative model is extended with variables $a$ to $p(x, z, a)$ such that the original model is invariant to marginalization over $a$: $p(x, z, a) = p(a|x, z)\,p(x, z)$. In the variational distribution, on the other hand, $a$ is used such that the marginal $q(z|x) = \int q(z|a, x)\,q(a|x)\,da$ is a general non-Gaussian distribution. This hierarchical specification allows the latent variables to be correlated through $a$, while maintaining the computational efficiency of fully factorized models (cf. Fig ...
    2.2. Auxiliary variables: We propose to extend the variational distribution with auxiliary variables $a$: $q(a, z|x) = q(z|a, x)\,q(a|x)$ such that the marginal distribution $q(z|x)$ can fit more complicated posteriors $p(z|x)$. In order to have an unchanged generative model, $p(x|z)$, it is required that the joint mode $p(x, z, a)$ gives back the original $p(x, z)$ under marginalization over $a$, thus $p(x, z, a) = p(a|x, z)\,p(x, z)$. Auxiliary variables are used in the EM algorithm and Gibbs sampling and have previously been considered for variational learning by Agakov and Barber (2004). Recently, Ranganath et al. (2015) has proposed to make the ...
    Figure 1. Probabilistic graphical model of the ADGM for semi-supervised learning: (a) Generative model P. (b) Inference model Q. The incoming joint connections to each variable are deep neural networks with parameters $\theta$ and $\phi$.
  • 15. Importance Weighted AE ¤ Importance Weighted Autoencoder [Burda+ 15] ¤ ¤ k ¤ Rényi [Li+ 16] ¤ $\alpha$
    $\log p(x) \ge \mathbb{E}_{q_\phi(z^{(1)}|x) \cdots q_\phi(z^{(k)}|x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p_\theta(x, z^{(i)})}{q_\phi(z^{(i)}|x)}\right] \ge \mathcal{L}(\theta, \phi; x)$
    From [Li+ 16]: Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Can variational free-energy approaches be generalised to the Rényi case? Consider approximating the true posterior $p(\theta|D)$ by minimizing Rényi's $\alpha$-divergence for some selected $\alpha \ge 0$:
    $q(\theta) = \arg\min_{q \in Q} D_\alpha[q(\theta) \,\|\, p(\theta|D)]$
    Now we verify the alternative optimization problem
    $q(\theta) = \arg\max_{q \in Q} \log p(D) - D_\alpha[q(\theta) \,\|\, p(\theta|D)]$
    When $\alpha \ne 1$, the objective can be rewritten as
    $\log p(D) - \frac{1}{\alpha - 1} \log \int q(\theta)^{\alpha}\, p(\theta|D)^{1-\alpha}\, d\theta = \log p(D) - \frac{1}{\alpha - 1} \log \mathbb{E}_q\left[\left(\frac{p(\theta, D)}{q(\theta)\, p(D)}\right)^{1-\alpha}\right] = \frac{1}{1 - \alpha} \log \mathbb{E}_q\left[\left(\frac{p(\theta, D)}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q; D)$
    We name this new objective the variational Rényi bound (VR). Importantly the following theorem is a direct result of Proposition 1. Theorem 1. The objective $\mathcal{L}_\alpha(q; D)$ is continuous and non-increasing on $\alpha \in [0, 1] \cup \{|\mathcal{L}_\alpha| < +\infty\}$.
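The importance weighted bound on slide 15 is usually evaluated in log space, since the ratios $p_\theta(x, z^{(i)}) / q_\phi(z^{(i)}|x)$ underflow easily. The following is a minimal sketch of a single-data-point estimate, assuming the caller has already drawn $k$ samples from $q_\phi(z|x)$ and computed the two log densities for each; the function names are hypothetical.

```python
import math


def log_mean_exp(log_weights):
    """Stable log((1/k) * sum_i exp(log_weights[i]))."""
    m = max(log_weights)
    total = sum(math.exp(lw - m) for lw in log_weights)
    return m + math.log(total / len(log_weights))


def importance_weighted_bound(log_joint, log_q):
    """Estimate of the k-sample bound from log p(x, z_i) and log q(z_i | x)."""
    log_weights = [lp - lq for lp, lq in zip(log_joint, log_q)]
    return log_mean_exp(log_weights)
```

With a single sample this reduces to the ordinary one-sample estimate of the bound, and by Jensen's inequality the log of the averaged weights is never below the average of the log weights.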
  • 16. ¤ Normalizing flows [Rezende+ 15] ¤ ¤ Variational Gaussian Process [Tran+ 15] ¤
    From "Variational Inference with Normalizing Flows": The transformed variable $z' = f(z)$ has the distribution
    $q(z') = q(z) \left|\det \frac{\partial f^{-1}}{\partial z'}\right| = q(z) \left|\det \frac{\partial f}{\partial z}\right|^{-1}$ (5)
    where the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying (5). The density $q_K(z)$ obtained by successively transforming a random variable $z_0$ with distribution $q_0$ through a chain of $K$ transformations $f_k$ is:
    $z_K = f_K \circ \ldots \circ f_2 \circ f_1(z_0)$ (6)
    $\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K} \ln \left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|$ (7)
    where equation (6) will be used throughout the paper as a shorthand for the composition $f_K(f_{K-1}(\ldots f_1(x)))$. The path traversed by the random variables $z_k = f_k(z_{k-1})$ with initial distribution $q_0(z_0)$ is called the flow and the path formed by the successive distributions $q_k$ is a normalizing flow. A property of such transformations, often referred to as the law of the unconscious statistician (LOTUS), is that expectations w.r.t. the transformed density $q_K$ can be computed without explicitly knowing $q_K$. Any expectation $\mathbb{E}_{q_K}[h(z)]$ can be written as an expectation under $q_0$ as:
    $\mathbb{E}_{q_K}[h(z)] = \mathbb{E}_{q_0}[h(f_K \circ f_{K-1} \circ \ldots \circ f_1(z_0))]$ (8)
    which does not require computation of the logdet-Jacobian terms when $h(z)$ does not depend on $q_K$. We can understand the effect of invertible flows as a sequence of expansions or contractions on the initial density. For an expansion, the map $z' = f(z)$ pulls the points $z$ away from a region in $\mathbb{R}^d$, reducing the density in that region while increasing the density outside the region.
    We therefore require normalizing flows that allow for low-cost computation of the determinant, or where the Jacobian is not needed at all.
    4.1. Invertible Linear-time Transformations: We consider a family of transformations of the form:
    $f(z) = z + u\, h(w^\top z + b)$ (10)
    where $\lambda = \{w \in \mathbb{R}^D, u \in \mathbb{R}^D, b \in \mathbb{R}\}$ are free parameters and $h(\cdot)$ is a smooth element-wise non-linearity, with derivative $h'(\cdot)$. For this mapping we can compute the logdet-Jacobian term in $O(D)$ time (using the matrix determinant lemma):
    $\psi(z) = h'(w^\top z + b)\, w$ (11)
    $\left|\det \frac{\partial f}{\partial z}\right| = \left|\det(I + u\, \psi(z)^\top)\right| = \left|1 + u^\top \psi(z)\right|$ (12)
    From (7) we conclude that the density $q_K(z)$ obtained by transforming an arbitrary initial density $q_0(z)$ through the sequence of maps $f_k$ of the form (10) is implicitly given by:
    $z_K = f_K \circ f_{K-1} \circ \ldots \circ f_1(z)$, $\ln q_K(z_K) = \ln q_0(z) - \sum_{k=1}^{K} \ln \left|1 + u_k^\top \psi_k(z_{k-1})\right|$ with $z_0 = z$ (13)
    The flow defined by the transformation (13) modifies the initial density $q_0$ by applying a series of contractions and expansions in the direction perpendicular to the hyperplane $w^\top z + b = 0$, hence we refer to these maps as planar flows. As an alternative, we can consider a family of transformations that modify an initial density $q_0$ around a reference point $z_0$. The transformation family is:
    $f(z) = z + \beta\, h(\alpha, r)(z - z_0)$ (14)
    4.2. Flow-Based Free Energy Bound: If we parameterize the approximate posterior distribution with a flow of length $K$, $q_\phi(z|x) := q_K(z_K)$, the free energy (3) can be written as an expectation over the initial distribution $q_0(z)$:
    $\mathcal{F}(x) = \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p(x, z)] = \mathbb{E}_{q_0(z_0)}[\ln q_K(z_K) - \log p(x, z_K)]$
    Figure 1. Effect of normalizing flow on two distributions (panels: planar and radial flows applied to $q_0$ for K=1, K=2, K=10; unit Gaussian and uniform initial densities).
    Figure 2. Inference and generative models. Left: Inference network maps the observations to the parameters of the flow; Right: generative model which receives the posterior samples from the inference network during training time. Round containers represent layers of stochastic variables whereas square containers represent deterministic layers.
    From [Tran+ 15] (under review as a conference paper at ICLR 2016): Figure 1: (a) Graphical model of the variational Gaussian process (variational model). The VGP generates samples of latent variables $z$ by evaluating random non-linear mappings of latent inputs $\xi$, and then drawing mean-field samples parameterized by the mapping. These latent variables aim to follow the posterior distribution for a generative model (b), conditioned on data $x$.
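A single planar flow step from Eqs. (10) to (12) on slide 16 can be sketched as follows. This is only an illustration: the nonlinearity is assumed to be tanh, the function name is hypothetical, and the invertibility condition on the parameters (for tanh, $w^\top u \ge -1$) is not enforced.

```python
import math


def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b) and return (f(z), log|det df/dz|)."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    h_prime = 1.0 - h * h  # derivative of tanh at a
    f = [zi + ui * h for zi, ui in zip(z, u)]
    # psi(z) = h'(w.z + b) * w, so u.psi(z) = h'(a) * (u.w)
    u_dot_psi = h_prime * sum(ui * wi for ui, wi in zip(u, w))
    return f, math.log(abs(1.0 + u_dot_psi))
```

Chaining $K$ such steps and subtracting each returned log-determinant from $\ln q_0(z_0)$ gives the log density of the final sample, and the cost per step is linear in the dimension because no matrix determinant is formed.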
  • 19. ¤ 1 2 ¤ ¤ 3 ¤ ¤ ¤
  • 20. ¤ Ladder network [Valpola, 14][Rasmus+ 15] probabilistic ladder network ¤ ladder network ¤
    Figure 2. Flow of information in the inference and generative models of a) probabilistic ladder network and b) VAE. The probabilistic ladder network allows direct integration (+ in figure, see Eq. (21)) of bottom-up and top-down information in the inference model. In the VAE the top-down information is incorporated indirectly through the conditional priors in the generative model.
    Figure labels: "Likelihood", "Posterior", "Prior", "Copy"; deterministic bottom-up pathway; stochastic top-down pathway; top-down pathway through KL-divergences in generative model; bottom-up pathway in inference model; indirect top-down information through prior; direct flow of information; generative model.
    Probabilistic ladder network VAE bottom-up top-down top-down
  • 21. Top-down ¤ bottom-up ¤ top-down
    $\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p(z_L)\, p(z_{L-1}|z_L) \cdots p(x|z_1)}{q(z_1|x)\, q(z_2|z_1) \cdots q(z_L|z_{L-1})}\right]$
    $z_1 \sim q(z_1|x)$, $z_2 \sim q(z_2|z_1)$
    $p(z_1|z_2)$, $p(z_2)$, $p(x|z_1)$
  • 22. ¤ ¤ ¤
    The variational principle provides a tractable lower bound on the log likelihood which can be used as a training criterion $\mathcal{L}$:
    $\log p(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathcal{L}(\theta, \phi; x)$ (10)
    $= -KL(q_\phi(z|x) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$ (11)
    where KL is the Kullback-Leibler divergence. The bound on the likelihood may be tightened with an increased number of samples by using the importance weighted bound (Burda et al., 2015):
    $\log p(x) \ge \mathbb{E}_{q_\phi(z^{(1)}|x)} \cdots \mathbb{E}_{q_\phi(z^{(K)}|x)}\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x, z^{(k)})}{q_\phi(z^{(k)}|x)}\right]$ (12)
    The parameters $\theta$ and $\phi$ are optimized through Eq. (11) using stochastic gradient methods with the reparametrization trick through the Gaussian latent variables (Kingma & Welling, 2013; Rezende et al., 2014).
    A structure similar to the Ladder network is used as the inference model of a VAE, as shown in Figure 1. The generative model is the same as before. The inference is constructed to first make a deterministic upward pass:
    $d_1 = \mathrm{MLP}(x)$ (13)
    $\mu_{d,i} = \mathrm{Linear}(d_i)$, $i = 1 \ldots L$ (14)
    $\sigma^2_{d,i} = \mathrm{Softplus}(\mathrm{Linear}(d_i))$, $i = 1 \ldots L$ (15)
    $d_i = \mathrm{MLP}(\mu_{d,i-1})$, $i = 2 \ldots L$ (16)
    followed by a stochastic downward pass:
    $q_\phi(z_L|x) = \mathcal{N}\left(\mu_{d,L}, \sigma^2_{d,L}\right)$ (17)
    $t_i = \mathrm{MLP}(z_{i+1})$, $i = 1 \ldots L-1$ (18)
    $\mu_{t,i} = \mathrm{Linear}(t_i)$ (19)
    $\sigma^2_{t,i} = \mathrm{Softplus}(\mathrm{Linear}(t_i))$ (20)
    $q_\theta(z_i|z_{i+1}, x) = \mathcal{N}\left(\frac{\mu_{t,i}\, \sigma^{-2}_{t,i} + \mu_{d,i}\, \sigma^{-2}_{d,i}}{\sigma^{-2}_{t,i} + \sigma^{-2}_{d,i}}, \frac{1}{\sigma^{-2}_{t,i} + \sigma^{-2}_{d,i}}\right)$ (21)
    PRML
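Eq. (21) on slide 22 merges the bottom-up estimate $(\mu_{d,i}, \sigma^2_{d,i})$ and the top-down estimate $(\mu_{t,i}, \sigma^2_{t,i})$ by weighting each mean with its precision, which is the standard product of two Gaussians. A minimal per-dimension sketch, with a hypothetical function name and plain lists standing in for the layer outputs:

```python
def precision_weighted_merge(mu_t, var_t, mu_d, var_d):
    """Combine top-down (t) and bottom-up (d) Gaussian estimates as in Eq. (21)."""
    merged_mu, merged_var = [], []
    for mt, vt, md, vd in zip(mu_t, var_t, mu_d, var_d):
        prec_t, prec_d = 1.0 / vt, 1.0 / vd
        var = 1.0 / (prec_t + prec_d)  # merged variance
        merged_var.append(var)
        merged_mu.append((mt * prec_t + md * prec_d) * var)  # merged mean
    return merged_mu, merged_var
```

The merged variance is always smaller than either input variance, and the merged mean moves toward whichever pathway is more certain.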
  • 24. ¤ B ¤ [MacKay 01] ¤ ^ ¤ ¤ ¤
    $\mathcal{L}(\theta, \phi; x) = -KL[q_\phi(z|x) \,\|\, p_\theta(z)] + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$
    The probabilistic ladder network provides a framework with the wanted interaction, while keeping complications manageable. A further extension could be to make the inference in k steps over an iterative inference procedure (Raiko et al., 2014).
    2.2. Warm-up from deterministic to variational autoencoder: The variational training criterion in Eq. (11) contains the reconstruction term $p_\theta(x|z)$ and the variational regularization term. The variational regularization term causes some of the latent units to become inactive during training (MacKay, 2001) because the approximate posterior for unit $k$, $q(z_{i,k}| \ldots)$, is regularized towards its own prior $p(z_{i,k}| \ldots)$, a phenomenon also recognized in the VAE setting (Burda et al., 2015). This can be seen as a virtue of automatic relevance determination, but also as a problem when many units are pruned away early in training before they learned a useful representation. We observed that such units remain inactive for the rest of the training, presumably trapped in a local minimum or saddle point at $KL(q_{i,k} \| p_{i,k}) \approx 0$, with the optimization algorithm unable to re-activate them.
  • 25. ¤ ¤ ¤ →Warm-up ¤ Batch Normalization [Ioffe+ 15] ¤ ¤ 2
    We propose to alleviate the problem by initializing training using the reconstruction error only (corresponding to training a standard deterministic auto-encoder), and then gradually introducing the variational regularization term:
    $\mathcal{L}(\theta, \phi; x)_T = -\beta\, KL(q_\phi(z|x) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$ (22)
    where $\beta$ is increased linearly from 0 to 1 during the first $N_t$ epochs of training. We denote this scheme warm-up (abbreviated WU in tables and graphs) because the objective starts from a delta-function solution (corresponding to zero temperature) and then moves towards the fully stochastic variational objective. A similar idea has previously been considered in Raiko et al. (2007, Section 6.2), where it was used for Bayesian models trained with a coordinate descent algorithm.
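The warm-up scheme on slide 25 amounts to a linear ramp on the weight of the KL term. The sketch below assumes the ramp is updated once per epoch and starts at 0 for epoch 0; both choices, and the function names, are illustrative rather than taken from the paper.

```python
def warmup_beta(epoch, n_warmup_epochs):
    """Linear ramp from 0 to 1 over the first n_warmup_epochs epochs."""
    if n_warmup_epochs <= 0:
        return 1.0  # no warm-up: use the full variational objective
    return min(1.0, max(0.0, epoch / n_warmup_epochs))


def warmup_objective(expected_log_likelihood, kl, beta):
    """Eq. (22): -beta * KL + E_q[log p(x|z)], to be maximized."""
    return -beta * kl + expected_log_likelihood
```

At `beta = 0` the objective is the reconstruction term alone, so latent units are not pushed towards the prior before they have learned anything useful; at `beta = 1` it is the ordinary lower bound.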
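  • 26. ¤ 3 datasets ¤ MNIST ¤ OMNIGLOT [Lake+ 13] ¤ NORB [LeCun+ 04] ¤ Latent layer sizes 64-32-16-8-4 (5 layers) ¤ Mappings between layers: 2-layer NNs ¤ Hidden units: 512 for layer 1; 256, 128, 64, 32 for layers 2 and up ¤ Nonlinearities: tanh, leaky rectifiers ¤ Test log-likelihood: k=5000 importance weighted samples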
  • 27. MNIST ¤ VAE 2 ¤ Batch normalization & warm-up ¤ probabilistic ladder network
    Figure 3. MNIST test-set log-likelihood values for VAEs and the probabilistic ladder networks with different numbers of latent layers, batch normalization (BN) and warm-up (WU).
    The variational principle provides a tractable lower bound on the log likelihood which can be used as a training criterion $\mathcal{L}$:
    $\log p(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathcal{L}(\theta, \phi; x)$ (10)
    Figure 4. MNIST train (full lines) and … performance during training (x-axis: epoch; y-axis: $\mathcal{L}(x)$; curves include VAE and VAE+BN). The test set … using 5000 importance weighted samples …
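The test-set numbers are reported with importance weighted samples rather than the plain bound of Eq. (10). A minimal sketch of how k log importance weights, $\log w_i = \log p_\theta(x, z_i) - \log q_\phi(z_i|x)$, are typically combined is given below; the function name `log_mean_exp` and the example weights are hypothetical, and the max-shift is just the usual numerical-stability trick, not something stated on the slide.

```python
import math


def log_mean_exp(log_weights):
    """Numerically stable log( (1/k) * sum_i exp(log_w_i) )."""
    m = max(log_weights)  # shift by the max so exp() cannot overflow
    total = sum(math.exp(lw - m) for lw in log_weights)
    return m + math.log(total / len(log_weights))


# Hypothetical log importance weights for one data point (k = 3).
log_w = [-82.0, -80.0, -85.0]

plain_bound = sum(log_w) / len(log_w)  # average of log w: Eq. (10)-style estimate
iw_bound = log_mean_exp(log_w)         # log of average w: importance weighted estimate

# By Jensen's inequality the importance weighted estimate is never lower.
print(plain_bound, iw_bound)
```

With k = 1 the two estimates coincide, which is why the importance weighted bound is usually described as a tightening of the standard one.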
  • 28. ¤ MC importance weighted IW ¤ MC IW ¤ permutation invariant MNIST ¤ -82.90 [Burda+ 15] ¤ -81.90 [Tran+ 15]
    Table 1. Fine-tuned test log-likelihood values for 5-layered VAE and probabilistic ladder networks trained on MNIST. ANN. LR: annealed learning rate; MC: Monte Carlo samples to approximate $\mathbb{E}_{q(\cdot)}[\cdot]$; IW: importance weighted samples.
    Finetuning                 VAE       Prob. ladder
    None                       -82.14    -81.87
    Ann. LR, MC=1, IW=1        -81.97    -81.54
    Ann. LR, MC=10, IW=1       -81.84    -81.46
    Ann. LR, MC=1, IW=10       -81.41    -81.35
    Ann. LR, MC=10, IW=10      -81.30    -81.20
    … training in deep neural networks by normalizing the outputs from each layer. We show that batch normalization (abbreviated BN in tables and graphs), applied to all layers except the output layers, is essential for learning deep hierarchies of latent variables for L > 2.
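What "normalizing the outputs from each layer" means can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: it only standardizes each feature over a mini-batch and leaves out the learned scale and shift parameters and the running statistics that batch normalization uses at test time. The name `batch_normalize` and the `eps` value are illustrative choices.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Standardize each feature (column) of a mini-batch to ~zero mean, ~unit variance.

    batch: list of rows, each row a list of activations for one example.
    """
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(n_features)]
        for row in batch
    ]


# Two examples, two features on very different scales.
activations = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_normalize(activations)
print(normalized)  # both columns end up on a comparable scale
```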
  • 29. ¤ KL 0.01 ¤ VAE ¤ Batch normalization (BN) ¤ Warm-up (WU) ¤ Probabilistic ladder network
    Table 2. Number of active latent units in five-layer VAE and probabilistic ladder networks trained on MNIST. A unit was defined as active if $KL(q_{i,k} \| p_{i,k}) > 0.01$.
               VAE    VAE+BN    VAE+BN+WU    Prob. ladder+BN+WU
    Layer 1    20     20        34           46
    Layer 2    1      9         18           22
    Layer 3    0      3         6            8
    Layer 4    0      3         2            3
    Layer 5    0      2         1            2
    Total      21     37        61           81
    … importance weighted samples to 10 to reduce the variance in the approximation of the expectations in Eq. (10) and improve the inference model, respectively.
    Models trained on the OMNIGLOT dataset, consisting of 28x28 binary images, were trained similarly to above except that the number of training epochs was 1500.
    Models trained on the NORB dataset, consisting of 32x32 gray-scale images with color-coding rescaled to [0, 1], …
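The activity criterion of Table 2 is a simple threshold on the per-unit KL divergence. A minimal sketch is shown below; the function name `count_active_units` and the KL values are hypothetical placeholders (they are not the values behind Table 2), and the per-unit KL is assumed to have already been averaged over the data.

```python
def count_active_units(kl_per_unit, threshold=0.01):
    """Count latent units whose KL(q || p) exceeds the activity threshold."""
    return sum(1 for kl in kl_per_unit if kl > threshold)


# Hypothetical per-unit KL values for two latent layers.
kl_by_layer = {
    "layer 1": [0.5, 0.0, 0.02, 0.01, 0.003],
    "layer 2": [0.0, 0.0, 0.2],
}

active = {name: count_active_units(kls) for name, kls in kl_by_layer.items()}
total_active = sum(active.values())
print(active, total_active)
```

Note that the comparison is strict, so a unit sitting exactly at 0.01 counts as inactive, matching the "> 0.01" wording of the table caption.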
  • 30. OMNIGLOT NORB ¤ BN WU ¤ NORB ladder ¤ tanh
    … hyperbolic tangent was used as nonlinearities, … rate was 0.002, $N_t$ = 1000 and … epochs were 4000.
    Table 3. Test set log-likelihood values for models trained on the OMNIGLOT and NORB datasets. The leftmost column shows the dataset and the number of latent variables in each model.
                    VAE        VAE+BN     VAE+BN+WU    Prob. ladder+BN+WU
    OMNIGLOT
    64              -114.45    -108.79    -104.63
    64-32           -112.60    -106.86    -102.03      -102.12
    64-32-16        -112.13    -107.09    -101.60      -101.26
    64-32-16-8      -112.49    -107.66    -101.68      -101.27
    64-32-16-8-4    -112.10    -107.94    -101.86      -101.59
    NORB
    64              2630.8     3263.7     3481.5
    64-32           2830.8     3140.1     3532.9       3522.7
    64-32-16        2757.5     3247.3     3346.7       3458.7
    64-32-16-8      2832.0     3302.3     3393.6       3499.4
    64-32-16-8-4    3064.1     3258.7     3393.6       3430.3
    The performance of the vanilla VAE model did not improve with more than two layers of stochastic latent variables. Contrary to this, models trained with batch normalization and warm-up consistently increase the model performance for additional layers of stochastic latent variables. As expected, the improvement in performance decreases for each additional layer, but we emphasize that the improvements are consistent even for the addition of …
  • 31. ¤ VAE ¤ KL ¤ KL 0 ¤ BN WU Ladder
    To study this effect we calculated the KL-divergence between $q(z_{i,k}|z_{i-1,k})$ and $p(z_i|z_{i+1})$ for each stochastic latent variable k during training as seen in Figure … term is zero if the inference model is independent of the data, i.e. $q(z_{i,k}|z_{i-1,k}) = q(z_{i,k})$, and hence collapsed …
  • 32. ¤ PCA ¤ BN ¤ Ladder
  • 33. ¤ KL ¤ VAE 2 ¤ Ladder BN WU
    … structured high level latent representations that are likely useful for semi-supervised learning.
    The hierarchical latent variable models used here allow highly flexible distributions of the lower layers conditioned on the layers above. We measure the divergence between these conditional distributions and the restrictive mean field approximation by calculating the KL-divergence between $q(z_i|z_{i-1})$ and a standard normal distribution for several models trained on MNIST, see Figure 6 a). As expected, the lower layers have highly non-(standard) Gaussian distributions when conditioned on the layers above. Interestingly, the probabilistic ladder network seems to have more active intermediate layers than the VAE with batch normalization and warm-up. Again this might be explained by the deterministic upward pass easing the flow of information to the intermediate and upper layers. We further note that the KL-divergence is approximately zero in the vanilla VAE model above the second layer, confirming the inactivity of these layers. Figure 6 b) shows generative samples from the probabilistic ladder network created by injecting …
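The measurement described above, the KL-divergence between a layer's Gaussian $q(z_i|z_{i-1})$ and a standard normal, has a simple closed form when q is fully factorized. The sketch below assumes a diagonal Gaussian parameterized by per-unit means and standard deviations; the function name `kl_to_standard_normal` and the example values are hypothetical.

```python
import math


def kl_to_standard_normal(means, stds):
    """KL( N(mean, std^2) || N(0, 1) ), summed over independent units."""
    return sum(
        0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))
        for mu, sigma in zip(means, stds)
    )


# A unit that equals the standard normal contributes nothing ...
print(kl_to_standard_normal([0.0], [1.0]))
# ... while a shifted, sharper unit contributes a clearly positive amount.
print(kl_to_standard_normal([1.0], [0.5]))
```

A layer whose summed value stays near zero would correspond to the "inactive" upper layers reported for the vanilla VAE.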
  • 36. ¤ VAE 3 ¤ ¤ ¤ ¤ ¤ MNIST OMNIGLOT state-of-the-art ¤ Probabilistic ladder network ¤ Probabilistic ladder network
  • 38. VAE ¤ ¤ VAE ¤ ¤ … ¤ →NN ¤ VAE ¤ Lasagne Parmesan ¤ ¤
  • 39. VAE ¤ VAE ¤ VAE Ladder RNN ¤ DL
    import theano.tensor as T  # assumed: T on the slide is theano.tensor

    # Dense, Gaussian, Bernoulli, VAE_z_x, rng, rseed, train_x, test_x and sample_z
    # are not defined on the slide and are assumed to be provided elsewhere.

    # recognition model (encoder)
    q_h = [Dense(rng, 28*28, 200, activation=T.nnet.relu),
           Dense(rng, 200, 200, activation=T.nnet.relu)]
    q_mean = [Dense(rng, 200, 50, activation=None)]
    q_sigma = [Dense(rng, 200, 50, activation=T.nnet.softplus)]
    q = [Gaussian(q_h, q_mean, q_sigma)]

    # generative model (decoder)
    p_h = [Dense(rng, 50, 200, activation=T.nnet.relu),
           Dense(rng, 200, 200, activation=T.nnet.relu)]
    p_mean = [Dense(rng, 200, 28*28, activation=T.nnet.sigmoid)]
    p = [Bernoulli(p_h, p_mean)]

    # VAE
    model = VAE_z_x(q, p, k=1, alpha=1, random=rseed)

    # training
    model.train(train_x)

    # test log-likelihood (k=1000, mode='iw')
    log_likelihood_test = model.log_likelihood_test(test_x, k=1000, mode='iw')

    # mean of x given sampled z
    sample_x = model.p_sample_mean_x(sample_z)
    Renyi CNN Lasagne