2 / 28
Why unsupervised/generative?
Intelligence existed before labels did.
Unsupervised = cake, supervised = icing, RL = cherry. (LeCun)
A human brain has $10^{15}$ connections and lives for $10^9$ seconds. (Hinton)
"What I cannot create, I do not understand." (Feynman)
3 / 28
A latent variable model
[Graphical model: observed x1, x2, x3, x4, each generated from its own latent z1, z2, z3, z4]
4 / 28
A latent variable model
plate notation
[Graphical model: latent z pointing to observed x, inside a plate of size N]
5 / 28
Learning a generative model
Assume z ∼ p(z) and x ∼ p(x|z; θ). A good generative model is one that
generates real-looking data, so let's find
\[
\theta^* = \arg\max_\theta \, p(x_{\mathrm{real}}; \theta) \tag{1}
\]
6 / 28
Learning a generative model
First attempt
Maybe we can find θ* using gradient ascent on something that increases with p(x; θ)?
\begin{align}
\theta^* = \arg\max_\theta \, p(x; \theta) &= \arg\max_\theta \, \log p(x; \theta) \tag{2} \\
&= \arg\max_\theta \, \log \int p(z)\, p(x|z; \theta)\, dz \tag{3} \\
&= \arg\max_\theta \, \log \mathbb{E}_{p(z)}\left[ p(x|z; \theta) \right] \tag{4}
\end{align}
Even evaluating p(x; θ) requires an integral; a Monte Carlo estimate of the
expectation is easy, but plugging it into the log gives a biased estimate.
7 / 28
Learning a generative model
The ELBO
Let's give up on optimizing p(x; θ) directly. Let q(z|x) be anything we can
sample from. By Jensen's inequality (omitting parameters),
\begin{align}
\log p(x) &= \log \mathbb{E}_{p(z)}\left[ p(x|z) \right] \tag{5} \\
&= \log \mathbb{E}_{q(z|x)}\left[ \frac{p(z)\, p(x|z)}{q(z|x)} \right] = \log \mathbb{E}_{q(z|x)}\left[ \frac{p(x, z)}{q(z|x)} \right] \tag{6} \\
&\geq \mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{q(z|x)} \right] \tag{7}
\end{align}
with equality when q(z|x) = p(z|x) everywhere. This quantity is called the ELBO
(Evidence Lower BOund). We assumed we can sample z ∼ q(z|x), so we can get
Monte Carlo estimates of the ELBO.
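The bound in (7) can be checked numerically. A minimal sketch with a discrete latent z ∈ {0, 1} and made-up probabilities (all numbers are illustrative): any valid q gives ELBO ≤ log p(x), with equality at the true posterior.

```python
import math

# Toy discrete model: z in {0, 1}, a single fixed observation x.
p_z = [0.4, 0.6]          # prior p(z)
p_x_given_z = [0.1, 0.7]  # likelihood p(x|z) for the fixed x

# Evidence p(x) = sum_z p(z) p(x|z)
p_x = sum(pz * px for pz, px in zip(p_z, p_x_given_z))

def elbo(q):
    """E_q[log p(x, z) - log q(z|x)] for a discrete q."""
    return sum(qz * (math.log(p_z[z] * p_x_given_z[z]) - math.log(qz))
               for z, qz in enumerate(q))

# Any valid q gives a lower bound on log p(x) ...
assert elbo([0.5, 0.5]) <= math.log(p_x)

# ... with equality exactly when q(z|x) = p(z|x).
posterior = [p_z[z] * p_x_given_z[z] / p_x for z in range(2)]
assert abs(elbo(posterior) - math.log(p_x)) < 1e-9
```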
8 / 28
Variational Autoencoders
[Graphical model: latent z pointing to observed x, inside a plate of size N]
Networks p(x|z; θ), q(z|x; φ).
Maximize \(\mathrm{ELBO} = \log \frac{p(z)\, p(x|z; \theta)}{q(z|x; \phi)}\), where z ∼ q(z|x; φ).
"Auto-encoding variational bayes" by Kingma and Welling;
"Stochastic backpropagation and approximate inference in deep generative models" by Rezende, Mohamed, and Wierstra
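The recipe on this slide can be sketched in a few lines. This is not the training loop from the papers, just a single-sample ELBO estimate with the reparameterization trick, using random linear maps as stand-ins for the encoder and decoder networks (all shapes and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logpdf(v, mu, log_var):
    """Sum of elementwise log N(v; mu, exp(log_var))."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi) + log_var
                                + (v - mu) ** 2 / np.exp(log_var))))

d_x, d_z = 4, 2
W_enc = rng.normal(size=(2 * d_z, d_x))  # stand-in encoder: outputs [mu_q, log_var_q]
W_dec = rng.normal(size=(d_x, d_z))      # stand-in decoder: outputs mean of p(x|z)

x = rng.normal(size=d_x)

# q(z|x; phi): reparameterize z = mu + sigma * eps so gradients could flow.
h = W_enc @ x
mu_q, log_var_q = h[:d_z], h[d_z:]
eps = rng.normal(size=d_z)
z = mu_q + np.exp(0.5 * log_var_q) * eps

# Single-sample ELBO estimate: log p(z) + log p(x|z) - log q(z|x).
log_p_z = gaussian_logpdf(z, np.zeros(d_z), np.zeros(d_z))   # p(z) = N(0, I)
log_p_x_given_z = gaussian_logpdf(x, W_dec @ z, np.zeros(d_x))
log_q_z_given_x = gaussian_logpdf(z, mu_q, log_var_q)
elbo_estimate = log_p_z + log_p_x_given_z - log_q_z_given_x
```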
9 / 28
Variational Autoencoders
[Graphical model: latent z pointing to observed x, inside a plate of size N]
Networks p(x|z; θ), q(z|x; φ).
Maximize \(\mathrm{ELBO} = \log \frac{p(z)\, p(x|z; \theta)}{q(z|x; \phi)}\), where z ∼ q(z|x; φ).
What does this loss function mean?
10 / 28
Interpretations of ELBO
1. Lower bound of evidence
We have already shown that the ELBO is a lower bound on the evidence:
\[
\log p(x) \geq \mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{q(z|x)} \right] = \mathrm{ELBO} \tag{8}
\]
Thus, we can view optimizing the ELBO as approximately optimizing p(x):
\[
\arg\max_\theta \, p(x) \approx \arg\max_\theta \, \mathrm{ELBO} \tag{9}
\]
11 / 28
Interpretations of ELBO
2. Distance to posterior
Let's take a closer look at the gap between log p(x) and the ELBO:
\begin{align}
\log p(x) = \log \frac{p(x, z)}{p(z|x)} &= \mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{p(z|x)} \right] \tag{10} \\
&= \mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{q(z|x)} \right] + D_{\mathrm{KL}}\left( q(z|x) \,\|\, p(z|x) \right) \tag{11} \\
&= \mathrm{ELBO} + D_{\mathrm{KL}}\left( q(z|x) \,\|\, p(z|x) \right) \tag{12}
\end{align}
From the point of view of the inference network, maximizing the ELBO is
equivalent to minimizing the KL divergence to the posterior:
\[
\arg\min_\phi \, D_{\mathrm{KL}}\left( q(z|x) \,\|\, p(z|x) \right) = \arg\max_\phi \, \mathrm{ELBO} \tag{13}
\]
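Equation (12) is an exact identity, which a small discrete example can verify (probabilities are made up for illustration):

```python
import math

# Toy model: discrete z in {0, 1}, one fixed observation x.
p_z = [0.3, 0.7]
p_x_given_z = [0.2, 0.05]
p_x = sum(pz * px for pz, px in zip(p_z, p_x_given_z))
posterior = [p_z[z] * p_x_given_z[z] / p_x for z in range(2)]

q = [0.6, 0.4]  # an arbitrary variational distribution

elbo = sum(qz * (math.log(p_z[z] * p_x_given_z[z]) - math.log(qz))
           for z, qz in enumerate(q))
kl_q_posterior = sum(qz * (math.log(qz) - math.log(posterior[z]))
                     for z, qz in enumerate(q))

# Equation (12): log p(x) = ELBO + KL(q(z|x) || p(z|x)), exactly.
assert abs(math.log(p_x) - (elbo + kl_q_posterior)) < 1e-9
```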
12 / 28
Interpretations of ELBO
3. Autoencoder
\begin{align}
\mathrm{ELBO} &= \mathbb{E}_{q(z|x)}\left[ \log p(x, z) - \log q(z|x) \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{E}_{q(z_n|x_n)}\left[ \log p(x_n|z_n) \right] - D_{\mathrm{KL}}\left( q(z_n|x_n) \,\|\, p(z) \right) \right)
\end{align}
This can be used as a loss function if the KL divergence term can be computed
analytically.
For each datapoint x_n, we make the model reconstruct x_n while keeping each
embedding z_n close to the prior p(z).
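For the common choice of a diagonal-Gaussian q(z|x) and a standard-normal prior, the KL term is available in closed form. A minimal sketch of that standard formula (the inputs are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over dimensions -- the regularizer in the VAE loss."""
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

# KL is zero iff q already equals the prior ...
assert kl_to_standard_normal(np.zeros(3), np.zeros(3)) == 0.0
# ... and positive otherwise.
assert kl_to_standard_normal(np.array([1.0, -2.0]), np.array([0.5, 0.0])) > 0.0
```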
13 / 28
Interpretations of ELBO
3. Autoencoder
Write \(\bar{q}(z) = \frac{1}{N} \sum_{n=1}^{N} q(z_n)\) for the aggregate posterior. Then
\begin{align}
\frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left( q(z_n) \,\|\, p(z) \right)
&= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{q(z_n)}\left[ \log q(z_n) - \log \bar{q}(z) + \log \bar{q}(z) - \log p(z) \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left( q(z_n) \,\|\, \bar{q}(z) \right) + \mathbb{E}_{\bar{q}(z)}\left[ \log \bar{q}(z) - \log p(z) \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left( q(z_n) \,\|\, \bar{q}(z) \right) + D_{\mathrm{KL}}\left( \bar{q}(z) \,\|\, p(z) \right)
\end{align}
so
\[
\mathrm{ELBO} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{E}_{q(z_n)}\left[ \log p(x_n|z_n) \right] - D_{\mathrm{KL}}\left( q(z_n) \,\|\, \bar{q}(z) \right) \right) - D_{\mathrm{KL}}\left( \bar{q}(z) \,\|\, p(z) \right)
\]
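The decomposition above is exact for any set of per-datapoint posteriors, which can be checked on a discrete toy example (numbers illustrative):

```python
import math

def kl(a, b):
    """KL divergence between two discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# Per-datapoint posteriors q(z_n) over a discrete z, and the prior p(z).
qs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
p = [0.5, 0.5]
N = len(qs)

# Aggregate posterior: the average of the per-datapoint posteriors.
q_bar = [sum(q[z] for q in qs) / N for z in range(2)]

# avg_n KL(q_n || p) = avg_n KL(q_n || q_bar) + KL(q_bar || p), exactly.
lhs = sum(kl(q, p) for q in qs) / N
rhs = sum(kl(q, q_bar) for q in qs) / N + kl(q_bar, p)
assert abs(lhs - rhs) < 1e-9
```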
14 / 28
Interpretations of ELBO
4. Free energy
\begin{align}
\mathrm{ELBO} &= \mathbb{E}_{q(z)}\left[ \log p(x, z) - \log q(z) \right] \\
&= -\mathbb{E}_{q(z)}\left[ -\log p(x, z) \right] + H(q(z))
\end{align}
This is composed of a negative energy term plus the entropy of the distribution
over states.
Physical states are values of z, and the energy of state z is −log p(x, z).
Therefore, the ELBO takes the form of a negative Helmholtz free energy.
15 / 28
Interpretations of ELBO
4. Free energy
\[
\mathrm{ELBO} = -\mathbb{E}_{q(z)}\left[ -\log p(x, z) \right] + H(q(z))
\]
We know that the distribution over states that minimizes the free energy is the
Boltzmann distribution:
\[
q(z) \propto \exp(-E(z)) = p(x, z) \propto p(z|x). \tag{14}
\]
We see that the distribution q(z) that maximizes the ELBO is p(z|x), the true
posterior.
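That the Boltzmann distribution minimizes the free energy can be verified directly on a discrete state space (energies are made up for illustration):

```python
import math, random

random.seed(0)
E = [0.5, 1.7, 0.2, 3.0]  # energies of four discrete states

def free_energy(q):
    """F(q) = E_q[E(z)] - H(q) = sum_z q(z) (E(z) + log q(z))."""
    return sum(qz * (Ez + math.log(qz)) for qz, Ez in zip(q, E) if qz > 0)

# Boltzmann distribution q*(z) = exp(-E(z)) / Z.
Z = sum(math.exp(-Ez) for Ez in E)
boltzmann = [math.exp(-Ez) / Z for Ez in E]

# F(q*) = -log Z, and no other distribution does better.
assert abs(free_energy(boltzmann) + math.log(Z)) < 1e-9
for _ in range(100):
    w = [random.random() for _ in E]
    q = [wi / sum(w) for wi in w]
    assert free_energy(boltzmann) <= free_energy(q) + 1e-9
```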
16 / 28
Interpretations of ELBO
5. Minimum Description Length (the "bits-back" argument)
Suppose we want to describe a datapoint x using as few bits as possible. We can
first describe z, and then describe x given z. Shannon's source coding theorem says
that this scheme takes at least
\[
\mathbb{E}_{x \sim \mathrm{data},\, z \sim q(z|x)}\left[ -\log p(z) - \log p(x|z) \right]
\]
bits.
17 / 28
Interpretations of ELBO
5. Minimum Description Length (the "bits-back" argument)
Suppose we want to describe a datapoint x using as few bits as possible. We can
first describe z, and then describe x given z. Shannon's source coding theorem says
that this scheme takes at least
\[
\mathbb{E}_{x \sim \mathrm{data},\, z \sim q(z|x)}\left[ -\log p(z) - \log p(x|z) \right]
\]
bits.
The "bits-back" argument is that after the receiver knows x, it can recompute
q(z|x), so we should subtract this extra information from the cost of describing z.
The description length is then
\[
\mathbb{E}_{x \sim \mathrm{data},\, z \sim q(z|x)}\left[ -\log p(z) + \log q(z|x) - \log p(x|z) \right] = -\mathrm{ELBO}.
\]
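Numerically, the bits-back description length equals −ELBO exactly and is never longer than the naive two-part code (a discrete toy with made-up probabilities; lengths in nats):

```python
import math

# Toy model: discrete z in {0, 1}, one datapoint x.
p_z = [0.4, 0.6]
p_x_given_z = [0.1, 0.7]
q = [0.3, 0.7]  # q(z|x) for this x

# Naive two-part code: describe z with -log p(z), then x with -log p(x|z).
naive = sum(qz * (-math.log(p_z[z]) - math.log(p_x_given_z[z]))
            for z, qz in enumerate(q))

# Bits-back: get -log q(z|x) nats back once the receiver recovers q(z|x).
bits_back = sum(qz * (-math.log(p_z[z]) + math.log(qz)
                      - math.log(p_x_given_z[z]))
                for z, qz in enumerate(q))

elbo = sum(qz * (math.log(p_z[z] * p_x_given_z[z]) - math.log(qz))
           for z, qz in enumerate(q))

assert abs(bits_back - (-elbo)) < 1e-9  # description length = -ELBO
assert bits_back <= naive               # we saved H(q) >= 0 nats
```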
18 / 28
Interpretations of ELBO
1. lower bound of p(x)
2. learning to output p(z|x)
3. autoencoder with regularization
4. free energy
5. communication cost
20 / 28
2 latent variables?
[Graphical model: z2 → z1 → x, inside a plate of size N]
Networks p(x|z1; θ), p(z1|z2; θ), q(z1|x; φ), q(z2|z1; φ).
Maximize \(\mathrm{ELBO} = \log \frac{p(z_2)\, p(z_1|z_2; \theta)\, p(x|z_1; \theta)}{q(z_2|z_1; \phi)\, q(z_1|x; \phi)}\),
where z1 ∼ q(z1|x; φ) and z2 ∼ q(z2|z1; φ).
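The single-sample ELBO here is just the log of the generative terms minus the log of the inference terms. A minimal sketch with unit-variance Gaussians and random linear maps standing in for all networks (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def logpdf(v, mu):
    """Sum of elementwise log N(v; mu, 1)."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi) + (v - mu) ** 2)))

d = 2
x = rng.normal(size=d)
# Stand-ins for the four networks (random linear maps, unit variance).
W_q1, W_q2, W_p1, W_px = (rng.normal(size=(d, d)) for _ in range(4))

# Bottom-up inference: z1 ~ q(z1|x), then z2 ~ q(z2|z1).
mu_q1 = W_q1 @ x
z1 = mu_q1 + rng.normal(size=d)
mu_q2 = W_q2 @ z1
z2 = mu_q2 + rng.normal(size=d)

# Single-sample ELBO: generative log-probs minus inference log-probs.
elbo = (logpdf(z2, np.zeros(d))   # log p(z2)
        + logpdf(z1, W_p1 @ z2)   # log p(z1|z2)
        + logpdf(x, W_px @ z1)    # log p(x|z1)
        - logpdf(z1, mu_q1)       # -log q(z1|x)
        - logpdf(z2, mu_q2))      # -log q(z2|z1)
```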
21 / 28
2 latent variables?
A better way
[Graphical model: z2 → z1 → x, inside a plate of size N]
"Ladder variational autoencoders" by Sønderby et al.
22 / 28
2 latent variables?
[Graphical model: z2 → z1 → x, inside a plate of size N]
Networks p(x|z1; θ), p(z1|z2; θ), q(z2|x; φ), q(z1|x, z2; φ, θ).
Maximize \(\mathrm{ELBO} = \log \frac{p(z_2)\, p(z_1|z_2; \theta)\, p(x|z_1; \theta)}{q(z_2|x; \phi)\, q(z_1|x, z_2; \phi, \theta)}\),
where z2 ∼ q(z2|x; φ) and z1 ∼ q(z1|x, z2; φ, θ).
23 / 28
Neural Statistician
[Graphical model: context c → latent z → observed x, with z and x inside a plate of size N, nested inside a plate of size T]
"Towards a neural statistician" by Edwards and Storkey
24 / 28
Conditional VAE
[Graphical model: observed x and y with latent z, inside a plate of size N]
"Learning structured output representation using deep conditional generative models" by Sohn, Lee, and Yan
25. 25 / 28
semi-supervised VAE
x
z
y
N
Semi-Supervised Learning with Deep Generative Models by Kingma et al.
26 / 28
Few-shot classification
[Graphical model: x, y with class embedding z_c; plates of sizes N, M, and T]
"Siamese neural networks for one-shot image recognition" by Koch; "Matching networks for one shot learning" by Vinyals et al.; "Prototypical networks for few-shot learning" by Snell, Swersky, and Zemel
27 / 28
Few-shot classification
special case: triplet loss
[Graphical model: x, y with class embedding z_c; plates of sizes 3, 2, and T]
"Deep Metric Learning Using Triplet Network" by Hoffer and Ailon
29 / 28
References I
[1] Harrison Edwards and Amos Storkey. “Towards a neural statistician”. In:
arXiv preprint arXiv:1606.02185 (2016).
[2] Elad Hoffer and Nir Ailon. “Deep Metric Learning Using Triplet Network”.
In: Lecture Notes in Computer Science (2015), 84–92. issn: 1611-3349.
doi: 10.1007/978-3-319-24261-3_7. url:
http://dx.doi.org/10.1007/978-3-319-24261-3_7.
[3] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”.
In: arXiv preprint arXiv:1312.6114 (2013).
[4] Diederik P. Kingma et al. Semi-Supervised Learning with Deep Generative
Models. 2014. arXiv: 1406.5298 [cs.LG].
[5] Gregory Koch. “Siamese neural networks for one-shot image recognition”.
In: ICML Deep Learning Workshop. 2015.
30 / 28
References II
[6] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.
“Stochastic backpropagation and approximate inference in deep generative
models”. In: arXiv preprint arXiv:1401.4082 (2014).
[7] Jake Snell, Kevin Swersky, and Richard Zemel. “Prototypical networks for
few-shot learning”. In: Advances in Neural Information Processing Systems.
2017, pp. 4077–4087.
[8] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. “Learning structured output
representation using deep conditional generative models”. In: Advances in
Neural Information Processing Systems. 2015, pp. 3483–3491.
[9] Casper Kaae Sønderby et al. “Ladder variational autoencoders”. In:
Advances in neural information processing systems. 2016, pp. 3738–3746.
[10] Oriol Vinyals et al. “Matching networks for one shot learning”. In:
Advances in Neural Information Processing Systems. 2016, pp. 3630–3638.