2 / 28
Why unsupervised/generative?
Intelligence existed before labels did.
Unsupervised = cake, supervised = icing, RL = cherry. (LeCun)
A human brain has $10^{15}$ connections and lives for $10^9$ seconds. (Hinton)
"What I cannot create, I do not understand." (Feynman)
3 / 28
A latent variable model
[Graphical model: observed x1, x2, x3, x4, each generated from its own latent z1, z2, z3, z4]
4 / 28
A latent variable model
plate notation
[Graphical model: latent z pointing to observed x, inside a plate of size N]
5 / 28
Learning a generative model
Assume z ∼ p(z) and x ∼ p(x|z; θ). A good generative model is one that
generates real-looking data, so let's find
\[
\theta^* = \arg\max_\theta \, p(x_{\mathrm{real}}; \theta) \tag{1}
\]
6 / 28
Learning a generative model
First attempt
Maybe we can find θ* using gradient ascent on something that increases with p(x; θ)?
\begin{align}
\theta^* = \arg\max_\theta \, p(x; \theta) &= \arg\max_\theta \, \log p(x; \theta) \tag{2} \\
&= \arg\max_\theta \, \log \int p(z)\, p(x|z; \theta)\, dz \tag{3} \\
&= \arg\max_\theta \, \log \mathbb{E}_{p(z)}\left[ p(x|z; \theta) \right] \tag{4}
\end{align}
Even evaluating p(x; θ) requires an integral; a Monte Carlo estimate of the
expectation is easy, but plugging it into the log gives a biased estimate.
7 / 28
Learning a generative model
The ELBO
Let's give up on optimizing p(x; θ) directly. Let q(z|x) be anything we can
sample from. By Jensen's inequality (omitting parameters),
\begin{align}
\log p(x) &= \log \mathbb{E}_{p(z)}\left[ p(x|z) \right] \tag{5} \\
&= \log \mathbb{E}_{q(z|x)}\left[ \frac{p(z)\, p(x|z)}{q(z|x)} \right] = \log \mathbb{E}_{q(z|x)}\left[ \frac{p(x, z)}{q(z|x)} \right] \tag{6} \\
&\geq \mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{q(z|x)} \right] \tag{7}
\end{align}
with equality when q(z|x) = p(z|x) everywhere. This quantity is called the ELBO
(Evidence Lower BOund). We assumed we can sample z ∼ q(z|x), so we can get
Monte Carlo estimates of the ELBO.
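The bound in (7) can be checked numerically. A minimal sketch with a discrete latent z ∈ {0, 1} and made-up probabilities (all numbers are illustrative): any valid q gives ELBO ≤ log p(x), with equality at the true posterior.

```python
import math

# Toy discrete model: z in {0, 1}, a single fixed observation x.
p_z = [0.4, 0.6]          # prior p(z)
p_x_given_z = [0.1, 0.7]  # likelihood p(x|z) for the fixed x

# Evidence p(x) = sum_z p(z) p(x|z)
p_x = sum(pz * px for pz, px in zip(p_z, p_x_given_z))

def elbo(q):
    """E_q[log p(x, z) - log q(z|x)] for a discrete q."""
    return sum(qz * (math.log(p_z[z] * p_x_given_z[z]) - math.log(qz))
               for z, qz in enumerate(q))

# Any valid q gives a lower bound on log p(x) ...
assert elbo([0.5, 0.5]) <= math.log(p_x)

# ... with equality exactly when q(z|x) = p(z|x).
posterior = [p_z[z] * p_x_given_z[z] / p_x for z in range(2)]
assert abs(elbo(posterior) - math.log(p_x)) < 1e-9
```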
8 / 28
Variational Autoencoders
[Graphical model: latent z pointing to observed x, inside a plate of size N]
Networks p(x|z; θ), q(z|x; φ).
Maximize \(\mathrm{ELBO} = \log \frac{p(z)\, p(x|z; \theta)}{q(z|x; \phi)}\), where z ∼ q(z|x; φ).
"Auto-encoding variational bayes" by Kingma and Welling;
"Stochastic backpropagation and approximate inference in deep generative models" by Rezende, Mohamed, and Wierstra
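The recipe on this slide can be sketched in a few lines. This is not the training loop from the papers, just a single-sample ELBO estimate with the reparameterization trick, using random linear maps as stand-ins for the encoder and decoder networks (all shapes and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logpdf(v, mu, log_var):
    """Sum of elementwise log N(v; mu, exp(log_var))."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi) + log_var
                                + (v - mu) ** 2 / np.exp(log_var))))

d_x, d_z = 4, 2
W_enc = rng.normal(size=(2 * d_z, d_x))  # stand-in encoder: outputs [mu_q, log_var_q]
W_dec = rng.normal(size=(d_x, d_z))      # stand-in decoder: outputs mean of p(x|z)

x = rng.normal(size=d_x)

# q(z|x; phi): reparameterize z = mu + sigma * eps so gradients could flow.
h = W_enc @ x
mu_q, log_var_q = h[:d_z], h[d_z:]
eps = rng.normal(size=d_z)
z = mu_q + np.exp(0.5 * log_var_q) * eps

# Single-sample ELBO estimate: log p(z) + log p(x|z) - log q(z|x).
log_p_z = gaussian_logpdf(z, np.zeros(d_z), np.zeros(d_z))   # p(z) = N(0, I)
log_p_x_given_z = gaussian_logpdf(x, W_dec @ z, np.zeros(d_x))
log_q_z_given_x = gaussian_logpdf(z, mu_q, log_var_q)
elbo_estimate = log_p_z + log_p_x_given_z - log_q_z_given_x
```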
9 / 28
Variational Autoencoders
[Graphical model: latent z pointing to observed x, inside a plate of size N]
Networks p(x|z; θ), q(z|x; φ).
Maximize \(\mathrm{ELBO} = \log \frac{p(z)\, p(x|z; \theta)}{q(z|x; \phi)}\), where z ∼ q(z|x; φ).
What does this loss function mean?
10 / 28
Interpretations of ELBO
1. Lower bound of evidence
We have already shown that the ELBO is a lower bound on the evidence:
\[
\log p(x) \geq \mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{q(z|x)} \right] = \mathrm{ELBO} \tag{8}
\]
Thus, we can view optimizing the ELBO as approximately optimizing p(x):
\[
\arg\max_\theta \, p(x) \approx \arg\max_\theta \, \mathrm{ELBO} \tag{9}
\]
11 / 28
Interpretations of ELBO
2. Distance to posterior
Let's take a closer look at the gap between log p(x) and the ELBO:
\begin{align}
\log p(x) = \log \frac{p(x, z)}{p(z|x)} &= \mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{p(z|x)} \right] \tag{10} \\
&= \mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{q(z|x)} \right] + D_{\mathrm{KL}}\left( q(z|x) \,\|\, p(z|x) \right) \tag{11} \\
&= \mathrm{ELBO} + D_{\mathrm{KL}}\left( q(z|x) \,\|\, p(z|x) \right) \tag{12}
\end{align}
From the point of view of the inference network, maximizing the ELBO is
equivalent to minimizing the KL divergence to the posterior:
\[
\arg\min_\phi \, D_{\mathrm{KL}}\left( q(z|x) \,\|\, p(z|x) \right) = \arg\max_\phi \, \mathrm{ELBO} \tag{13}
\]
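Equation (12) is an exact identity, which a small discrete example can verify (probabilities are made up for illustration):

```python
import math

# Toy model: discrete z in {0, 1}, one fixed observation x.
p_z = [0.3, 0.7]
p_x_given_z = [0.2, 0.05]
p_x = sum(pz * px for pz, px in zip(p_z, p_x_given_z))
posterior = [p_z[z] * p_x_given_z[z] / p_x for z in range(2)]

q = [0.6, 0.4]  # an arbitrary variational distribution

elbo = sum(qz * (math.log(p_z[z] * p_x_given_z[z]) - math.log(qz))
           for z, qz in enumerate(q))
kl_q_posterior = sum(qz * (math.log(qz) - math.log(posterior[z]))
                     for z, qz in enumerate(q))

# Equation (12): log p(x) = ELBO + KL(q(z|x) || p(z|x)), exactly.
assert abs(math.log(p_x) - (elbo + kl_q_posterior)) < 1e-9
```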
12 / 28
Interpretations of ELBO
3. Autoencoder
\begin{align}
\mathrm{ELBO} &= \mathbb{E}_{q(z|x)}\left[ \log p(x, z) - \log q(z|x) \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{E}_{q(z_n|x_n)}\left[ \log p(x_n|z_n) \right] - D_{\mathrm{KL}}\left( q(z_n|x_n) \,\|\, p(z) \right) \right)
\end{align}
This can be used as a loss function if the KL divergence term can be computed
analytically.
For each datapoint x_n, we make the model reconstruct x_n while keeping each
embedding z_n close to the prior p(z).
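For the common choice of a diagonal-Gaussian q(z|x) and a standard-normal prior, the KL term is available in closed form. A minimal sketch of that standard formula (the inputs are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over dimensions -- the regularizer in the VAE loss."""
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

# KL is zero iff q already equals the prior ...
assert kl_to_standard_normal(np.zeros(3), np.zeros(3)) == 0.0
# ... and positive otherwise.
assert kl_to_standard_normal(np.array([1.0, -2.0]), np.array([0.5, 0.0])) > 0.0
```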
13 / 28
Interpretations of ELBO
3. Autoencoder
Write \(\bar{q}(z) = \frac{1}{N} \sum_{n=1}^{N} q(z_n)\) for the aggregate posterior. Then
\begin{align}
\frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left( q(z_n) \,\|\, p(z) \right)
&= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{q(z_n)}\left[ \log q(z_n) - \log \bar{q}(z) + \log \bar{q}(z) - \log p(z) \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left( q(z_n) \,\|\, \bar{q}(z) \right) + \mathbb{E}_{\bar{q}(z)}\left[ \log \bar{q}(z) - \log p(z) \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left( q(z_n) \,\|\, \bar{q}(z) \right) + D_{\mathrm{KL}}\left( \bar{q}(z) \,\|\, p(z) \right)
\end{align}
so
\[
\mathrm{ELBO} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{E}_{q(z_n)}\left[ \log p(x_n|z_n) \right] - D_{\mathrm{KL}}\left( q(z_n) \,\|\, \bar{q}(z) \right) \right) - D_{\mathrm{KL}}\left( \bar{q}(z) \,\|\, p(z) \right)
\]
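The decomposition above is exact for any set of per-datapoint posteriors, which can be checked on a discrete toy example (numbers illustrative):

```python
import math

def kl(a, b):
    """KL divergence between two discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# Per-datapoint posteriors q(z_n) over a discrete z, and the prior p(z).
qs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
p = [0.5, 0.5]
N = len(qs)

# Aggregate posterior: the average of the per-datapoint posteriors.
q_bar = [sum(q[z] for q in qs) / N for z in range(2)]

# avg_n KL(q_n || p) = avg_n KL(q_n || q_bar) + KL(q_bar || p), exactly.
lhs = sum(kl(q, p) for q in qs) / N
rhs = sum(kl(q, q_bar) for q in qs) / N + kl(q_bar, p)
assert abs(lhs - rhs) < 1e-9
```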
14 / 28
Interpretations of ELBO
4. Free energy
\begin{align}
\mathrm{ELBO} &= \mathbb{E}_{q(z)}\left[ \log p(x, z) - \log q(z) \right] \\
&= -\mathbb{E}_{q(z)}\left[ -\log p(x, z) \right] + H(q(z))
\end{align}
This is composed of a negative energy term plus the entropy of the distribution
over states.
Physical states are values of z, and the energy of state z is −log p(x, z).
Therefore, the ELBO takes the form of a negative Helmholtz free energy.
15 / 28
Interpretations of ELBO
4. Free energy
\[
\mathrm{ELBO} = -\mathbb{E}_{q(z)}\left[ -\log p(x, z) \right] + H(q(z))
\]
We know that the distribution over states that minimizes the free energy is the
Boltzmann distribution:
\[
q(z) \propto \exp(-E(z)) = p(x, z) \propto p(z|x). \tag{14}
\]
We see that the distribution q(z) that maximizes the ELBO is p(z|x), the true
posterior.
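That the Boltzmann distribution minimizes the free energy can be verified directly on a discrete state space (energies are made up for illustration):

```python
import math, random

random.seed(0)
E = [0.5, 1.7, 0.2, 3.0]  # energies of four discrete states

def free_energy(q):
    """F(q) = E_q[E(z)] - H(q) = sum_z q(z) (E(z) + log q(z))."""
    return sum(qz * (Ez + math.log(qz)) for qz, Ez in zip(q, E) if qz > 0)

# Boltzmann distribution q*(z) = exp(-E(z)) / Z.
Z = sum(math.exp(-Ez) for Ez in E)
boltzmann = [math.exp(-Ez) / Z for Ez in E]

# F(q*) = -log Z, and no other distribution does better.
assert abs(free_energy(boltzmann) + math.log(Z)) < 1e-9
for _ in range(100):
    w = [random.random() for _ in E]
    q = [wi / sum(w) for wi in w]
    assert free_energy(boltzmann) <= free_energy(q) + 1e-9
```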
16 / 28
Interpretations of ELBO
5. Minimum Description Length (the "bits-back" argument)
Suppose we want to describe a datapoint x using as few bits as possible. We can
first describe z, and then describe x given z. Shannon's source coding theorem says
that this scheme takes at least
\[
\mathbb{E}_{x \sim \mathrm{data},\, z \sim q(z|x)}\left[ -\log p(z) - \log p(x|z) \right]
\]
bits.
17 / 28
Interpretations of ELBO
5. Minimum Description Length (the "bits-back" argument)
Suppose we want to describe a datapoint x using as few bits as possible. We can
first describe z, and then describe x given z. Shannon's source coding theorem says
that this scheme takes at least
\[
\mathbb{E}_{x \sim \mathrm{data},\, z \sim q(z|x)}\left[ -\log p(z) - \log p(x|z) \right]
\]
bits.
The "bits-back" argument is that after the receiver knows x, it can recompute
q(z|x), so we should subtract this extra information from the cost of describing z.
The description length is then
\[
\mathbb{E}_{x \sim \mathrm{data},\, z \sim q(z|x)}\left[ -\log p(z) + \log q(z|x) - \log p(x|z) \right] = -\mathrm{ELBO}.
\]
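Numerically, the bits-back description length equals −ELBO exactly and is never longer than the naive two-part code (a discrete toy with made-up probabilities; lengths in nats):

```python
import math

# Toy model: discrete z in {0, 1}, one datapoint x.
p_z = [0.4, 0.6]
p_x_given_z = [0.1, 0.7]
q = [0.3, 0.7]  # q(z|x) for this x

# Naive two-part code: describe z with -log p(z), then x with -log p(x|z).
naive = sum(qz * (-math.log(p_z[z]) - math.log(p_x_given_z[z]))
            for z, qz in enumerate(q))

# Bits-back: get -log q(z|x) nats back once the receiver recovers q(z|x).
bits_back = sum(qz * (-math.log(p_z[z]) + math.log(qz)
                      - math.log(p_x_given_z[z]))
                for z, qz in enumerate(q))

elbo = sum(qz * (math.log(p_z[z] * p_x_given_z[z]) - math.log(qz))
           for z, qz in enumerate(q))

assert abs(bits_back - (-elbo)) < 1e-9  # description length = -ELBO
assert bits_back <= naive               # we saved H(q) >= 0 nats
```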
18 / 28
Interpretations of ELBO
1. lower bound of p(x)
2. learning to output p(z|x)
3. autoencoder with regularization
4. free energy
5. communication cost
20 / 28
2 latent variables?
[Graphical model: z2 → z1 → x, inside a plate of size N]
Networks p(x|z1; θ), p(z1|z2; θ), q(z1|x; φ), q(z2|z1; φ).
Maximize \(\mathrm{ELBO} = \log \frac{p(z_2)\, p(z_1|z_2; \theta)\, p(x|z_1; \theta)}{q(z_2|z_1; \phi)\, q(z_1|x; \phi)}\),
where z1 ∼ q(z1|x; φ) and z2 ∼ q(z2|z1; φ).
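The single-sample ELBO here is just the log of the generative terms minus the log of the inference terms. A minimal sketch with unit-variance Gaussians and random linear maps standing in for all networks (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def logpdf(v, mu):
    """Sum of elementwise log N(v; mu, 1)."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi) + (v - mu) ** 2)))

d = 2
x = rng.normal(size=d)
# Stand-ins for the four networks (random linear maps, unit variance).
W_q1, W_q2, W_p1, W_px = (rng.normal(size=(d, d)) for _ in range(4))

# Bottom-up inference: z1 ~ q(z1|x), then z2 ~ q(z2|z1).
mu_q1 = W_q1 @ x
z1 = mu_q1 + rng.normal(size=d)
mu_q2 = W_q2 @ z1
z2 = mu_q2 + rng.normal(size=d)

# Single-sample ELBO: generative log-probs minus inference log-probs.
elbo = (logpdf(z2, np.zeros(d))   # log p(z2)
        + logpdf(z1, W_p1 @ z2)   # log p(z1|z2)
        + logpdf(x, W_px @ z1)    # log p(x|z1)
        - logpdf(z1, mu_q1)       # -log q(z1|x)
        - logpdf(z2, mu_q2))      # -log q(z2|z1)
```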
21 / 28
2 latent variables?
A better way
[Graphical model: z2 → z1 → x, inside a plate of size N]
"Ladder variational autoencoders" by Sønderby et al.
22 / 28
2 latent variables?
[Graphical model: z2 → z1 → x, inside a plate of size N]
Networks p(x|z1; θ), p(z1|z2; θ), q(z2|x; φ), q(z1|x, z2; φ, θ).
Maximize \(\mathrm{ELBO} = \log \frac{p(z_2)\, p(z_1|z_2; \theta)\, p(x|z_1; \theta)}{q(z_2|x; \phi)\, q(z_1|x, z_2; \phi, \theta)}\),
where z2 ∼ q(z2|x; φ) and z1 ∼ q(z1|x, z2; φ, θ).
23 / 28
Neural Statistician
[Graphical model: context c → latent z → observed x, with z and x inside a plate of size N, nested inside a plate of size T]
"Towards a neural statistician" by Edwards and Storkey
24 / 28
Conditional VAE
[Graphical model: observed x and y with latent z, inside a plate of size N]
"Learning structured output representation using deep conditional generative models" by Sohn, Lee, and Yan
25. 25 / 28
semi-supervised VAE
x
z
y
N
Semi-Supervised Learning with Deep Generative Models by Kingma et al.
26 / 28
Few-shot classification
[Graphical model: x, y with class embedding z_c; plates of sizes N, M, and T]
"Siamese neural networks for one-shot image recognition" by Koch; "Matching networks for one shot learning" by Vinyals et al.; "Prototypical networks for few-shot learning" by Snell, Swersky, and Zemel
27 / 28
Few-shot classification
special case: triplet loss
[Graphical model: x, y with class embedding z_c; plates of sizes 3, 2, and T]
"Deep Metric Learning Using Triplet Network" by Hoffer and Ailon
29 / 28
References I
[1] Harrison Edwards and Amos Storkey. “Towards a neural statistician”. In:
arXiv preprint arXiv:1606.02185 (2016).
[2] Elad Hoffer and Nir Ailon. “Deep Metric Learning Using Triplet Network”.
In: Lecture Notes in Computer Science (2015), 84–92. issn: 1611-3349.
doi: 10.1007/978-3-319-24261-3_7. url:
http://dx.doi.org/10.1007/978-3-319-24261-3_7.
[3] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”.
In: arXiv preprint arXiv:1312.6114 (2013).
[4] Diederik P. Kingma et al. Semi-Supervised Learning with Deep Generative
Models. 2014. arXiv: 1406.5298 [cs.LG].
[5] Gregory Koch. “Siamese neural networks for one-shot image recognition”.
In: ICML Deep Learning Workshop. 2015.
30 / 28
References II
[6] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.
“Stochastic backpropagation and approximate inference in deep generative
models”. In: arXiv preprint arXiv:1401.4082 (2014).
[7] Jake Snell, Kevin Swersky, and Richard Zemel. “Prototypical networks for
few-shot learning”. In: Advances in Neural Information Processing Systems.
2017, pp. 4077–4087.
[8] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. “Learning structured output
representation using deep conditional generative models”. In: Advances in
Neural Information Processing Systems. 2015, pp. 3483–3491.
[9] Casper Kaae Sønderby et al. “Ladder variational autoencoders”. In:
Advances in neural information processing systems. 2016, pp. 3738–3746.
[10] Oriol Vinyals et al. “Matching networks for one shot learning”. In:
Advances in Neural Information Processing Systems. 2016, pp. 3630–3638.