1 / 28
Meta-learning and the ELBO
eddy.l
January 02, 2019
2 / 28
Why unsupervised/generative?
Intelligence existed before labels did
Unsupervised = cake, supervised = icing, RL = cherry (LeCun)
A human brain has $10^{15}$ connections and lives for $10^9$ seconds. (Hinton)
"What I cannot create, I do not understand." (Feynman)
3 / 28
A latent variable model
[Graphical model: four observed variables $x_1, \dots, x_4$, each generated from its own latent $z_1, \dots, z_4$]
4 / 28
A latent variable model
plate notation
[Plate notation: latent $z$ generates observed $x$; the plate repeats $N$ times]
5 / 28
Learning a generative model
Assume $z \sim p(z)$ and $x \sim p(x|z; \theta)$. A good generative model is one that generates real-looking data, so let's find
$$\theta^* = \arg\max_\theta \; p(x_{\mathrm{real}}; \theta) \tag{1}$$
6 / 28
Learning a generative model
First attempt
Maybe we can find $\theta^*$ using gradient ascent on something that increases with $p(x; \theta)$?
$$\begin{aligned}
\theta^* = \arg\max_\theta \, p(x; \theta) &= \arg\max_\theta \, \log p(x; \theta) && (2)\\
&= \arg\max_\theta \, \log \int p(z)\,p(x|z; \theta)\,dz && (3)\\
&= \arg\max_\theta \, \log \mathbb{E}_{p(z)}\left[p(x|z; \theta)\right] && (4)
\end{aligned}$$
Even evaluating $p(x; \theta)$ requires an integration; we cannot Monte-Carlo approximate it because the log sits outside the expectation.
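As a quick sanity check of this point, here is a minimal numerical sketch (the 1-D linear-Gaussian model and all constants are illustrative assumptions, not from the slides): plugging a Monte-Carlo average of $p(x|z)$ inside the log gives an estimate of $\log p(x)$ that is biased low.

```python
import numpy as np

# Toy model where log p(x) is tractable: p(z) = N(0,1), p(x|z) = N(z,1),
# so marginally p(x) = N(0,2). (Hypothetical setup for illustration.)
rng = np.random.default_rng(0)
x = 1.5

def log_gauss(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

true_log_px = log_gauss(x, 0.0, 2.0)

# Naive estimator: log of a Monte-Carlo average of p(x|z) over z ~ p(z).
K, trials = 10, 5000
estimates = []
for _ in range(trials):
    z = rng.standard_normal(K)
    estimates.append(np.log(np.mean(np.exp(log_gauss(x, z, 1.0)))))

# Jensen's inequality: the log-of-mean estimator underestimates log p(x).
print(true_log_px, np.mean(estimates))
```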
7 / 28
Learning a generative model
The ELBO
Let's give up on optimizing $p(x; \theta)$ directly. Let $q(z|x)$ be anything we can sample from. By Jensen's inequality (omitting parameters),
$$\begin{aligned}
\log p(x) &= \log \mathbb{E}_{p(z)}\left[p(x|z)\right] && (5)\\
&= \log \mathbb{E}_{q(z|x)}\left[\frac{p(z)\,p(x|z)}{q(z|x)}\right] = \log \mathbb{E}_{q(z|x)}\left[\frac{p(x,z)}{q(z|x)}\right] && (6)\\
&\geq \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] && (7)
\end{aligned}$$
with equality when $q(z|x) = p(z|x)$ everywhere. This quantity is called the ELBO (Evidence Lower BOund). Since we assumed we can sample $z \sim q(z|x)$, we can form unbiased Monte-Carlo estimates of the ELBO.
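Continuing the toy model from the earlier sketch (reusing `log_gauss`, `x`, `rng`, and `true_log_px`; the closed-form posterior is specific to that hypothetical Gaussian setup), each sample of $\log p(x,z) - \log q(z|x)$ with $z \sim q(z|x)$ is an unbiased ELBO estimate, and when $q$ equals the true posterior every sample equals $\log p(x)$ exactly:

```python
# For p(z) = N(0,1), p(x|z) = N(z,1), the true posterior is
# p(z|x) = N(x/2, 1/2); here we set q(z|x) equal to it.
def elbo_samples(x, num, rng):
    z = x / 2 + np.sqrt(0.5) * rng.standard_normal(num)   # z ~ q(z|x)
    log_joint = log_gauss(z, 0.0, 1.0) + log_gauss(x, z, 1.0)
    log_q = log_gauss(z, x / 2, 0.5)
    return log_joint - log_q

samples = elbo_samples(x, 1000, rng)
# Every sample equals log p(x): the bound is tight and has zero variance.
print(samples.std(), samples.mean(), true_log_px)
```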
8 / 28
Variational Autoencoders
[Plate notation: latent $z$ generates observed $x$; the plate repeats $N$ times]
Networks: $p(x|z; \theta)$, $q(z|x; \phi)$.
Maximize $\mathrm{ELBO} = \log \frac{p(z)\,p(x|z;\theta)}{q(z|x;\phi)}$, where $z \sim q(z|x; \phi)$.
“Auto-encoding variational bayes” by Kingma and Welling,
“Stochastic backpropagation and approximate inference in
deep generative models” by Rezende, Mohamed, and
Wierstra
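A minimal PyTorch sketch of this objective (the architecture sizes, Bernoulli likelihood, and standard-normal prior are illustrative assumptions, not prescribed by the slides):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # q(z|x) mean
        self.logvar = nn.Linear(h_dim, z_dim)   # q(z|x) log-variance
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def neg_elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterized sample z ~ q(z|x; phi)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.dec(z)
        # -E_q[log p(x|z; theta)]: Bernoulli reconstruction term
        rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        # D_KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (rec + kl) / x.shape[0]   # average negative ELBO; minimize

model = VAE()
x = torch.rand(8, 784).round()           # hypothetical binary batch
loss = model.neg_elbo(x)
loss.backward()
```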
9 / 28
Variational Autoencoders
[Plate notation: latent $z$ generates observed $x$; the plate repeats $N$ times]
Networks: $p(x|z; \theta)$, $q(z|x; \phi)$.
Maximize $\mathrm{ELBO} = \log \frac{p(z)\,p(x|z;\theta)}{q(z|x;\phi)}$, where $z \sim q(z|x; \phi)$.
What does this loss function mean?
10 / 28
Interpretations of ELBO
1. Lower bound of evidence
We have already shown that the ELBO is a lower bound on the evidence:
$$\log p(x) \geq \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] = \mathrm{ELBO} \tag{8}$$
Thus, we can view optimizing the ELBO as approximately optimizing $p(x)$:
$$\arg\max_\theta \, p(x) \approx \arg\max_\theta \, \mathrm{ELBO} \tag{9}$$
11 / 28
Interpretations of ELBO
2. Distance to posterior
Let's take a closer look at the gap between $\log p(x)$ and the ELBO:
$$\begin{aligned}
\log p(x) &= \log \frac{p(x,z)}{p(z|x)} = \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{p(z|x)}\right] && (10)\\
&= \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] + D_{\mathrm{KL}}\left(q(z|x)\,\|\,p(z|x)\right) && (11)\\
&= \mathrm{ELBO} + D_{\mathrm{KL}}\left(q(z|x)\,\|\,p(z|x)\right) && (12)
\end{aligned}$$
From the point of view of the inference network, maximizing the ELBO is equivalent to minimizing the KL divergence to the posterior:
$$\arg\min_\phi \, D_{\mathrm{KL}}\left(q(z|x)\,\|\,p(z|x)\right) = \arg\max_\phi \, \mathrm{ELBO} \tag{13}$$
12 / 28
Interpretations of ELBO
3. Autoencoder
$$\begin{aligned}
\mathrm{ELBO} &= \mathbb{E}_{q(z|x)}\left[\log p(x,z) - \log q(z|x)\right]\\
&= \frac{1}{N}\sum_{n=1}^{N}\Big(\mathbb{E}_{q(z_n|x_n)}\left[\log p(x_n|z_n)\right] - D_{\mathrm{KL}}\left(q(z_n|x_n)\,\|\,p(z)\right)\Big)
\end{aligned}$$
This can be used as a loss function if the KL divergence term can be computed analytically.
For each datapoint $x_n$, we make the model reconstruct $x_n$ while keeping each embedding $z_n$ close to the prior $p(z)$.
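For the common choice of a diagonal-Gaussian encoder and a standard-normal prior (an assumption beyond this slide, but the standard VAE setup), the KL term has the closed form
$$D_{\mathrm{KL}}\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\right) = \frac{1}{2}\sum_{d}\left(\mu_d^2 + \sigma_d^2 - \log \sigma_d^2 - 1\right).$$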
13 / 28
Interpretations of ELBO
3. Autoencoder
Writing $q(z_n)$ for $q(z|x_n)$ and $q(z) = \frac{1}{N}\sum_{n=1}^{N} q(z_n)$ for the aggregate posterior,
$$\begin{aligned}
\frac{1}{N}\sum_{n=1}^{N} D_{\mathrm{KL}}\left(q(z_n)\,\|\,p(z)\right)
&= \frac{1}{N}\sum_{n=1}^{N} \int q(z_n)\left(\log q(z_n) - \log q(z) + \log q(z) - \log p(z)\right) dz\\
&= \frac{1}{N}\sum_{n=1}^{N} D_{\mathrm{KL}}\left(q(z_n)\,\|\,q(z)\right) + \int \frac{\sum_{n=1}^{N} q(z_n)}{N}\left(\log q(z) - \log p(z)\right) dz\\
&= \frac{1}{N}\sum_{n=1}^{N} D_{\mathrm{KL}}\left(q(z_n)\,\|\,q(z)\right) + D_{\mathrm{KL}}\left(q(z)\,\|\,p(z)\right)
\end{aligned}$$
so
$$\mathrm{ELBO} = \frac{1}{N}\sum_{n=1}^{N}\Big(\mathbb{E}_{q(z_n)}\left[\log p(x_n|z_n)\right] - D_{\mathrm{KL}}\left(q(z_n)\,\|\,q(z)\right)\Big) - D_{\mathrm{KL}}\left(q(z)\,\|\,p(z)\right)$$
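A small numerical sketch of this decomposition (1-D Gaussians with made-up means; $q(z)$ is the aggregate posterior defined above). With shared samples, the identity holds term by term:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_gauss(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

mus = np.array([-2.0, 0.0, 2.0])      # q(z_n) = N(mu_n, 1); p(z) = N(0, 1)
z = mus[:, None] + rng.standard_normal((3, 100000))     # z ~ q(z_n)

log_qn = log_gauss(z, mus[:, None], 1.0)                # log q(z_n)
log_q = np.log(np.mean(np.exp(log_gauss(z[..., None], mus, 1.0)), axis=-1))
log_p = log_gauss(z, 0.0, 1.0)                          # log p(z)

lhs = np.mean(log_qn - log_p)                  # (1/N) sum_n KL(q(z_n)||p(z))
rhs = np.mean(log_qn - log_q) + np.mean(log_q - log_p)  # two-KL decomposition
print(lhs, rhs)                                # equal, as derived above
```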
14 / 28
Interpretations of ELBO
4. Free energy
$$\begin{aligned}
\mathrm{ELBO} &= \mathbb{E}_{q(z)}\left[\log p(x,z) - \log q(z)\right]\\
&= -\mathbb{E}_{q(z)}\left[-\log p(x,z)\right] + H(q(z))
\end{aligned}$$
This is a negative energy term plus the entropy of the distribution over states.
Physical states are values of $z$, and the energy of state $z$ is $-\log p(x,z)$. Therefore, the ELBO takes the form of a negative Helmholtz free energy.
15 / 28
Interpretations of ELBO
4. Free energy
$$\mathrm{ELBO} = -\mathbb{E}_{q(z)}\left[-\log p(x,z)\right] + H(q(z))$$
We know that the distribution over states that minimizes the free energy is the Boltzmann distribution:
$$q^*(z) \propto \exp(-E(z)) = p(x,z) \propto p(z|x). \tag{14}$$
So the distribution of $z$ that maximizes the ELBO is $p(z|x)$, the true posterior.
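Spelling out the link (a standard identity, implicit in the slide): since $e^{-E(z)} = p(x,z) = p(z|x)\,p(x)$,
$$-\mathrm{ELBO} = \mathbb{E}_{q(z)}\left[E(z)\right] - H(q(z)) = \mathbb{E}_{q(z)}\left[\log \frac{q(z)}{p(z|x)}\right] - \log p(x) = D_{\mathrm{KL}}\left(q(z)\,\|\,p(z|x)\right) - \log p(x),$$
which is minimized, with value $-\log p(x)$, exactly at $q(z) = p(z|x)$.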
16 / 28
Interpretations of ELBO
5. Minimum Description Length (the ”bits-back” argument)
Suppose we want to describe a datapoint $x$ using as few bits as possible. We can first describe $z$, and then describe $x$ given $z$. Shannon's source coding theorem says that this scheme takes at least
$$\mathbb{E}_{x\sim\mathrm{data},\, z\sim q(z|x)}\left[-\log p(z) - \log p(x|z)\right]$$
bits on average.
17 / 28
Interpretations of ELBO
5. Minimum Description Length (the ”bits-back” argument)
Suppose we want to describe a datapoint $x$ using as few bits as possible. We can first describe $z$, and then describe $x$ given $z$. Shannon's source coding theorem says that this scheme takes at least
$$\mathbb{E}_{x\sim\mathrm{data},\, z\sim q(z|x)}\left[-\log p(z) - \log p(x|z)\right]$$
bits on average.
The "bits-back" argument is that after we know $x$, we can directly compute $q(z|x)$, so we should subtract this extra information from the cost of describing $z$. The description length is then
$$\mathbb{E}_{x\sim\mathrm{data},\, z\sim q(z|x)}\left[-\log p(z) + \log q(z|x) - \log p(x|z)\right] = -\mathrm{ELBO},$$
so minimizing the description length is exactly maximizing the ELBO.
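As a unit check (a sketch; the ELBO value is a made-up number, and this assumes the ELBO was computed in nats, so dividing by $\ln 2$ converts to bits):

```python
import numpy as np

elbo_nats = -90.0                        # hypothetical per-datapoint ELBO
description_length_bits = -elbo_nats / np.log(2)
print(description_length_bits)           # ~129.8 bits to transmit one x
```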
18 / 28
Interpretations of ELBO
1. Lower bound on $p(x)$
2. Learning to output $p(z|x)$
3. Autoencoder with regularization
4. Free energy
5. Communication cost
19 / 28
2 latent variables?
[Graphical model: $z_2 \to z_1 \to x$, repeated $N$ times]
20 / 28
2 latent variables?
[Graphical model: $z_2 \to z_1 \to x$, repeated $N$ times]
Networks: $p(x|z_1; \theta)$, $p(z_1|z_2; \theta)$, $q(z_2|z_1; \phi)$, $q(z_1|x; \phi)$.
Maximize $\mathrm{ELBO} = \log \frac{p(z_2)\,p(z_1|z_2;\theta)\,p(x|z_1;\theta)}{q(z_2|z_1;\phi)\,q(z_1|x;\phi)}$, where $z_1 \sim q(z_1|x; \phi)$ and $z_2 \sim q(z_2|z_1; \phi)$.
21 / 28
2 latent variables?
A better way
[Graphical model: $z_2 \to z_1 \to x$, repeated $N$ times]
“Ladder variational autoencoders” by Sønderby et al.
22 / 28
2 latent variables?
[Graphical model: $z_2 \to z_1 \to x$, repeated $N$ times]
Networks: $p(x|z_1; \theta)$, $p(z_1|z_2; \theta)$, $q(z_2|x; \phi)$, $q(z_1|x, z_2; \phi, \theta)$.
Maximize $\mathrm{ELBO} = \log \frac{p(z_2)\,p(z_1|z_2;\theta)\,p(x|z_1;\theta)}{q(z_2|x;\phi)\,q(z_1|x,z_2;\phi,\theta)}$, where $z_2 \sim q(z_2|x; \phi)$ and $z_1 \sim q(z_1|x, z_2; \phi, \theta)$.
23 / 28
Neural Statistician
[Plate notation: dataset-level latent $c$ generates per-datapoint latents $z$ and observations $x$; $N$ datapoints per dataset, $T$ datasets]
“Towards a neural statistician” by Edwards and Storkey
24 / 28
Conditional VAE
[Graphical model over input $x$, output $y$, and latent $z$, repeated $N$ times]
“Learning structured output representation using deep conditional generative models”
by Sohn, Lee, and Yan
25 / 28
Semi-supervised VAE
[Graphical model: latent $z$ and label $y$ generate $x$, repeated $N$ times]
“Semi-Supervised Learning with Deep Generative Models” by Kingma et al.
26 / 28
Few-shot classification
[Graphical model: class embedding $z_c$ generates $(x, y)$ pairs; plates of size $N$, $M$, and $T$]
“Siamese neural networks for one-shot image recognition” by Koch, “Matching
networks for one shot learning” by Vinyals et al., “Prototypical networks for few-shot
learning” by Snell, Swersky, and Zemel
27 / 28
Few-shot classification
special case: triplet loss
[Graphical model: class embedding $z_c$ generates $(x, y)$ pairs; plates of size 3 and 2, over $T$ tasks]
“Deep Metric Learning Using Triplet Network” by Hoffer and Ailon
28 / 28
Few-shot classification
Triplet loss
$$\begin{aligned}
\mathcal{L} &\sim d(a, p) - d(a, n) && (15)\\
&= -\log p(a \mid \mathcal{N}(p, \alpha I)) + \log p(a \mid \mathcal{N}(n, \alpha I)) && (16)
\end{aligned}$$
Prototypical Net logits:
$$\begin{aligned}
&\left(-d^2(a, c_1),\, -d^2(a, c_2),\, \dots,\, -d^2(a, c_n)\right) && (17)\\
&= \left(\log p(a \mid \mathcal{N}(c_1, \alpha I)),\, \dots,\, \log p(a \mid \mathcal{N}(c_n, \alpha I))\right) && (18)
\end{aligned}$$
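A numpy sketch of equations (17)-(18) (the embedding dimension and $\alpha = 1/2$ are illustrative choices; with $\alpha = 1/2$ the Gaussian log-density is exactly $-d^2$ plus a constant shared across classes, so the softmax probabilities coincide):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, alpha = 5, 0.5
a = rng.standard_normal(dim)            # query embedding
cs = rng.standard_normal((3, dim))      # class prototypes c_1..c_3

d2 = np.sum((a - cs) ** 2, axis=1)      # squared distances d^2(a, c_k)
# log N(a; c_k, alpha*I) = -dim/2 * log(2*pi*alpha) - d^2 / (2*alpha)
log_dens = -0.5 * dim * np.log(2 * np.pi * alpha) - d2 / (2 * alpha)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Identical class probabilities: the logits differ by a shared constant.
print(softmax(-d2))
print(softmax(log_dens))
```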
29 / 28
References I
[1] Harrison Edwards and Amos Storkey. “Towards a neural statistician”. In:
arXiv preprint arXiv:1606.02185 (2016).
[2] Elad Hoffer and Nir Ailon. “Deep Metric Learning Using Triplet Network”.
In: Lecture Notes in Computer Science (2015), 84–92. issn: 1611-3349.
doi: 10.1007/978-3-319-24261-3_7. url:
http://dx.doi.org/10.1007/978-3-319-24261-3_7.
[3] Diederik P. Kingma and Max Welling. “Auto-encoding variational bayes”. In: arXiv preprint arXiv:1312.6114 (2013).
[4] Diederik P. Kingma et al. “Semi-Supervised Learning with Deep Generative Models”. In: arXiv preprint arXiv:1406.5298 (2014).
[5] Gregory Koch. “Siamese neural networks for one-shot image recognition”. 2015.
30 / 28
References II
[6] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.
“Stochastic backpropagation and approximate inference in deep generative
models”. In: arXiv preprint arXiv:1401.4082 (2014).
[7] Jake Snell, Kevin Swersky, and Richard Zemel. “Prototypical networks for
few-shot learning”. In: Advances in Neural Information Processing Systems.
2017, pp. 4077–4087.
[8] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. “Learning structured output
representation using deep conditional generative models”. In: Advances in
Neural Information Processing Systems. 2015, pp. 3483–3491.
[9] Casper Kaae Sønderby et al. “Ladder variational autoencoders”. In:
Advances in neural information processing systems. 2016, pp. 3738–3746.
[10] Oriol Vinyals et al. “Matching networks for one shot learning”. In:
Advances in Neural Information Processing Systems. 2016, pp. 3630–3638.
31 / 28
Thank You