Variational Learning and Inference
with Deep Generative Neural Networks
Lawrence Carin
Duke University
11 December 2017
Model Development
• We are often interested in learning a model of the form
x ∼ pθ(x|z), z ∼ p(z)
where θ are unknown model parameters, and z are latent variables
drawn from known prior p(z)
• Model parameters θ are fixed for all data x
• Variation in x accounted for via variation of z, representing latent
processes
Example: ImageNet 1.2 Million Images
x ∼ pθ(x|z) with each z ∼ p(z) corresponding to an image
Questions: What’s the right model pθ(x|z), and how to determine θ?
Maximum Likelihood Learning
• Let q(x) represent the true, unknown distribution of the data
• Seek θ for which pθ(x) accurately models q(x)
• Maximum likelihood (ML) learning:
ˆθ = argmaxθ Eq(x) log pθ(x) ≈ (1/N) Σ_{i=1}^{N} log pθ(xi)
where {xi}_{i=1,...,N} are the observed data
• Problem: pθ(x) = ∫ pθ(x|z) p(z) dz is typically intractable to compute
Variational Approximation
• Let qφ(z|x) be a parametric approximation to
pθ(z|x) = pθ(x|z) p(z) / ∫ pθ(x|z) p(z) dz
• Consider the variational expression
L(θ, φ) = Eq(x) Eqφ(z|x) log [ pθ(x|z) p(z) / qφ(z|x) ]
= Eq(x) [ log pθ(x) − KL(qφ(z|x) ‖ pθ(z|x)) ]
≤ Eq(x) log pθ(x)
• Alternate between θ and φ to maximize
L(θ, φ) ≈ (1/N) Σ_{i=1}^{N} Eqφ(zi|xi) log [ pθ(xi|zi) p(zi) / qφ(zi|xi) ]
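As a concrete illustration (my own sketch, not from the talk), the per-datum bound above can be estimated by Monte Carlo with one or a few samples, assuming an explicit Gaussian qφ(z|x) and a unit-variance Gaussian pθ(x|z); the encoder and decoder networks are placeholders. (Later slides replace the explicit qφ(z|x) with an implicit sampler.)

```python
# Minimal sketch (illustrative assumptions): single- or few-sample Monte Carlo
# estimate of E_{q_phi(z|x)} log[ p_theta(x|z) p(z) / q_phi(z|x) ].
import torch

def elbo_estimate(x, encoder, decoder, n_samples=1):
    """encoder(x) -> (mu, log_sigma); decoder(z) -> reconstruction mean."""
    mu, log_sigma = encoder(x)
    total = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(mu)
        z = mu + torch.exp(log_sigma) * eps          # reparameterized z ~ q_phi(z|x)
        x_hat = decoder(z)
        log_p_x_given_z = -0.5 * ((x - x_hat) ** 2).sum(dim=-1)       # unit-variance Gaussian likelihood (up to const)
        log_p_z = -0.5 * (z ** 2).sum(dim=-1)                         # standard normal prior (up to const)
        log_q_z_given_x = (-0.5 * eps ** 2 - log_sigma).sum(dim=-1)   # Gaussian q_phi(z|x) (up to const)
        total = total + (log_p_x_given_z + log_p_z - log_q_z_given_x)
    return (total / n_samples).mean()                # average over samples and the minibatch
```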
Form of the Approximating Distributions
• We typically use
pθ(x|z) = δ(x − fθ(z)),
with fθ(z) a deterministic function
• Randomness in pθ(x) manifested by latent variable z ∼ p(z)
• We do not assume an explicit form for qφ(z|x); we simply build a model to sample from this distribution
z = gφ(x, δ) , δ ∼ N(0, I)
• Here we employ deep neural networks for fθ(z) and gφ(x, δ)
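As an illustration of these two components, here is a hedged PyTorch sketch (architecture, layer sizes, and dimensions are my own assumptions, not those used in the talk): a deterministic decoder x = fθ(z), and an implicit encoder z = gφ(x, δ) that consumes injected noise δ ∼ N(0, I) instead of parameterizing an explicit density.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Deterministic decoder x = f_theta(z)."""
    def __init__(self, z_dim=64, x_dim=784, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))
    def forward(self, z):
        return self.net(z)

class ImplicitEncoder(nn.Module):
    """Stochastic encoder z = g_phi(x, delta), delta ~ N(0, I); no explicit density for q_phi(z|x)."""
    def __init__(self, x_dim=784, z_dim=64, noise_dim=64, hidden=512):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(nn.Linear(x_dim + noise_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, z_dim))
    def forward(self, x):
        delta = torch.randn(x.shape[0], self.noise_dim, device=x.device)  # delta ~ N(0, I)
        return self.net(torch.cat([x, delta], dim=-1))
```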
Summarizing Model
• Generative process for data x
z ∼ p(z)
x(z) = fθ(z)
• Generative process for latent code z given x
δ ∼ N(0, I)
z = gφ(x, δ)
• fθ(z) and gφ(x, δ) are learned deep neural networks
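Reusing the illustrative networks sketched above, the two sampling processes on this slide look roughly like the following (batch and dimension sizes are the same assumed toy values):

```python
import torch

decoder, encoder = Decoder(), ImplicitEncoder()   # classes from the sketch above

# Generative process for data x: z ~ p(z), x = f_theta(z)
z = torch.randn(16, 64)
x_generated = decoder(z)

# Inference for latent z given observed x: delta ~ N(0, I) is drawn inside the encoder
x_observed = torch.randn(16, 784)   # stand-in for a minibatch of data
z_inferred = encoder(x_observed)
```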
Variational Autoencoder
• Distribution pθ(x|z) is termed the decoder, and qφ(z|x) the encoder
Forms of the Variational Lower Bound
L(θ, φ) = Eq(x) Eqφ(z|x) log [ pθ(x|z) p(z) / qφ(z|x) ]
= Eq(x) log pθ(x) − Eq(x) KL(qφ(z|x) ‖ pθ(z|x))
• Maximizing L(θ, φ) amounts to minimizing the expected divergence Eq(x) KL(qφ(z|x) ‖ pθ(z|x)) between the approximate and true posterior
• May also be expressed
L(θ, φ) = −KL(qφ(x, z) ‖ pθ(x, z)) + C
where qφ(x, z) = q(x) qφ(z|x), pθ(x, z) = p(z) pθ(x|z), and C = Eq(x) log q(x)
Cumulative Marginal Distributions
• We previously defined
pθ(x) = Ep(z)pθ(x|z)
• We now similarly define
qφ(z) = Eq(x)qφ(z|x)
• qφ(z) represents the cumulative distribution for latent variables z,
across all x ∼ q(x)
• It is easily shown, by re-expressing KL(qφ(x, z) ‖ pθ(x, z)), that
L(θ, φ) = −Eq(x) KL(qφ(z|x) ‖ pθ(z|x)) − KL(q(x) ‖ pθ(x)) + C
= −Eqφ(z) KL(qφ(x|z) ‖ pθ(x|z)) − KL(qφ(z) ‖ p(z)) + C
Examination of the Variational Lower Bound
L(θ, φ) = −Eq(x) KL(qφ(z|x) ‖ pθ(z|x)) − KL(q(x) ‖ pθ(x)) + C
= −Eqφ(z) KL(qφ(x|z) ‖ pθ(x|z)) − KL(qφ(z) ‖ p(z)) + C
• First form encourages pθ(x) to be close to true data distribution q(x)
• Second form encourages qφ(z) to be close to the prior p(z)
• Also encourages matching of conditional distributions
• This looks good, but in reality it is not enough
• Culprit: the KL divergence is asymmetric
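A small numerical illustration of this asymmetry (my own example, not from the talk): q is a data distribution supported on two outcomes, and two candidate models are compared, one spreading mass beyond q's support and one missing part of it.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) = sum_i a_i log(a_i / b_i) for discrete distributions."""
    return float(np.sum(a * np.log(a / b)))

# Data distribution q: essentially all mass on the first two outcomes.
q = np.array([0.5, 0.5, 1e-12, 1e-12])
q = q / q.sum()
# Model A covers q's support but also spreads mass where q has none.
p_broad = np.array([0.25, 0.25, 0.25, 0.25])
# Model B misses part of q's support (almost no mass on the second outcome).
p_narrow = np.array([0.98, 1e-6, 0.01, 0.01 - 1e-6])

print(kl(q, p_broad))   # ~0.69: extra support in the model is barely penalized
print(kl(q, p_narrow))  # ~6.2 : missing part of q's support is penalized heavily
```

This mirrors the conclusion of the next slides: maximizing −KL(q(x) ‖ pθ(x)) forces pθ to cover the support of the data, but does little to stop it from placing mass elsewhere.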
Support of a Distribution
• Support Sp(z) of distribution p(z) defined as a member of the set
{ S̃p(z) : ∫_{S̃p(z)} p(z) dz = 1 − ε }
with minimum size |S̃p(z)| = ∫_{S̃p(z)} dz
• Typically interested in ε → 0+
• For notational convenience, replace Sp(z)^ε with Sp(z), with the understanding that ε is small
• Also define Sp(z)⁻ as the largest set for which ∫_{Sp(z)⁻} p(z) dz = ε, so that
∫_{Sp(z)} p(z) dz + ∫_{Sp(z)⁻} p(z) dz = 1
Analysis of the KL Divergence
L(θ, φ) = −Eq(x) KL(qφ(z|x) ‖ pθ(z|x)) − KL(q(x) ‖ pθ(x)) + C
= −Eqφ(z) KL(qφ(x|z) ‖ pθ(x|z)) − KL(qφ(z) ‖ p(z)) + C
• We examine the term −KL(q(x) ‖ pθ(x)) in detail, as a representative example
−KL(q(x) ‖ pθ(x)) = Eq(x) log pθ(x) + C
≈ ∫_{Sq(x)} q(x) log pθ(x) dx + C
• We also have
∫_{Sq(x)} q(x) log pθ(x) dx = ∫_{Sq(x) ∩ Spθ(x)} q(x) log pθ(x) dx + ∫_{Sq(x) ∩ Spθ(x)⁻} q(x) log pθ(x) dx
Implications
∫_{Sq(x)} q(x) log pθ(x) dx = ∫_{Sq(x) ∩ Spθ(x)} q(x) log pθ(x) dx + ∫_{Sq(x) ∩ Spθ(x)⁻} q(x) log pθ(x) dx
• If Sq(x) ∩ Spθ(x)⁻ ≠ ∅, then ∫_{Sq(x) ∩ Spθ(x)⁻} q(x) log pθ(x) dx will be large and negative
• Hence, maximizing L(θ, φ) encourages Sq(x) ∩ Spθ(x)⁻ = ∅
• By contrast, there is no strong penalty for Sq(x)⁻ ∩ Spθ(x) ≠ ∅, since ∫_{Sq(x)⁻} q(x) log pθ(x) dx ≈ 0
Summarizing
• Maximization of −KL(q(x) ‖ pθ(x)) implies
Sq(x) ∩ Spθ(x)⁻ = ∅ , Sq(x)⁻ ∩ Spθ(x) ≠ ∅
• Equivalently
Sq(x) ⊂ Spθ(x)
• May also show that maximization of −KL(qφ(z) ‖ p(z)) yields Sqφ(z) ⊂ Sp(z)
• This implies many (most) x ∼ pθ(x) will not look like x ∼ q(x)
• This is a fundamental problem with variational-based learning
Implications of Traditional Variational Learning
[Figure: encoder-decoder diagram. The encoder qφ(z|x) maps data x ∼ q(x) to latent codes and the decoder pθ(x|z) maps z ∼ p(z) back to data space; the depicted supports show qφ(z) nested inside p(z), and q(x) nested inside pθ(x).]
Flip Order of Distributions in KL
• Consider maximization of
−KL(pθ(x) ‖ q(x)) = Epθ(x) log q(x) + h(pθ(x))
• To optimize this term,
Spθ(x) ⊂ Sq(x)
and the subset should be as large as possible, to maximize h(pθ(x))
• May also show that maximization of −KL(p(z) ‖ qφ(z)) yields Sp(z) ⊂ Sqφ(z)
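Continuing the small numerical illustration from earlier (again my own example): with the order of the KL arguments flipped, it is the broad model that is heavily penalized, consistent with the Spθ(x) ⊂ Sq(x) behavior noted above.

```python
# Reusing kl, q, p_broad, p_narrow from the earlier snippet:
print(kl(p_broad, q))   # ~12.8: mass placed where q has (almost) none is now very costly
print(kl(p_narrow, q))  # ~1.1 : much smaller; failing to cover all of q costs nothing in this direction
```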
New Form of the Variational Lower Bound
• Recall original form of the variational lower bound
Lx(θ, φ) = Eq(x) Eqφ(z|x) log [ pθ(x|z) p(z) / qφ(z|x) ]
= −Eq(x) KL(qφ(z|x) ‖ pθ(z|x)) − KL(q(x) ‖ pθ(x)) + Cx
= −Eqφ(z) KL(qφ(x|z) ‖ pθ(x|z)) − KL(qφ(z) ‖ p(z)) + Cx
• Introduce a new form
Lz(θ, φ) = Ep(z) Epθ(x|z) log [ qφ(z|x) q(x) / pθ(x|z) ]
= −Ep(z) KL(pθ(x|z) ‖ qφ(x|z)) − KL(p(z) ‖ qφ(z)) + Cz
= −Epθ(x) KL(pθ(z|x) ‖ qφ(z|x)) − KL(pθ(x) ‖ q(x)) + Cz
where Cx = −h(q(x)), Cz = −h(p(z))
Implications of New Variational Expression
[Figure: encoder-decoder diagram for the new bound. The depicted supports show pθ(x) nested inside q(x), and p(z) nested inside qφ(z).]
Combine Old with New Variational Expression
[Figure: the two encoder-decoder diagrams side by side, one for Lx(θ, φ) (q(x) inside pθ(x), qφ(z) inside p(z)) and one for Lz(θ, φ) (pθ(x) inside q(x), p(z) inside qφ(z)).]
Result of Combined Variational Expressions
[Figure: with the two variational expressions combined, the depicted supports align: pθ(x) matches q(x), and qφ(z) matches p(z).]
Symmetric Variational Representation
• Symmetric variational lower bound:
Lxz(θ, φ) = Lx(θ, φ) + Lz(θ, φ)
= Eq(x) Eqφ(z|x) h(x, z; θ, φ) − Ep(z) Epθ(x|z) h(x, z; θ, φ) + K
where K = Cx + Cz and
h(x, z; θ, φ) = log [ pθ(x|z) p(z) / (qφ(z|x) q(x)) ] = log [ pθ(x, z) / qφ(x, z) ]
• Note that h(x, z; θ, φ) is a log likelihood ratio test (LRT) statistic,
and maximization of Lxz(θ, φ) corresponds to matching the
expectations to the LRT
• Problem: To evaluate h(·) we require q(x), the true data-generating
density, which we lack
Slight Detour - 1/2
• Introduce binary discrete variable b ∈ {0, 1}, and
p(x, z|b = 0) = pθ(x, z)
p(x, z|b = 1) = qφ(x, z)
• Let p(b = 0) = p(b = 1) = 1/2
• The posterior probabilities satisfy
p(b = 0|x, z) = p(x, z|b = 0) p(b = 0) / Σ_{i=0}^{1} p(x, z|b = i) p(b = i) = pθ(x, z) / [ qφ(x, z) + pθ(x, z) ]
and
p(b = 1|x, z) = 1 − p(b = 0|x, z) = qφ(x, z) / [ qφ(x, z) + pθ(x, z) ]
Slight Detour - 2/2
• Let π(b = 0|x, z) ∈ [0, 1] be a function that defines the probability
b = 0 given (x, z)
• Define ˆπ(b = 0|x, z) as
argmaxπ(b=0|x,z) {Epθ(x,z) log π(b = 0|x, z)+Eqφ(x,z) log[1−π(b = 0|x, z)]}
• The solution to this setup is
ˆπ(b = 0|x, z) = pθ(x, z) / [ qφ(x, z) + pθ(x, z) ]
ˆπ(b = 1|x, z) = 1 − ˆπ(b = 0|x, z) = qφ(x, z) / [ qφ(x, z) + pθ(x, z) ]
• (For a fixed (x, z), setting the derivative of pθ(x, z) log π + qφ(x, z) log(1 − π) with respect to π to zero gives pθ/π = qφ/(1 − π), i.e., π = pθ/(pθ + qφ))
Inferring Log Ratio from Synthesized Samples
• Consider the cost function
g(ψ; θ, φ) = Epθ(x,z) log σ[hψ(x, z; θ, φ)] + Eqφ(x,z) log[1 − σ(hψ(x, z; θ, φ))]
where σ(·) is the logistic function and hψ(x, z; θ, φ) is a deep neural
network with parameters ψ, with input (x, z) and scalar output
• For fixed (θ, φ), the parameters ψ∗ that maximize g(ψ; θ, φ) satisfy
hψ∗(x, z; θ, φ) = log [ pθ(x, z) / qφ(x, z) ]
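A hedged sketch of this estimator (my own illustration; names, sizes, and architecture are assumptions): hψ(x, z) is a network with scalar output, trained on samples from pθ(x, z) and qφ(x, z) with the logistic objective g(ψ; θ, φ) above, so that at its optimum the output approximates log pθ(x, z) − log qφ(x, z).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogRatioNet(nn.Module):
    """h_psi(x, z): scalar output; at the optimum of g it approximates log p_theta(x,z) - log q_phi(x,z)."""
    def __init__(self, x_dim=784, z_dim=64, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def g_objective(h, x_p, z_p, x_q, z_q):
    """g(psi) = E_{p_theta(x,z)} log sigma(h(x,z)) + E_{q_phi(x,z)} log(1 - sigma(h(x,z))),
    estimated from samples (x_p, z_p) ~ p_theta(x,z) and (x_q, z_q) ~ q_phi(x,z)."""
    return (F.logsigmoid(h(x_p, z_p)).mean()
            + F.logsigmoid(-h(x_q, z_q)).mean())   # log(1 - sigma(t)) = logsigmoid(-t)
```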
Algorithm Summary for Symmetric Variational Learning
(θi+1, φi+1) = argmax(θ,φ) Eqφ(x,z) hψi(x, z) − Epθ(x,z) hψi(x, z)
ψi+1 = argmaxψ Epθi+1(x,z) log σ(hψ(x, z)) + Eqφi+1(x,z) log(1 − σ(hψ(x, z)))
• Expectations performed approximately via sampling:
z ∼ p(z), x = fθ(z)
x ∼ q(x), δ ∼ N(0, I), z = gφ(x, δ)
• Framework composed of three deep neural networks: fθ(z), gφ(x, δ), and hψ(x, z)
• Have derived a generative adversarial network (GAN) setup from first principles, by symmetrizing a variational lower bound
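A minimal sketch of one iteration of these alternating updates, reusing the illustrative networks from the earlier snippets (optimizers, batching, and all hyperparameters are assumptions for illustration, not details from the talk):

```python
import torch

def training_step(x_real, decoder, encoder, h, opt_gen, opt_h, z_dim=64):
    # Samples from p_theta(x, z): z ~ p(z), x = f_theta(z)
    z_prior = torch.randn(x_real.shape[0], z_dim, device=x_real.device)
    x_fake = decoder(z_prior)
    # Samples from q_phi(x, z): x ~ q(x) (the data), z = g_phi(x, delta)
    z_inferred = encoder(x_real)

    # 1) Update (theta, phi): maximize E_{q_phi(x,z)} h_psi - E_{p_theta(x,z)} h_psi
    gen_obj = h(x_real, z_inferred).mean() - h(x_fake, z_prior).mean()
    opt_gen.zero_grad(); (-gen_obj).backward(); opt_gen.step()

    # 2) Update psi: refit the log-ratio (LRT) network on fresh, detached samples
    z_prior = torch.randn(x_real.shape[0], z_dim, device=x_real.device)
    x_fake = decoder(z_prior).detach()
    z_inferred = encoder(x_real).detach()
    h_obj = g_objective(h, x_fake, z_prior, x_real, z_inferred)
    opt_h.zero_grad(); (-h_obj).backward(); opt_h.step()
```

In such a setup one would typically use two separate optimizers, one over the decoder and encoder parameters (θ, φ) and one over ψ.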
GAN-Like Setup
(θi+1, φi+1) = argmax(θ,φ) Eqφ(x,z) hψi(x, z) − Epθ(x,z) hψi(x, z)
• Update generative model parameters (θ, φ) to best “fool” the likelihood ratio test (LRT) statistic hψi(x, z)
ψi+1 = argmaxψ Epθi+1(x,z) log σ(hψ(x, z)) + Eqφi+1(x,z) log(1 − σ(hψ(x, z)))
• Given the new generative model parameters, update the LRT statistic to best distinguish between the two types of generative models
• “Adversarial game” between the LRT and the generative model, derived as a natural outcome of symmetrizing the variational expression
Synthesized Images: Training on MNIST
Synthesized Images: Training on ImageNet
Summary
• Have modeled data as being drawn with latent variable z ∼ p(z), with z then fed through a neural network yielding x = fθ(z)
• Given x, perform inference for the latent variable using z = gφ(x, δ), δ ∼ N(0, I)
• Learn NN parameters θ and φ via the symmetric variational expression
• In the context of inference, learn z = gφ(x, δ) as a means to draw samples of the latent variables
• Excellent synthesis of realistic data, and also an effective tool for inference
• Learning constitutes a generalization of generative adversarial networks (GANs)