QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Variable Learning & Inference w/ Deep Generative Neural Networks - Lawrence Carin, Dec 11, 2017

Variational Learning and Inference
with Deep Generative Neural Networks
Lawrence Carin
Duke University
11 December 2017

Model Development
• We are often interested in learning a model of the form
x ∼ pθ(x|z), z ∼ p(z)
where θ are unknown model parameters, and z are latent variables
drawn from known prior p(z)
• Model parameters θ are fixed for all data x
• Variation in x accounted for via variation of z, representing latent
processes

Example: ImageNet 1.2 Million Images
x ∼ pθ(x|z) with each z ∼ p(z) corresponding to an image
Questions: What’s the right model pθ(x|z), and how to determine θ?

Maximum Likelihood Learning
• Let q(x) represent the true, unknown distribution of the data
• Seek θ for which pθ(x) accurately models q(x)
• Maximum likelihood (ML) learning:
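$$\hat{\theta} = \arg\max_{\theta}\, \mathbb{E}_{q(x)} \log p_\theta(x) \approx \arg\max_{\theta}\, \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$$
where $\{x_i\}_{i=1,N}$ are the observed data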
• Problem: $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$ is typically intractable to compute
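To make the integral concrete, here is a minimal sketch under a toy assumption that is not from the talk: z ∼ N(0, 1) and pθ(x|z) = N(x; θz, s²) in one dimension, for which pθ(x) = N(x; 0, θ² + s²) is known in closed form. The names normal_pdf, mc_marginal, theta, and noise_std are hypothetical. Plain Monte Carlo averaging over prior samples works in this setting; for high-dimensional x and z, most prior samples typically contribute almost nothing to the average, which is one reason the integral is treated as intractable.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at x."""
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

def mc_marginal(x, theta, noise_std, num_samples, rng):
    """Estimate p_theta(x) = E_{p(z)} p_theta(x|z) by sampling z ~ N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        # Toy likelihood (assumption): x | z ~ N(theta * z, noise_std^2).
        total += normal_pdf(x, theta * z, noise_std)
    return total / num_samples

rng = random.Random(0)
theta, noise_std, x = 2.0, 0.5, 1.3
estimate = mc_marginal(x, theta, noise_std, 20000, rng)
# Closed form for this toy model: x ~ N(0, theta^2 + noise_std^2).
exact = normal_pdf(x, 0.0, math.sqrt(theta ** 2 + noise_std ** 2))
print(estimate, exact)
```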

Form of the Approximating Distributions
• We typically use
pθ(x|z) = δ(x − fθ(z)),
with fθ(z) a deterministic function
• Randomness in pθ(x) manifested by latent variable z ∼ p(z)
• We do not assume an explicit form for qφ(z|x); instead we simply build a
model to sample from this distribution
z = gφ(x, δ) , δ ∼ N(0, I)
• Here we employ deep neural networks for fθ(z) and gφ(x, δ)

Summarizing Model
• Generative process for data x
z ∼ p(z)
x(z) = fθ(z)
• Generative process for latent code z given x
δ ∼ N(0, I)
z = gφ(x, δ)
• fθ(z) and gφ(x, δ) are learned deep neural networks
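The two sampling paths can be sketched as follows. This is an illustration only: f_theta and g_phi here are hypothetical one-dimensional stand-ins for the deep networks, and the parameter values are arbitrary.

```python
import math
import random

def f_theta(z, theta):
    """Toy decoder (assumption): deterministic map from latent z to data x."""
    w, b = theta
    return math.tanh(w * z + b)

def g_phi(x, delta, phi):
    """Toy encoder (assumption): maps data x and noise delta to a latent sample z."""
    a, c, s = phi
    return a * x + c + s * delta

rng = random.Random(0)
theta = (1.5, 0.1)      # arbitrary decoder parameters
phi = (0.8, 0.0, 0.3)   # arbitrary encoder parameters

# Generative process for data: z ~ p(z), x = f_theta(z).
z = rng.gauss(0.0, 1.0)
x = f_theta(z, theta)

# Generative process for the latent code given x: delta ~ N(0, 1), z = g_phi(x, delta).
# Two different noise draws give two different samples of z for the same x,
# so g_phi defines a distribution over z without an explicit density.
z_a = g_phi(x, rng.gauss(0.0, 1.0), phi)
z_b = g_phi(x, rng.gauss(0.0, 1.0), phi)
print(x, z_a, z_b)
```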

Variational Autoencoder
• The distribution pθ(x|z) is termed a decoder, and qφ(z|x) an encoder

Cumulative Marginal Distributions
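• We previously defined
$$p_\theta(x) = \mathbb{E}_{p(z)}\, p_\theta(x|z)$$
• We now similarly define
$$q_\phi(z) = \mathbb{E}_{q(x)}\, q_\phi(z|x)$$
• qφ(z) represents the cumulative distribution for latent variables z,
across all x ∼ q(x)
• Easily shown, by re-expressing $\mathrm{KL}(q_\phi(x,z)\,\|\,p_\theta(x,z))$, that
$$\begin{aligned}
L(\theta,\phi) &= -\mathbb{E}_{q(x)} \mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x)) - \mathrm{KL}(q(x)\,\|\,p_\theta(x)) + C \\
&= -\mathbb{E}_{q_\phi(z)} \mathrm{KL}(q_\phi(x|z)\,\|\,p_\theta(x|z)) - \mathrm{KL}(q_\phi(z)\,\|\,p(z)) + C
\end{aligned}$$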
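A sketch of the re-expression, using only the two factorizations of each joint distribution, qφ(x, z) = q(x) qφ(z|x) = qφ(z) qφ(x|z) and pθ(x, z) = pθ(x) pθ(z|x) = p(z) pθ(x|z):

```latex
\begin{aligned}
-\mathrm{KL}\big(q_\phi(x,z)\,\|\,p_\theta(x,z)\big)
 &= -\mathbb{E}_{q(x)}\mathbb{E}_{q_\phi(z|x)}
    \left[\log\frac{q(x)}{p_\theta(x)} + \log\frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\
 &= -\mathrm{KL}\big(q(x)\,\|\,p_\theta(x)\big)
    - \mathbb{E}_{q(x)}\mathrm{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big), \\[4pt]
-\mathrm{KL}\big(q_\phi(x,z)\,\|\,p_\theta(x,z)\big)
 &= -\mathbb{E}_{q_\phi(z)}\mathbb{E}_{q_\phi(x|z)}
    \left[\log\frac{q_\phi(z)}{p(z)} + \log\frac{q_\phi(x|z)}{p_\theta(x|z)}\right] \\
 &= -\mathrm{KL}\big(q_\phi(z)\,\|\,p(z)\big)
    - \mathbb{E}_{q_\phi(z)}\mathrm{KL}\big(q_\phi(x|z)\,\|\,p_\theta(x|z)\big).
\end{aligned}
```

Assuming L(θ, φ) is the usual variational lower bound averaged over q(x), namely E_{q(x)} E_{qφ(z|x)} log[pθ(x|z) p(z) / qφ(z|x)] (its definition does not appear in this section), it equals −KL(qφ(x, z)‖pθ(x, z)) + C with C = E_{q(x)} log q(x), which does not depend on (θ, φ).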

Examination of the Variational Lower Bound
• First form encourages pθ(x) to be close to true data distribution q(x)
• Second form encourages qφ(z) to be close to the prior p(z)
• Also encourages matching of conditional distributions
• It looks good, but in reality it’s not
• Culprit: The KL divergence is asymmetric

Support of a Distribution
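• Support $S^{\epsilon}_{p(z)}$ of distribution p(z) defined as the member of the set
$$\Big\{ \tilde{S}^{\epsilon}_{p(z)} : \int_{\tilde{S}^{\epsilon}_{p(z)}} p(z)\,dz = 1 - \epsilon \Big\}$$
with minimum size $\|\tilde{S}^{\epsilon}_{p(z)}\| = \int_{\tilde{S}^{\epsilon}_{p(z)}} dz$
• Typically interested in $\epsilon \to 0^{+}$
• For notational convenience, replace $S^{\epsilon}_{p(z)}$ with $S_{p(z)}$, with the
understanding that ε is small
• Also define $S_{p(z)^{-}}$ as the largest set for which $\int_{S_{p(z)^{-}}} p(z)\,dz = \epsilon$, so that
$$\int_{S_{p(z)}} p(z)\,dz + \int_{S_{p(z)^{-}}} p(z)\,dz = 1$$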

Analysis of the KL Divergence
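• We examine the term −KL(q(x)‖pθ(x)) in detail, as a representative
example
$$-\mathrm{KL}(q(x)\,\|\,p_\theta(x)) = \mathbb{E}_{q(x)} \log p_\theta(x) + C \approx \int_{S_{q(x)}} q(x) \log p_\theta(x)\,dx + C$$
• We also have
$$\int_{S_{q(x)}} q(x) \log p_\theta(x)\,dx = \int_{S_{q(x)} \cap S_{p_\theta(x)}} q(x) \log p_\theta(x)\,dx + \int_{S_{q(x)} \cap S_{p_\theta(x)^{-}}} q(x) \log p_\theta(x)\,dx$$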

Implications
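$$\int_{S_{q(x)}} q(x) \log p_\theta(x)\,dx = \int_{S_{q(x)} \cap S_{p_\theta(x)}} q(x) \log p_\theta(x)\,dx + \int_{S_{q(x)} \cap S_{p_\theta(x)^{-}}} q(x) \log p_\theta(x)\,dx$$
• If $S_{q(x)} \cap S_{p_\theta(x)^{-}} \neq \emptyset$, then $\int_{S_{q(x)} \cap S_{p_\theta(x)^{-}}} q(x) \log p_\theta(x)\,dx$ will be large and
negative
• Hence, maximizing L(θ, φ) encourages $S_{q(x)} \cap S_{p_\theta(x)^{-}} = \emptyset$
• By contrast, there is no strong penalty for $S_{q(x)^{-}} \cap S_{p_\theta(x)} \neq \emptyset$, since
$\int_{S_{q(x)^{-}}} q(x) \log p_\theta(x)\,dx \approx 0$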

Summarizing
• Maximization of −KL(q(x)‖pθ(x)) implies
$$S_{q(x)} \cap S_{p_\theta(x)^{-}} = \emptyset, \qquad S_{q(x)^{-}} \cap S_{p_\theta(x)} \neq \emptyset$$
• Equivalently
$$S_{q(x)} \subset S_{p_\theta(x)}$$
• May also show that maximization of −KL(qφ(z)‖p(z)) yields
$$S_{q_\phi(z)} \subset S_{p(z)}$$
• This implies many (most) x ∼ pθ(x) will not look like x ∼ q(x)
• This is a fundamental problem with variational-based learning
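The asymmetry can be checked numerically. A minimal sketch, assuming toy discrete distributions on four states (not from the talk); the helper kl and the floor value 1e-12, which stands in for the small mass ε outside a support, are hypothetical choices:

```python
import math

def kl(a, b, floor=1e-12):
    """KL(a || b) for discrete distributions given as equal-length lists.

    Zero entries of b are floored so the penalty is large but finite.
    """
    return sum(ai * math.log(ai / max(bi, floor)) for ai, bi in zip(a, b) if ai > 0)

q = [0.5, 0.5, 0.0, 0.0]            # "data" distribution with narrow support
p_wide = [0.25, 0.25, 0.25, 0.25]   # model whose support strictly contains S_q
p_narrow = [1.0, 0.0, 0.0, 0.0]     # model whose support misses part of S_q

kl_q_wide = kl(q, p_wide)       # mild penalty: model mass outside S_q is tolerated
kl_q_narrow = kl(q, p_narrow)   # large penalty: data outside the model support
kl_wide_q = kl(p_wide, q)       # flipped order: model mass outside S_q is penalized

print(kl_q_wide, kl_q_narrow, kl_wide_q)
```

With KL(q‖p), the wide model pays only a mild price even though half of its samples fall where q has no mass, which is the behavior summarized in the bullets above.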

Implications of Traditional Variational Learning
[Figure: encoder qφ(z|x) and decoder pθ(x|z), with distributions q(x), qφ(z), p(z), and pθ(x)]

Flip Order of Distributions in KL
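• Consider maximization of
$$-\mathrm{KL}(p_\theta(x)\,\|\,q(x)) = \mathbb{E}_{p_\theta(x)} \log q(x) + h(p_\theta(x))$$
where h(pθ(x)) is the entropy of pθ(x)
• To optimize this term,
$$S_{p_\theta(x)} \subset S_{q(x)}$$
and the subset should be as large as possible, to maximize h(pθ(x))
• May also show that maximization of −KL(p(z)‖qφ(z)) yields
$$S_{p(z)} \subset S_{q_\phi(z)}$$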

Implications of New Variational Expression
[Figure: encoder qφ(z|x) and decoder pθ(x|z), with distributions q(x), qφ(z), p(z), and pθ(x)]

Combine Old with New Variational Expression
[Figure: two encoder/decoder panels, labeled Lx(θ, φ) and Lz(θ, φ), each showing qφ(z|x), pθ(x|z), q(x), qφ(z), p(z), and pθ(x)]

Result of Combined Variational Expressions
[Figure: encoder qφ(z|x) and decoder pθ(x|z), with distributions q(x), qφ(z), p(z), and pθ(x)]

Symmetric Variational Representation
• Symmetric variational lower bound:
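$$\begin{aligned}
L_{xz}(\theta,\phi) &= L_x(\theta,\phi) + L_z(\theta,\phi) \\
&= \mathbb{E}_{q(x)}\mathbb{E}_{q_\phi(z|x)}\, h(x,z;\theta,\phi) - \mathbb{E}_{p(z)}\mathbb{E}_{p_\theta(x|z)}\, h(x,z;\theta,\phi) + K
\end{aligned}$$
where $K = C_x + C_z$ and
$$h(x,z;\theta,\phi) = \log\frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)\,q(x)} = \log\frac{p_\theta(x,z)}{q_\phi(x,z)}$$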
• Note that h(x, z; θ, φ) is a log likelihood ratio test (LRT) statistic,
and maximization of Lxz(θ, φ) corresponds to matching the
expectations to the LRT
• Problem: To evaluate h(·) we require q(x), the true data-generating
density, which we lack

Slight Detour - 1/2
• Introduce binary discrete variable b ∈ {0, 1}, and
p(x, z|b = 0) = pθ(x, z)
p(x, z|b = 1) = qφ(x, z)
• Let p(b = 0) = p(b = 1) = 1/2
• The posterior probabilities satisfy
$$p(b=0|x,z) = \frac{p(x,z|b=0)\,p(b=0)}{\sum_{i=0}^{1} p(x,z|b=i)\,p(b=i)} = \frac{p_\theta(x,z)}{q_\phi(x,z) + p_\theta(x,z)}$$
and
$$p(b=1|x,z) = 1 - p(b=0|x,z) = \frac{q_\phi(x,z)}{q_\phi(x,z) + p_\theta(x,z)}$$

Slight Detour - 2/2
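• Let π(b = 0|x, z) ∈ [0, 1] be a function that defines the probability of
b = 0 given (x, z)
• Define π̂(b = 0|x, z) as
$$\hat{\pi}(b=0|x,z) = \arg\max_{\pi(b=0|x,z)} \big\{ \mathbb{E}_{p_\theta(x,z)} \log \pi(b=0|x,z) + \mathbb{E}_{q_\phi(x,z)} \log[1 - \pi(b=0|x,z)] \big\}$$
• The solution to this setup is
$$\hat{\pi}(b=0|x,z) = \frac{p_\theta(x,z)}{q_\phi(x,z) + p_\theta(x,z)}$$
$$\hat{\pi}(b=1|x,z) = 1 - \hat{\pi}(b=0|x,z) = \frac{q_\phi(x,z)}{q_\phi(x,z) + p_\theta(x,z)}$$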
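The stated solution can be checked numerically on a small discrete example. This is a sketch under toy assumptions: p and q below are arbitrary four-state stand-ins for pθ(x, z) and qφ(x, z), and objective is a hypothetical helper implementing the discrete analogue of the expression being maximized.

```python
import math

def objective(pi, p, q):
    """Discrete analogue of E_p log pi(b=0|.) + E_q log[1 - pi(b=0|.)]."""
    return sum(pj * math.log(t) + qj * math.log(1.0 - t)
               for t, pj, qj in zip(pi, p, q))

p = [0.1, 0.2, 0.3, 0.4]   # stand-in for p_theta(x, z) on four cells
q = [0.4, 0.3, 0.2, 0.1]   # stand-in for q_phi(x, z) on the same cells

# Candidate maximizer from the slide: pi_hat = p / (p + q), cell by cell.
pi_star = [pj / (pj + qj) for pj, qj in zip(p, q)]
best = objective(pi_star, p, q)

# Perturbing any single cell in either direction should lower the objective.
perturbed_is_worse = True
for j in range(len(p)):
    for eps in (-0.05, 0.05):
        trial = list(pi_star)
        trial[j] += eps
        if objective(trial, p, q) >= best:
            perturbed_is_worse = False

# The logit of pi_hat equals the log ratio log(p / q).
logits = [math.log(t / (1.0 - t)) for t in pi_star]
log_ratios = [math.log(pj / qj) for pj, qj in zip(p, q)]
print(perturbed_is_worse, logits, log_ratios)
```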

Inferring Log Ratio from Synthesized Samples
• Consider the cost function
$$g(\psi;\theta,\phi) = \mathbb{E}_{p_\theta(x,z)} \log \sigma[h_\psi(x,z;\theta,\phi)] + \mathbb{E}_{q_\phi(x,z)} \log[1 - \sigma(h_\psi(x,z;\theta,\phi))]$$
where σ(·) is the logistic function and hψ(x, z; θ, φ) is a deep neural
network with parameters ψ, with input (x, z) and scalar output
• For fixed (θ, φ), the parameters ψ* that maximize g(ψ; θ, φ) satisfy
$$h_{\psi^*}(x,z;\theta,\phi) = \log\frac{p_\theta(x,z)}{q_\phi(x,z)}$$
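A minimal sketch of this idea in one dimension, under assumptions that are not from the talk: samples from N(1, 1) stand in for pθ, samples from N(−1, 1) stand in for qφ, and the critic is a hypothetical linear function h(x) = w·x + c instead of a deep network. For these two Gaussians the true log ratio is 2x, so maximizing the cost function by gradient ascent should drive w toward 2 and c toward 0, up to sampling error.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

rng = random.Random(0)
n = 2000
p_samples = [rng.gauss(1.0, 1.0) for _ in range(n)]    # stand-in for p_theta
q_samples = [rng.gauss(-1.0, 1.0) for _ in range(n)]   # stand-in for q_phi

# Hypothetical critic h_psi(x) = w * x + c, with psi = (w, c).
w, c = 0.0, 0.0
lr = 0.5
for _ in range(200):
    # Gradient of g = mean_p log sigma(h) + mean_q log(1 - sigma(h)).
    gw = 0.0
    gc = 0.0
    for x in p_samples:
        r = 1.0 - sigmoid(w * x + c)   # d/dh log sigma(h)
        gw += r * x / n
        gc += r / n
    for x in q_samples:
        r = sigmoid(w * x + c)         # -d/dh log(1 - sigma(h))
        gw -= r * x / n
        gc -= r / n
    w += lr * gw                       # gradient ascent step
    c += lr * gc

# True log ratio: log N(x; 1, 1) - log N(x; -1, 1) = 2x.
print(w, c)
```

The critic only ever sees samples from the two distributions, never their densities, which is what makes the approach usable when q(x) is unknown.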

Algorithm Summary for Symmetric Variational Learning
$$(\theta_{i+1},\phi_{i+1}) = \arg\max_{(\theta,\phi)}\; \mathbb{E}_{q_\phi(x,z)}\, h_{\psi_i}(x,z) - \mathbb{E}_{p_\theta(x,z)}\, h_{\psi_i}(x,z)$$
$$\psi_{i+1} = \arg\max_{\psi}\; \mathbb{E}_{p_{\theta_{i+1}}(x,z)} \log \sigma(h_\psi(x,z)) + \mathbb{E}_{q_{\phi_{i+1}}(x,z)} \log(1 - \sigma(h_\psi(x,z)))$$
• Expectations performed approximately via sampling:
z ∼ p(z), x = fθ(z)
x ∼ q(x), δ ∼ N(0, I), z = gφ(x, δ)
• Framework composed of three deep neural networks: fθ(z),
gφ(x, δ), and hψ(x, z)
• Have derived a generative adversarial network (GAN) setup from
first principles, by symmetrizing a variational lower bound
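The sampling-based expectations can be sketched as below. This is an illustration under toy assumptions, not the talk's implementation: f_theta, g_phi, and h_psi are hypothetical one-dimensional stand-ins for the three deep networks, data is a synthetic list standing in for draws from q(x), and only the Monte Carlo estimate of the (θ, φ) objective for a fixed critic is shown, with no optimization step.

```python
import random

def f_theta(z, theta):
    """Toy decoder x = f_theta(z) (assumption; a deep network in the talk)."""
    return theta[0] * z + theta[1]

def g_phi(x, delta, phi):
    """Toy encoder z = g_phi(x, delta) (assumption; a deep network in the talk)."""
    return phi[0] * x + phi[1] + phi[2] * delta

def h_psi(x, z, psi):
    """Toy critic with scalar output (assumption; a deep network in the talk)."""
    return psi[0] * x + psi[1] * z + psi[2] * x * z + psi[3]

def sample_p_joint(theta, rng):
    """Draw (x, z) from p_theta(x, z): z ~ p(z) = N(0, 1), x = f_theta(z)."""
    z = rng.gauss(0.0, 1.0)
    return f_theta(z, theta), z

def sample_q_joint(data, phi, rng):
    """Draw (x, z) from q_phi(x, z): x from the data, delta ~ N(0, 1), z = g_phi(x, delta)."""
    x = rng.choice(data)
    delta = rng.gauss(0.0, 1.0)
    return x, g_phi(x, delta, phi)

def generator_objective(data, theta, phi, psi, num_samples, rng):
    """Monte Carlo estimate of E_q h_psi(x, z) - E_p h_psi(x, z)."""
    q_term = 0.0
    p_term = 0.0
    for _ in range(num_samples):
        xq, zq = sample_q_joint(data, phi, rng)
        xp, zp = sample_p_joint(theta, rng)
        q_term += h_psi(xq, zq, psi)
        p_term += h_psi(xp, zp, psi)
    return (q_term - p_term) / num_samples

rng = random.Random(0)
data = [rng.gauss(3.0, 0.5) for _ in range(500)]   # synthetic stand-in for q(x)
theta, phi = (1.0, 0.0), (0.5, 0.0, 0.1)
value = generator_objective(data, theta, phi, (0.3, -0.2, 0.1, 0.0), 1000, rng)
# A critic that outputs zero everywhere cannot separate the two joints.
zero_critic = generator_objective(data, theta, phi, (0.0, 0.0, 0.0, 0.0), 1000, rng)
print(value, zero_critic)
```

In a full implementation this estimate would be differentiated with respect to (θ, φ) through the samples, alternating with updates of ψ as in the two argmax steps above.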

GAN-Like Setup
$$(\theta_{i+1},\phi_{i+1}) = \arg\max_{(\theta,\phi)}\; \mathbb{E}_{q_\phi(x,z)}\, h_{\psi_i}(x,z) - \mathbb{E}_{p_\theta(x,z)}\, h_{\psi_i}(x,z)$$
• Update generative model parameters (θ, φ) to best “fool” the
likelihood ratio test (LRT) statistic $h_{\psi_i}(x,z)$
$$\psi_{i+1} = \arg\max_{\psi}\; \mathbb{E}_{p_{\theta_{i+1}}(x,z)} \log \sigma(h_\psi(x,z)) + \mathbb{E}_{q_{\phi_{i+1}}(x,z)} \log(1 - \sigma(h_\psi(x,z)))$$
• Given new generative model parameters, update the LRT statistic
to best distinguish between the two types of generative models
• “Adversarial game” between the LRT and the generative model, derived
as a natural outcome of symmetrizing the variational expression

Synthesized Images: Training on MNIST

Synthesized Images: Training on ImageNet

Summary
• Have modeled data as being drawn with latent variable z ∼ p(z), with
z then fed through neural network yielding x = fθ(z)
• Given x, perform inference for latent variable using z = gφ(x, δ),
δ ∼ N(0, I)
• Learn NN parameters θ and φ via symmetric variational expression
• In the context of inference, learn z = gφ(x, δ) as a means to draw
samples for latent variables
• Excellent synthesis of realistic data, and also effective tool for inference
• Learning constitutes a generalization of generative adversarial networks
(GANs)