QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Variable Learning & Inference w/ Deep Generative Neural Networks - Lawrence Carin, Dec 11, 2017
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
Similar to QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Variable Learning & Inference w/ Deep Generative Neural Networks - Lawrence Carin, Dec 11, 2017
Similar to QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Variable Learning & Inference w/ Deep Generative Neural Networks - Lawrence Carin, Dec 11, 2017 (20)
Unit-IV; Professional Sales Representative (PSR).pptx
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Variable Learning & Inference w/ Deep Generative Neural Networks - Lawrence Carin, Dec 11, 2017
1. Variational Learning and Inference
with Deep Generative Neural Networks
Lawrence Carin
Duke University
11 December 2017
1
2. Model Development
• We are often interested in learning a model of the form
x ∼ pθ(x|z), z ∼ p(z)
where θ are unknown model parameters, and z are latent variables
drawn from known prior p(z)
• Model parameters θ are fixed for all data x
• Variation in x accounted for via variation of z, representing latent
processes
1
3. Example: ImageNet 1.2 Million Images
x ∼ pθ(x|z) with each z ∼ p(z) corresponding to an image
Questions: What’s the right model pθ(x|z), and how to determine θ?
2
4. Maximum Likelihood Learning
• Let q(x) represent the true, unknown distribution of the data
• Seek θ for which pθ(x) accurately models q(x)
• Maximum likelihood (ML) learning:
ˆθ = argmaxθ Eq(x) log pθ(x) ≈
1
N
N
i=1
log pθ(xi)
where {xi}i=1,N are the observed data
• Problem: pθ(x) = pθ(x|z)p(z)dz typically intractable to compute
3
5. Variational Approximation
• Let qφ(z|x) be a parametric approximation to
pθ(z|x) =
pθ(x|z)p(z)
pθ(x|z)p(z)dz
• Consider the variational expression
L(θ, φ) = Eq(x)Eqφ(z|x) log
pθ(x|z)p(z)
qφ(z|x)
= Eq(x)[log pθ(x) − KL(qφ(z|x) pθ(z|x)]
≤ Eq(x) log pθ(x)
• Alternate between θ and φ to maximize
L(θ, φ) ≈
1
N
N
i=1
Eqφ(zi|xi) log
pθ(xi|zi)p(zi)
qφ(zi|xi)
4
6. Form of the Approximating Distributions
• We typically use
pθ(x|z) = δ(x − fθ(z)),
with fθ(z) a deterministic function
• Randomness in pθ(x) manifested by latent variable z ∼ p(z)
• We do not assume an explicit form for qφ(z|x), we simply build a
model to sample from this distribution
z = gφ(x, δ) , δ ∼ N(0, I)
• Here employ deep neural networks for fθ(z) and gφ(x, δ)
5
7. Summarizing Model
• Generative process for data x
z ∼ p(z)
x(z) = fθ(z)
• Generative process for latent code z given x
δ ∼ N(0, I)
z = gφ(x, δ)
• fθ(z) and gφ(x, δ) learned deep neural networks
6
9. Forms of the Variational Lower Bound
L(θ, φ) = Eq(x)Eqφ(z|x) log
pθ(x|z)p(z)
qφ(z|x)
= Eq(x) log pθ(x) − Eq(x)KL(qφ(z|x) pθ(z|x))
• Maximizing L(θ, φ): minimizing the expected distance
Eq(x)KL(pθ(z|x) qφ(z|x)) between the true and approximate
posterior
• May also be expressed
L(θ, φ) = −KL(qφ(x, z) pθ(x, z)) + C
where qφ(x, z) = q(x)qφ(z|x), pθ(x, z) = p(z)pθ(x|z), and
C = Eq(x) log q(x)
8
10. Cumulative Marginal Distributions
• We previously defined
pθ(x) = Ep(z)pθ(x|z)
• We now similarly define
qφ(z) = Eq(x)qφ(z|x)
• qφ(z) represents the cumulative distribution for latent variables z,
across all x ∼ q(x)
• Easily shown that, by re-expressing KL(qφ(x, z) pθ(x, z)):
L(θ, φ) = −Eq(x)KL(qφ(z|x) pθ(z|x)) − KL(q(x) pθ(x)) + C
= −Eqφ(z)KL(qφ(x|z) pθ(x|z)) − KL(qφ(z) p(z)) + C
9
11. Examination of the Variational Lower Bound
L(θ, φ) = −Eq(x)KL(qφ(z|x) pθ(z|x)) − KL(q(x) pθ(x)) + C
= −Eqφ(z)KL(qφ(x|z) pθ(x|z)) − KL(qφ(z) p(z)) + C
• First form encourages pθ(x) to be close to true data distribution q(x)
• Second form encourages that qφ(z) to be close to the prior p(z)
• Also encourages matching of conditional distributions
• It looks good, but in reality it’s not
• Culprit: The KL divergence is asymmetric
10
12. Support of a Distribution
• Support Sp(z) of distribution p(z) defined as member of the set
{ ˜Sp(z) :
˜Sp(z)
p(z)dz = 1 − }
with minimum size ˜Sp(z) =
˜Sp(z)
dz
• Typically interested in → 0+
• For notational convenience, replace Sp(z) with Sp(z), with
understanding is small
• Also define Sp(z)−
as largest set for which
Sp(z)−
p(z)dz =
Sp(z)
p(z)dz +
Sp(z)−
p(z)dz = 1
11
13. Analysis of the KL Divergence
L(θ, φ) = −Eq(x)KL(qφ(z|x) pθ(z|x)) − KL(q(x) pθ(x)) + C
= −Eqφ(z)KL(qφ(x|z) pθ(x|z)) − KL(qφ(z) p(z)) + C
• We examine the term −KL(q(x) pθ(x)) in detail, as representative
example
−KL(q(x) pθ(x)) = Eq(x) log pθ(x) + C
≈
Sq(x)
q(x) log pθ(x)dx + C
• We also have
Sq(x)
q(x) log pθ(x) =
Sq(x)∩Spθ(x)
q(x) log pθ(x)dx+
Sq(x)∩Spθ(x)−
q(x) log pθ(x)dx
12
14. Implications
Sq(x)
q(x) log pθ(x) =
Sq(x)∩Spθ(x)
q(x) log pθ(x)dx +
Sq(x)∩Spθ(x)−
q(x) log pθ(x)dx
• If Sq(x) ∩ Spθ(x)−
= ∅, then
Sq(x)∩Spθ(x)−
q(x) log pθ(x)dx will be large
negative
• Hence, maximizing L(θ, φ) encourages Sq(x) ∩ Spθ(x)−
= ∅
• By contrast, no strong penalty for Sq(x)−
∩ Spθ(x) = ∅, since
Sq(x)−
q(x) log pθ(x) ≈ 0
13
15. Summarizing
• Maximization of −KL(q(x) pθ(x)) implies
Sq(x) ∩ Spθ(x)−
= ∅ , Sq(x)−
∩ Spθ(x) = ∅
• Equivalently
Sq(x) ⊂ Spθ(x)
• May also show that maximization of −KL(qφ(z) p(z)) yields
Sqφ(z) ⊂ Sp(z)
• This implies many (most) x ∼ pθ(x) will not look like x ∼ q(x)
• This is a fundamental problem with variational-based learning
14
16. Implications of Traditional Variational Learning
𝑝"(𝑥|𝑧)
𝑞)(𝑧|𝑥)
𝑝(𝑧)
𝑞(𝑥)
Encoder Decoder
𝑞(𝑥)
𝑞)(𝑧)
𝑝(𝑧)
𝑝"(𝑥)
15
17. Flip Order of Distributions in KL
• Consider maximization of
−KL(pθ(x) q(x)) = Epθ(x) log q(x) + h(pθ(x))
• To optimize this term,
Spθ(x) ⊂ Sq(x)
and the subset should be as large as possible, to maximize h(pθ(x))
• May also show that maximization of −KL(p(z) qφ(z)) yields
Sp(z) ⊂ Sqφ(z)
16
18. New Form of the Variational Lower Bound
• Recall original form of the variational lower bound
Lx(θ, φ) = Eq(x)Eqφ(z|x) log
pθ(x|z)p(z)
qφ(z|x)
= −Eq(x)KL(qφ(z|x) pθ(z|x)) − KL(q(x) pθ(x)) + Cx
= −Eqφ(z)KL(qφ(x|z) pθ(x|z)) − KL(qφ(z) p(z)) + Cx
• Introduce a new form
Lz(θ, φ) = Ep(z)Epθ(x|z) log
qφ(z|x)q(x)
pθ(x|z)
= −Ep(z)KL(pθ(x|z) qφ(x|z)) − KL(p(z) qφ(z)) + Cz
= −Epθ(x)KL(pθ(z|x) qφ(z|x)) − KL(pθ(x) q(x)) + Cz
where Cx = −h(q(x)), Cz = −h(p(z))
17
19. Implications of New Variational Expression
𝑝"(𝑥|𝑧)
𝑞)(𝑧|𝑥)
𝑝(𝑧)
Encoder Decoder
𝑞(𝑥)
𝑞)(𝑧)
𝑝"(𝑥)𝑞(𝑥)
𝑝(𝑧)
18
21. Result of Combined Variational Expressions
𝑝"(𝑥|𝑧)
𝑞)(𝑧|𝑥)
𝑝(𝑧)
Encoder Decoder
𝑞(𝑥)
𝑞)(𝑧)
𝑝(𝑧)
𝑝"(𝑥)
𝑞(𝑥)
20
22. Symmetric Variational Representation
• Symmetric variational lower bound:
Lxz(θ, φ) = Lx(θ, φ) + Lz(θ, φ)
= Eq(x)Eqφ(z|x)h(x, z; θ, φ) − Ep(z)Epθ(x|z)h(x, z; θ, φ) + K
where K = Cx + Cz and
h(x, z; θ, φ) = log
pθ(x|z)p(z)
qφ(z|x)q(x)
= log
pθ(x, z)
qφ(x, z)
• Note that h(x, z; θ, φ) is a log likelihood ratio test (LRT) statistic,
and maximization of Lxz(θ, φ) corresponds to matching the
expectations to the LRT
• Problem: To evaluate h(·) we require q(x), the true data-generating
density, which we lack
21
24. Slight Detour - 2/2
• Let π(b = 0|x, z) ∈ [0, 1] be a function that defines the probability
b = 0 given (x, z)
• Define ˆπ(b = 0|x, z) as
argmaxπ(b=0|x,z) {Epθ(x,z) log π(b = 0|x, z)+Eqφ(x,z) log[1−π(b = 0|x, z)]}
• The solution to this setup is
ˆπ(b = 0|x, z) =
pθ(x, z)
qφ(x, z) + pθ(x, z)
ˆπ(b = 1|x, z) = 1 − ˆπ(b = 0|x, z) =
qφ(x, z)
qφ(x, z) + pθ(x, z)
23
25. Inferring Log Ratio from Synthesized Samples
• Consider the cost function
g(ψ; θ, φ) = Epθ(x,z) log σ[hψ(x, z; θ, φ)]+Epφ(x,z) log[1−σ(hψ(x, z; θ, φ)]
where σ(·) is the logistic function and hψ(x, z; θ, φ) is a deep neural
network with parameters ψ, with input (x, z) and scalar output
• For fixed (θ, φ), the parameters ψ∗
that maximize g(ψ; θ, φ) are
hψ∗ (x, z; θ, φ) = log
pθ(x, z)
qφ(x, z)
24
26. Algorithm Summary for Symmetric Variational Learning
(θi+1, φi+1) = argmax(θ,φ)Eqφ(x,z)hψi
(x, z) − Epθ(x,z)hψi
(x, z)
ψi+1 = argmaxψEpθi+1
(x,z) log σ(hψ(x, z)) + Epφi+1
(x,z) log(1 − σ(hψ(x, z))
• Expectations performed approximately via sampling:
z ∼ p(z), x = fθ(z)
x ∼ q(x), δ ∼ N(0, I), z = gφ(x, δ)
• Framework composed of three deep neural networks: fθ(z) and
gφ(x, δ) and hψ(x, z)
• Have derived a generative adversarial network (GAN) setup via
first-principles, symmetrizing a variational lower bound
25
27. GAN-Like Setup
(θi+1, φi+1) = argmax(θ,φ)Eqφ(x,z)hψi
(x, z) − Epθ(x,z)hψi
(x, z)
• Update generative model parameters (θ, φ) to best “fool” the
likelihood ratio test (LRT) statistic hψi
(x, z)
ψi+1 = argmaxψEpθi+1
(x,z) log σ(hψ(x, z))+Epφi+1
(x,z) log(1−σ(hψ(x, z))
• Given new generative model parameters, update the LRT test statistic,
to best distinguish between two types of generative models
• “Adversarial game” between LRT and generative model, that is derived
as a natural outcome of symmetrizing the variational expression
26
30. Summary
• Have modeled data as being drawn with latent variable z ∼ p(z), with
z then fed through neural network yielding x = fθ(z)
• Given x, perform inference for latent variable using z = gφ(x, δ),
δ ∼ N(0, I)
• Learn NN parameters θ and φ via symmetric variational expression
• In the context of inference, learn z = gφ(x, δ) as a means to draw
samples for latent variables
• Excellent synthesis of realistic data, and also effective tool for inference
• Learning constitutes a generalization of generative adversarial networks
(GANs)
29