Generative models: VAE and GANs
1. Generative models: VAE and GANs
Jinhwan Suk
Department of Mathematical Science, KAIST
May 7, 2020
Jinhwan Suk (Department of Mathematical Science, KAIST) · Generative models: VAE and GANs · May 7, 2020 · 1 / 29
2. Contents
Introduction to Information Theory
What is a Generative Model?
Example 1 : VAE
Example 2 : GANs
3. Introduction to Information Theory
Introduction
Information theory is a branch of applied mathematics.
Originally proposed by Claude Shannon in 1948.
A key measure in information theory is entropy.
Basic Intuition
Learning that an unlikely event has occurred is more informative than
learning that a likely event has occurred.
• Message 1: "The sun rose this morning."
• Message 2: "There was a solar eclipse this morning."
Message 2 is much more informative than Message 1.
4. Introduction to Information Theory
Formalization of Intuitions
Likely events should have low information content; events that are
guaranteed to happen should have no information content whatsoever.
Less likely events should have higher information content.
Independent events should have additive information, e.g. learning that a
tossed coin has come up heads twice conveys twice the information of one head.
Properties of the information function I(x) = I_X(x)
I(x) is a function of P(x).
I(x) is decreasing in P(x).
I(x) = 0 if P(x) = 1.
I(X1 = x1, X2 = x2) = I(X1 = x1) + I(X2 = x2) for independent X1, X2.
5. Introduction to Information Theory
Formalization of Intuitions
Let X1 and X2 be independent random variables with
P(X1 = x1) = p1 and P(X2 = x2) = p2.
Then, writing I(p) for the information of an event of probability p, we have

I(X1 = x1, X2 = x2) = I(P(x1, x2)) = I(P(X1 = x1) P(X2 = x2)) = I(p1 p2) = I(p1) + I(p2)

The only decreasing solutions of the functional equation I(p1 p2) = I(p1) + I(p2)
are I(p) = k log p for some k < 0.
6. Introduction to Information Theory
Measure of Information
Definition (Self-Information)
The self-information of an event X = x is
I(x) = − log P(x).
Self-information is a measure of the information (or uncertainty, surprise) of a
single event.
Definition (Shannon-Entropy)
Shannon entropy is the expected amount of information in an entire
probability distribution, defined by
H(X) = E_{X∼P}[I(X)] = −E_{X∼P}[log P(X)].
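The two definitions above can be checked with a few lines of Python. This is a small sketch using natural logarithms (so information is measured in nats); the probabilities are illustrative:

```python
import math

def self_information(p):
    """I(x) = -log P(x): rarer events carry more information."""
    return -math.log(p)

def shannon_entropy(probs):
    """H(X) = -sum_x P(x) log P(x): the expected self-information."""
    return sum(p * self_information(p) for p in probs if p > 0)

# "The sun rose this morning" vs. "there was a solar eclipse this morning"
likely, rare = 0.999, 0.0001
print(self_information(likely))   # close to 0: almost no information
print(self_information(rare))     # large: very informative

# A fair coin maximizes entropy among binary distributions: log 2 nats.
print(shannon_entropy([0.5, 0.5]))
```

Note how a certain event (p = 1) carries exactly zero information, matching the property I(x) = 0 if P(x) = 1.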
7. Introduction to Information Theory
Density Estimation
In a classification problem, we usually want to describe P(Y | X) for
each input X. So many models c_θ aim to estimate this conditional
distribution by choosing an optimal θ̂ such that
c_θ̂(x)[i] ≈ P(Y = y_i | X = x),
like the softmax classifier or logistic regression.
So we can regard the classification problem as a regression problem
that minimizes
R(c_θ) = E_X[L(c_θ(X), P(Y | X))]
(L measures the distance between two probability distributions).
8. Introduction to Information Theory
Two ways of measuring the distance between probability distributions
Definition (Total variation)
The total variation distance between two probability measures P_θ and
P_θ* is defined by
TV(P_θ, P_θ*) = max_{A: events} |P_θ(A) − P_θ*(A)|.
Definition (Kullback–Leibler divergence)
The KL divergence between two probability measures P_θ and P_θ* is
defined by
D_KL(P_θ || P_θ*) = E_{X∼P_θ}[log P_θ(X) − log P_θ*(X)].
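For discrete distributions both distances are easy to compute. The sketch below uses the identity TV(P, Q) = (1/2) Σ_x |P(x) − Q(x)| and assumes Q(x) > 0 wherever P(x) > 0; the distributions p and q are made up for illustration:

```python
import math

def total_variation(p, q):
    """TV(P,Q) = max_A |P(A) - Q(A)| = (1/2) sum_x |P(x) - Q(x)| for discrete P, Q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) [log P(x) - log Q(x)] (assumes Q > 0 where P > 0)."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]
print(total_variation(p, q))
print(kl_divergence(p, q), kl_divergence(q, p))  # note: KL is not symmetric
```

Both quantities are zero exactly when the two distributions coincide; unlike total variation, the KL divergence is not symmetric in its arguments.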
9. Introduction to Information Theory
Cross-Entropy
We usually use the KL divergence because finding an estimator of θ is much
easier with it:

D_KL(P_θ || P_θ*) = E_{X∼P_θ}[log P_θ(X) − log P_θ*(X)]
                  = E_{X∼P_θ}[log P_θ(X)] − E_{X∼P_θ}[log P_θ*(X)]
                  = constant − E_{X∼P_θ}[log P_θ*(X)]

Hence, minimizing the KL divergence is equivalent to minimizing
−E_{X∼P_θ}[log P_θ*(X)], which is called the cross-entropy. Estimation
with the estimator that minimizes the KL divergence (or cross-entropy) is
said to follow the maximum likelihood principle.
10. Introduction to Information Theory
Maximum Likelihood Estimation
P_θ* is the distribution of the population, and we want to choose a proper
estimator θ̂ by minimizing the distance between P_θ* and P_θ̂:

D_KL(P_θ* || P_θ̂) = const − E_{X∼P_θ*}[log P_θ̂(X)]

If X1, X2, ..., Xn are random samples, then by the law of large numbers,

E_{X∼P_θ*}[log P_θ̂(X)] ≈ (1/n) Σ_{i=1}^n log P_θ̂(X_i)

∴ D_KL(P_θ* || P_θ̂) ≈ const − (1/n) Σ_{i=1}^n log P_θ̂(X_i)
11. Introduction to Information Theory
Maximum Likelihood Estimation
min_{θ∈Θ} D_KL(P_θ* || P_θ̂) ⇐⇒ min_{θ∈Θ} −(1/n) Σ_{i=1}^n log P_θ̂(X_i)
                            ⇐⇒ max_{θ∈Θ} (1/n) Σ_{i=1}^n log P_θ̂(X_i)
                            ⇐⇒ max_{θ∈Θ} Σ_{i=1}^n log P_θ̂(X_i)
                            ⇐⇒ max_{θ∈Θ} Π_{i=1}^n P_θ̂(X_i)

This is the maximum likelihood principle.
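The chain of equivalences can be verified numerically. A minimal sketch, assuming a Bernoulli(θ) model and a made-up sample: the maximizer of the average log-likelihood over a grid of θ values comes out as the sample mean, exactly as the maximum likelihood principle predicts.

```python
import math

# Fixed Bernoulli sample; the MLE for the success probability is the sample mean.
samples = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # mean = 0.7

def avg_log_likelihood(theta, xs):
    """(1/n) sum_i log P_theta(X_i) for a Bernoulli(theta) model."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in xs) / len(xs)

# Grid-search the maximizer of the average log-likelihood.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: avg_log_likelihood(t, samples))
print(theta_hat)  # ~0.7, the sample mean
```

Maximizing the average log-likelihood, the total log-likelihood, or the likelihood product all pick the same θ̂, since the three differ only by a positive rescaling and a monotone transform.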
12. Introduction to Information Theory
Return to the main goal: find an estimator θ̂ that minimizes
R(c_θ) = E_X[L(c_θ(X), P(Y | X))].
Suppose that X1, X2, ..., Xn are i.i.d. and the cross-entropy is used for L. Then

E_X[L(c_θ(X), P(Y | X))] ≈ (1/n) Σ_{i=1}^n L(c_θ(X_i), P(Y | X_i))
                         = (1/n) Σ_{i=1}^n −E_{Y∼P_emp(Y | X_i)}[log c_θ(X_i)[Y]]
                         = (1/n) Σ_{i=1}^n − log{c_θ(X_i)[Y_i,true]}.
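The final line is the familiar classification loss: the average negative log-probability the model assigns to the true class. A small sketch with made-up predicted distributions c_θ(X_i) and labels:

```python
import math

def empirical_cross_entropy(probs, labels):
    """(1/n) sum_i -log c_theta(X_i)[y_i]: the average negative
    log-probability assigned to the true class."""
    return sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(probs)

# Predicted class distributions c_theta(X_i) for three inputs, and true labels.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.3, 0.5]]
labels = [0, 1, 2]
loose = empirical_cross_entropy(probs, labels)

# A more confident (and still correct) classifier attains a lower loss.
sharp = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
print(loose, empirical_cross_entropy(sharp, labels))
```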
13. What is a Generative Model?
Generative Model vs. Discriminative Model
A generative model is a statistical model of the joint distribution
P(X, Y) on X × Y.
A discriminative model is a model of the conditional probability of
the target given an observation x, P(Y | X = x).
In unsupervised learning, a generative model usually means a
statistical model of P(X).
How can we estimate the joint (or conditional) distribution?
What do we obtain while estimating the probability distribution?
What can we do with a generative model?
14. Example of Discriminative Model
Simple Linear Regression
Assumption: P(y|x) = N(α + βx, σ²), where σ > 0 is known.
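Under this assumption, maximizing the likelihood in α and β is the same as ordinary least squares, so this discriminative model has a closed-form fit. A sketch on made-up, exactly linear data:

```python
# Under P(y|x) = N(alpha + beta*x, sigma^2), maximum likelihood in (alpha, beta)
# reduces to least squares; the closed form recovers the line directly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x, so the fit is exact

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
       sum((x - x_bar) ** 2 for x in xs)
alpha = y_bar - beta * x_bar
print(alpha, beta)  # 1.0 2.0
```

The model only describes P(y|x); it says nothing about the distribution of the inputs x themselves, which is exactly what distinguishes it from a generative model.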
15. Concept of VAE
Goal: estimate the population distribution using given observations.
Strong assumption on the existence of latent variables: Z ∼ N(0, I).
X | Z ∼ N(f(Z; θ), σ² I)  (continuous data)
X | Z ∼ Bernoulli(f(Z; θ))  (binary data)
Let P_emp be the empirical distribution (assumption: P_emp ≈ P_pop). Then

arg min_θ D_KL(P_emp(X) || P_θ(X)) = arg min_θ [const − E_{X∼P_emp}[log P_θ(X)]]
                                   = arg max_θ E_{X∼P_emp}[log P_θ(X)]
                                   = arg max_θ (1/N) Σ_{i=1}^N log P_θ(X_i)
16. Concept of VAE
Naive approach
Maximize P_θ(X_i) w.r.t. θ for each sample X_1, X_2, ..., X_n.
=⇒ But P_θ(X_i) is intractable:

P_θ(X_i) = ∫_Z P_θ(X_i, z) dz = ∫_Z P_θ(X_i | z) P(z) dz ≈ (1/n) Σ_{j=1}^n P_θ(X_i | Z_j),   Z_j ∼ P(Z)

If we pick n large, the approximation is quite good. But for efficiency, we
look for some other way to keep n small.
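The naive estimator can be sketched directly. Here f(z) = 2z is a stand-in decoder chosen so that the exact marginal is known (with Z ∼ N(0,1) and X|Z ∼ N(2Z, σ²), the marginal is N(0, 4 + σ²)), which lets us see how many prior samples the estimate needs:

```python
import math, random

random.seed(0)
sigma = 0.5

def f(z):
    return 2.0 * z  # stand-in decoder: X | Z = z ~ N(2z, sigma^2)

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mc_marginal(x, n):
    """P_theta(x) = ∫ P_theta(x|z) P(z) dz ≈ (1/n) Σ_j P_theta(x | Z_j), Z_j ~ N(0, 1)."""
    return sum(gauss_pdf(x, f(random.gauss(0, 1)), sigma ** 2) for _ in range(n)) / n

x0 = 1.0
exact = gauss_pdf(x0, 0.0, 2.0 ** 2 + sigma ** 2)  # true marginal N(0, 4 + sigma^2)
print(exact, mc_marginal(x0, 100), mc_marginal(x0, 50_000))
```

With only a handful of samples the estimate is noisy, because most z drawn from the prior give P_θ(x₀ | z) ≈ 0; this is precisely the inefficiency the ELBO construction on the next slides is designed to avoid.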
17. Concept of VAE
ELBO
Use the Monte Carlo method on

P_θ(X_i) = ∫_Z P_θ(X_i, z) dz = ∫_Z P_θ(X_i | z) P(z) dz

For most z the integrand is nearly zero: only z with high posterior
probability P_θ(z | X_i) contribute. So we would like to pick Z_j where
P_θ(Z_j | X_i) is high ⇒ but the posterior is intractable.
Set Q_φ(Z|X) = N(µ_φ(X), σ_φ(X)²) to approximate P_θ(Z | X_i).
18. Concept of VAE
D_KL(Q_φ(Z|X) || P_θ(Z|X)) = E_{Z∼Q|X}[log Q_φ(Z|X) − log P_θ(Z|X)]
                           = E_{Z∼Q|X}[log Q_φ(Z|X) − log P_θ(X, Z)] + log P_θ(X)

We want to maximize log P_θ(X) and minimize D_KL(Q_φ(Z|X) || P_θ(Z|X)) at once.
Define L(θ, φ, X) = E_{Z∼Q|X}[log P_θ(X, Z) − log Q_φ(Z|X)]. Then

log P_θ(X) − D_KL(Q_φ(Z|X) || P_θ(Z|X)) = L(θ, φ, X)
19. Concept of VAE
ELBO
L(θ, φ, X) = E_{Z∼Q|X}[log P_θ(X, Z) − log Q_φ(Z|X)]
           = E_{Z∼Q|X}[log P_θ(X|Z) + log P_θ(Z) − log Q_φ(Z|X)]
           = E_{Z∼Q|X}[log P_θ(X|Z)] − D_KL(Q_φ(Z|X) || P_θ(Z))

D_KL(Q_φ(Z|X) || P_θ(Z)) can be integrated analytically:

D_KL(Q_φ(Z|X) || P_θ(Z)) = −(1/2)(1 + log σ_φ(X)² − µ_φ(X)² − σ_φ(X)²)

E_{Z∼Q|X}[log P_θ(X|Z)] requires estimation by sampling:

E_{Z∼Q|X}[log P_θ(X|Z)] ≈ (1/n) Σ_{i=1}^n log P_θ(X | z_i)
                        = (1/n) Σ_{i=1}^n [−(X − f(z_i; θ))²/(2σ²) − log √(2πσ²)]
20. Concept of VAE
ELBO
Maximizing L(θ, φ, X) is equivalent to minimizing

(1/n) Σ_{i=1}^n (X − f(z_i; θ))²/(2σ²) − (1/2)(1 + log σ_φ(X)² − µ_φ(X)² − σ_φ(X)²)

i.e. a reconstruction error plus the KL regularizer D_KL(Q_φ(Z|X) || P_θ(Z)).
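Putting the two terms together, a per-sample version of this objective can be sketched as follows. Everything here is a stand-in (the decoder f, the encoder outputs µ_φ, σ_φ, and the numbers); z is drawn with the reparameterization z = µ + σ·ε, ε ∼ N(0, 1):

```python
import math, random

random.seed(1)
sigma = 0.1  # fixed decoder noise scale in P_theta(X|Z) = N(f(Z; theta), sigma^2)

def vae_loss(x, mu_q, sigma_q, decoder, n=10):
    """Per-sample objective: Monte Carlo reconstruction term plus the
    analytic KL between Q = N(mu_q, sigma_q^2) and the prior N(0, 1)."""
    recon = 0.0
    for _ in range(n):
        z = mu_q + sigma_q * random.gauss(0, 1)  # reparameterization trick
        recon += (x - decoder(z)) ** 2 / (2 * sigma ** 2)
    recon /= n
    # D_KL(N(mu, s^2) || N(0, 1)) = -(1/2)(1 + log s^2 - mu^2 - s^2)
    kl = -0.5 * (1 + math.log(sigma_q ** 2) - mu_q ** 2 - sigma_q ** 2)
    return recon + kl

decoder = lambda z: 2.0 * z   # stand-in for f(z; theta)
good = vae_loss(1.0, 0.5, 0.1, decoder)   # encoder centered near the right z = x/2
bad = vae_loss(1.0, -0.5, 0.1, decoder)   # encoder centered far from it
print(good, bad)
```

An encoder that places its mass where the decoder can reconstruct the input gets a much lower loss, which is exactly the pressure that trains Q_φ toward the true posterior.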
21. Concept of VAE
Problem of the above formulation
Since P_θ(X_i) ≈ (1/n) Σ_{j=1}^n P_θ(X_i | z_j) and we use n = 1,

log P_θ(X_i) ≈ log P_θ(X_i | z_1)
            = log [ (1/√(2πσ²)) exp(−(X_i − f(z_1; θ))²/(2σ²)) ]
            = −(X_i − f(z_1; θ))²/(2σ²) + const.

Therefore, maximizing log P_θ(X_i) amounts to
minimizing (X_i − f(z_1; θ))²/(2σ²).
22. Concept of VAE
Problem of the above formulation
To address this problem, we should set σ very small.
23. Concept of GANs
Introduction
Goal: estimate the population distribution using given observations.
Strong assumption on the existence of latent variables: Z ∼ P_Z.
Define G(z; θ_g), a mapping into data space, so that
P_g(X = x) = P_Z(G(Z) = x).
Define D(x; θ_d), which represents the probability that x is real.

min_G max_D V(D, G) = E_{x∼P_emp}[log D(x)] + E_{z∼P_Z}[log(1 − D(G(z)))]

What is the difference between VAE and GANs?
⇒ GANs do not model P(X) explicitly.
⇒ But we can show the game has a global optimum at P_g = P_emp,
⇒ so we can say that a GAN is a generative model.
24. Concept of GANs
Algorithm
V(D, G) = E_{x∼P_emp}[log D(x)] + E_{z∼P_Z}[log(1 − D(G(z)))]
        ≈ (1/m) Σ_{i=1}^m log D(x_i) + (1/m) Σ_{j=1}^m log(1 − D(G(z_j)))

1. Sample a minibatch of m noise samples and a minibatch of m examples.
2. Update the discriminator by ascending its stochastic gradient:
   ∇_{θ_d} (1/m) Σ_{i=1}^m [log D(x_i) + log(1 − D(G(z_i)))]
3. Sample a minibatch of m noise samples.
4. Update the generator by descending its stochastic gradient:
   ∇_{θ_g} (1/m) Σ_{i=1}^m log(1 − D(G(z_i)))
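The two updates can be sketched end-to-end on a toy 1-D problem. Everything here is a stand-in: D(x) = sigmoid(ax + b), G(z) = cz + d, real data from N(3, 1), and finite-difference gradients in place of backpropagation:

```python
import math, random

random.seed(2)
sigmoid = lambda t: 1 / (1 + math.exp(-t))

# Toy 1-D GAN: D(x) = sigmoid(a*x + b), G(z) = c*z + d, real data ~ N(3, 1).
theta_d = [0.0, 0.0]   # (a, b)
theta_g = [1.0, 0.0]   # (c, d)
xs = [random.gauss(3, 1) for _ in range(64)]   # minibatch of real examples
zs = [random.gauss(0, 1) for _ in range(64)]   # minibatch of noise samples

def V(td, tg):
    """Minibatch estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    a, b = td
    c, d = tg
    real = sum(math.log(sigmoid(a * x + b)) for x in xs) / len(xs)
    fake = sum(math.log(1 - sigmoid(a * (c * z + d) + b)) for z in zs) / len(zs)
    return real + fake

def grad(f, params, eps=1e-6):
    """Finite-difference gradient, standing in for backprop."""
    base, g = f(params), []
    for i in range(len(params)):
        p = list(params)
        p[i] += eps
        g.append((f(p) - base) / eps)
    return g

lr = 0.01
v0 = V(theta_d, theta_g)
g_d = grad(lambda td: V(td, theta_g), theta_d)
theta_d = [p + lr * gi for p, gi in zip(theta_d, g_d)]   # step 2: ascend for D
v1 = V(theta_d, theta_g)
g_g = grad(lambda tg: V(theta_d, tg), theta_g)
theta_g = [p - lr * gi for p, gi in zip(theta_g, g_g)]   # step 4: descend for G
v2 = V(theta_d, theta_g)
print(v0, v1, v2)  # D's ascent step raises V, G's descent step lowers it
```

On the same minibatch, the discriminator's ascent step increases V while the generator's descent step decreases it, which is the alternating minimax dynamic the algorithm describes.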
25. Concept of GANs
Global optimality of P_g = P_emp
Proposition 1
For G fixed, the optimal discriminator D is

D*_G(x) = P_emp(x) / (P_emp(x) + P_g(x))

Proposition 2
The global minimum of the virtual training criterion is achieved if and only
if P_g = P_emp.
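Proposition 1 is easy to visualize with stand-in densities: take P_emp = N(0, 1) and P_g = N(2, 1). Then D*_G is close to 1 deep in the data region, exactly 1/2 where the two densities cross, and close to 0 deep in the generator's region:

```python
import math

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Stand-in densities: data ~ N(0, 1), generator ~ N(2, 1); they cross at x = 1.
p_data = lambda x: gauss_pdf(x, 0.0, 1.0)
p_g = lambda x: gauss_pdf(x, 2.0, 1.0)

def d_star(x):
    """Optimal discriminator: D*_G(x) = P_emp(x) / (P_emp(x) + P_g(x))."""
    return p_data(x) / (p_data(x) + p_g(x))

print(d_star(-1.0))  # deep in the data region: close to 1
print(d_star(1.0))   # the densities are equal at x = 1: exactly 1/2
print(d_star(3.0))   # deep in the generator region: close to 0
```

If P_g were equal to P_emp everywhere, D*_G would be identically 1/2 — the discriminator can do no better than chance, which is the situation at the global optimum of Proposition 2.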