Metrics for distributions and their applications for generative models (part 1)
Dai Hai Nguyen
Kyoto University
Learning generative models?
Given a model distribution $P_\theta$ and the empirical distribution of the data $Q = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}$, how do we measure $\mathrm{Distance}(P_\theta, Q)$?
Learning generative models?
• Maximum Likelihood Estimation (MLE): given training samples $x_1, x_2, \ldots, x_n$, learn a model $p_{model}(x;\theta)$ from which the training samples are likely to have been generated:
$\theta^* = \mathrm{argmax}_\theta \sum_{i=1}^{n} \log p_{model}(x_i; \theta)$
Learning generative models?
• Likelihood-free model: a neural network generator maps random input $z \sim \mathrm{Uniform}$ to an output sample.
How to measure similarity between $p$ and $q$?
§ Kullback-Leibler (KL) divergence: asymmetric, i.e., $D_{KL}(p\|q) \neq D_{KL}(q\|p)$
$D_{KL}(p\|q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$
§ Jensen-Shannon (JS) divergence: symmetric
$D_{JS}(p\|q) = \frac{1}{2} D_{KL}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q \,\Big\|\, \frac{p+q}{2}\right)$
§ Optimal transport (OT):
$\mathcal{W}_c(p, q) = \inf_{\gamma \in \Pi(p,q)} E_{(x,y)\sim\gamma}[\|x - y\|]$
where $\Pi(p, q)$ is the set of all joint distributions of $(X, Y)$ with marginals $p$ and $q$.
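For discrete distributions these measures are straightforward to evaluate. A minimal sketch in Python (NumPy assumed; the function names are illustrative), showing the asymmetry of KL and the symmetry of JS:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    """Jensen-Shannon divergence: average of two KL terms to the mixture."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q), kl_divergence(q, p))  # two different values
print(js_divergence(p, q), js_divergence(q, p))  # two equal values
```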
Many fundamental problems can be cast as quantifying similarity between two distributions
§ Maximum likelihood estimation (MLE) is equivalent to minimizing KL divergence.
Suppose we draw $N$ samples $x \sim p(x|\theta^*)$. The MLE of $\theta$ is
$\hat{\theta} = \mathrm{argmin}_\theta \; -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i|\theta) \approx \mathrm{argmin}_\theta \; -E_{x\sim p(x|\theta^*)}[\log p(x|\theta)]$
By definition of the KL divergence:
$D_{KL}(p(x|\theta^*) \| p(x|\theta)) = E_{x\sim p(x|\theta^*)}\!\left[\log \frac{p(x|\theta^*)}{p(x|\theta)}\right] = E_{x\sim p(x|\theta^*)}[\log p(x|\theta^*)] - E_{x\sim p(x|\theta^*)}[\log p(x|\theta)]$
The first term does not depend on $\theta$ and the second is the MLE objective, so maximizing the likelihood minimizes the KL divergence to the generating distribution.
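The equivalence can be checked numerically on a toy Bernoulli model; a sketch assuming NumPy, where the MLE (the sample mean) coincides with the $\theta$ that minimizes the KL divergence over a grid:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7
x = rng.random(100_000) < theta_true  # samples from Bernoulli(theta_true)

# The MLE for a Bernoulli parameter is simply the sample mean.
theta_mle = x.mean()

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

# Minimize KL(Bernoulli(theta_mle) || Bernoulli(theta)) over a grid of theta.
grid = np.linspace(0.01, 0.99, 99)
theta_kl = grid[np.argmin(kl_bernoulli(theta_mle, grid))]
print(theta_mle, theta_kl)  # both close to 0.7
```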
Training GAN is equivalent to minimizing JS divergence
§ GAN has two networks, D and G, which play a minimax game:
$\min_G \max_D L(D, G) = E_{x\sim q(x)}[\log D(x)] + E_{z\sim r(z)}[\log(1 - D(G(z)))] = E_{x\sim q(x)}[\log D(x)] + E_{x\sim p(x)}[\log(1 - D(x))]$
where $p(x)$ and $q(x)$ are the distributions of fake images and real images, respectively.
§ Fixing G, the optimal D is easily obtained:
$D(x) = \frac{q(x)}{p(x) + q(x)}$
Training GAN is equivalent to minimizing JS divergence
§ GAN has two networks, D and G, which play a minimax game:
$\min_G \max_D L(D, G) = E_{x\sim q(x)}[\log D(x)] + E_{x\sim p(x)}[\log(1 - D(x))]$
where $p(x)$ and $q(x)$ are the distributions of fake and real images, respectively.
§ Fixing G, the optimal D is $D(x) = \frac{q(x)}{p(x) + q(x)}$, and substituting it back:
$L(D^*, G) = \int q(x) \log \frac{q(x)}{p(x) + q(x)} \, dx + \int p(x) \log \frac{p(x)}{p(x) + q(x)} \, dx = 2 D_{JS}(p\|q) - \log 4$
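The identity $L(D^*, G) = 2 D_{JS}(p\|q) - \log 4$ can be verified numerically on discrete distributions; a sketch assuming NumPy:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])  # "fake" distribution
q = np.array([0.4, 0.4, 0.2])  # "real" distribution

# Optimal discriminator for a fixed generator.
D_star = q / (p + q)

# Value of the inner game at the optimal discriminator.
L = np.sum(q * np.log(D_star)) + np.sum(p * np.log(1 - D_star))

# Jensen-Shannon divergence via its mixture definition.
m = 0.5 * (p + q)
kl = lambda a, b: float(np.sum(a * np.log(a / b)))
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
print(L, 2 * js - np.log(4))  # identical up to rounding
```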
f-divergences
• Divergence between two distributions:
$D_f(q\|p) = \int p(x) \, f\!\left(\frac{q(x)}{p(x)}\right) dx$
• $f$: generator function, convex with $f(1) = 0$
• Every convex function $f$ has a convex conjugate $f^*$ such that:
$f(x) = \sup_{y \in \mathrm{dom}(f^*)} \{xy - f^*(y)\}$
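The conjugate can be checked by brute force on a grid. For example, the KL generator $f(t) = t\log t$ has conjugate $f^*(y) = e^{y-1}$; a sketch assuming NumPy (grid bounds chosen to contain the maximizer $t = e^{y-1}$ for the tested $y$):

```python
import numpy as np

def f(t):
    """KL generator f(t) = t log t."""
    return t * np.log(t)

ts = np.linspace(1e-6, 10.0, 1_000_000)  # grid over dom(f)

def f_star_numeric(y):
    """Brute-force conjugate: sup_t { t*y - f(t) } over the grid."""
    return float(np.max(ts * y - f(ts)))

for y in [0.0, 0.5, 1.0]:
    print(f_star_numeric(y), np.exp(y - 1.0))  # agree closely
```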
f-divergences
• Different generator functions $f$ give different divergences (e.g., $f(t) = t\log t$ yields the KL divergence).
Estimating f-divergences from samples
Conjugate function of $f$: $f^*(x) = \sup_{t \in \mathrm{dom}(f)} \{tx - f(t)\}$, with some properties:
• $f(x) = \sup_{t \in \mathrm{dom}(f^*)} \{tx - f^*(t)\}$
• $f^{**}(x) = f(x)$
• $f^*(x)$ is always convex
Using the conjugate, the divergence admits a variational lower bound:
$D_f(q\|p) = \int p(x) \, f\!\left(\frac{q(x)}{p(x)}\right) dx = \int p(x) \sup_{t \in \mathrm{dom}(f^*)} \left\{ t \, \frac{q(x)}{p(x)} - f^*(t) \right\} dx$
$\geq \sup_{T \in \mathcal{T}} \left\{ \int q(x) \, T(x) \, dx - \int p(x) \, f^*(T(x)) \, dx \right\} = \sup_{T \in \mathcal{T}} \left\{ E_{x\sim Q}[T(x)] - E_{x\sim P}[f^*(T(x))] \right\}$
The first expectation is estimated with samples from $Q$, the second with samples from $P$.
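A Monte Carlo sketch of this lower bound for two Gaussians, assuming NumPy. For $f(t) = t\log t$ the optimal critic is $T(x) = 1 + \log\frac{q(x)}{p(x)}$, so the bound is tight and recovers $D_{KL}(q\|p) = 0.5$ for $q = \mathcal{N}(1,1)$, $p = \mathcal{N}(0,1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

xs_q = rng.normal(1.0, 1.0, n)  # samples from Q = N(1, 1)
xs_p = rng.normal(0.0, 1.0, n)  # samples from P = N(0, 1)

log_ratio = lambda x: x - 0.5        # log q(x)/p(x) for these Gaussians
T = lambda x: 1.0 + log_ratio(x)     # optimal critic for f(t) = t log t
f_star = lambda y: np.exp(y - 1.0)   # conjugate of t log t

# Variational lower bound, estimated from the two sample sets.
lower_bound = np.mean(T(xs_q)) - np.mean(f_star(T(xs_p)))
print(lower_bound)  # close to 0.5 = KL(q || p)
```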
Training f-divergence GAN
• f-GAN: parameterize the critic $T_w$ and the generator distribution $P_\theta$, then optimize
$\min_\theta \max_w F(\theta, w) = E_{x\sim Q}[T_w(x)] - E_{x\sim P_\theta}[f^*(T_w(x))]$
Ref: f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization, NIPS 2016
Turns out: GAN is a specific case of f-divergence
• GAN:
$\min_\theta \max_w E_{x\sim Q}[\log D_w(x)] + E_{x\sim P_\theta}[\log(1 - D_w(x))]$
• f-GAN:
$\min_\theta \max_w E_{x\sim Q}[T_w(x)] - E_{x\sim P_\theta}[f^*(T_w(x))]$
By choosing suitable $T$ and $f$, f-GAN turns into the original GAN.
1-Wasserstein distance (another option)
§ It seeks a probabilistic coupling $\gamma$:
$W_1 = \min_{\gamma \in \mathbb{P}} \int_{\mathcal{X}\times\mathcal{Y}} c(x, y) \, \gamma(x, y) \, dx \, dy = E_{(x,y)\sim\gamma}[c(x, y)]$
where $\mathbb{P} = \{\gamma \geq 0, \; \int \gamma(x, y) \, dy = p(x), \; \int \gamma(x, y) \, dx = q(y)\}$ and $c(x, y)$ is the displacement cost from $x$ to $y$ (e.g., the Euclidean distance).
§ a.k.a. the Earth Mover's distance
§ Can be formulated as a linear program (convex)
Kantorovich's formulation of OT
§ In the case of discrete inputs:
$p = \sum_{i=1}^{m} a_i \delta_{x_i}, \qquad q = \sum_{j=1}^{n} b_j \delta_{y_j}$
§ Couplings:
$\mathbb{P} = \{P \geq 0, \; P \in \mathbb{R}^{m\times n}, \; P 1_n = a, \; P^\top 1_m = b\}$
§ LP problem: find $P$
$P = \mathrm{argmin}_{P \in \mathbb{P}} \langle P, C \rangle$
where $C$ is the cost matrix, i.e., $C_{ij} = c(x_i, y_j)$.
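This LP can be solved directly with an off-the-shelf solver; a sketch assuming SciPy's `linprog`, on a small 1-D example (the support points and weights are illustrative) whose optimal value, 0.75, can be checked by hand from the two CDFs:

```python
import numpy as np
from scipy.optimize import linprog

# Discrete OT as an LP: minimize <P, C> s.t. P 1 = a, P^T 1 = b, P >= 0.
a = np.array([0.5, 0.5])            # source weights (m = 2)
b = np.array([0.25, 0.25, 0.5])     # target weights (n = 3)
x = np.array([0.0, 1.0])            # source support points
y = np.array([0.0, 1.0, 2.0])       # target support points
C = np.abs(x[:, None] - y[None, :])  # cost matrix C_ij = |x_i - y_j|

m, n = C.shape
# Marginal constraints on vec(P) (row-major): row sums = a, column sums = b.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0  # sum_j P_ij = a_i
for j in range(n):
    A_eq[m + j, j::n] = 1.0           # sum_i P_ij = b_j
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)  # the 1-Wasserstein distance between p and q (0.75 here)
```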
Why is OT better than KL and JS divergences?
§ OT provides a smooth measure, which is more useful than KL and JS when the supports of the two distributions do not overlap.
§ Example: for two distributions with disjoint supports, KL is infinite and JS saturates at $\log 2$, while OT varies smoothly with the distance between the supports. (figure omitted)
How to apply the 1-Wasserstein distance to GAN?
$\mathcal{W}_1(p, q) = \inf_{\gamma \in \Pi(p,q)} E_{(x,y)\sim\gamma}[\|x - y\|] = \inf_\gamma \langle C, \gamma \rangle$
s.t. $\sum_{j=1}^{n} \gamma_{ij} = p_i, \; i = 1,\ldots,m$ and $\sum_{i=1}^{m} \gamma_{ij} = q_j, \; j = 1,\ldots,n$
This is a linear program, so we can pass to its dual:
Primal: $\min c^\top x$ s.t. $Ax = b, \; x \geq 0$, with $c = \mathrm{vec}(C) \in \mathbb{R}^{mn}$, $x = \mathrm{vec}(\gamma) \in \mathbb{R}^{mn}$, $b = [p^\top, q^\top]^\top \in \mathbb{R}^{m+n}$
Dual: $\max b^\top y$ s.t. $A^\top y \leq c$, i.e., with $y = [f^\top, g^\top]^\top$:
$\max f^\top p + g^\top q \quad \text{s.t.} \quad f_i + g_j \leq C_{ij}, \; i = 1,\ldots,m; \; j = 1,\ldots,n$
At the optimum $f_i = -g_i$, so $|f_i - f_j| \leq |x_i - y_j|$, i.e., $f$ is 1-Lipschitz:
$\mathcal{W}_1(p, q) = \sup_{\|f\|_L \leq 1} E_{x\sim p}[f(x)] - E_{x\sim q}[f(x)]$
(Kantorovich-Rubinstein duality)
Training WGAN
In WGAN, replace the discriminator with a 1-Lipschitz critic $f$ and minimize the 1-Wasserstein distance:
$\min_\theta \mathcal{W}_1(p, q_\theta) = \min_\theta \sup_{\|f_w\|_L \leq 1} E_{x\sim p}[f_w(x)] - E_{z\sim r(z)}[f_w(g_\theta(z))]$
Training alternates between finding $w$ (the critic step) and updating $\theta$ (the generator step).
Ref: Wasserstein GAN, ICML 2017
Thank you for listening
