Wasserstein GAN

Wasserstein
GAN
Bar Vinograd
Wasserstein
GAN
Introduction
Distances
EM Properties
Training
Results
Summary
Improved
Training of
Wasserstein
GANs
Theory
Algorithm
Results
Summary
Wasserstein GAN
Bar Vinograd
The First Original Independent Seminar No. 5
03.05.2017

Outline
1 Wasserstein GAN
Introduction
Distances
EM Properties
Training
Results
Summary
2 Improved Training of Wasserstein GANs
Theory
Algorithm
Results
Summary

Introduction
Unsupervised Learning
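For data $\{x^{(i)}\}_{i=1}^{m}$ and a family of densities $\{P_\theta\}_{\theta \in \mathbb{R}^d}$, solve
\[ \max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log P_\theta(x^{(i)}) \]
or, asymptotically, $\min_\theta KL(P_r \| P_\theta)$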
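The link between the two objectives can be spelled out. A short derivation, assuming the samples $x^{(i)}$ are drawn i.i.d. from $P_r$ and that the expectations below are finite:

```latex
\begin{align*}
KL(P_r \| P_\theta)
  &= \mathbb{E}_{x \sim P_r}[\log P_r(x)] - \mathbb{E}_{x \sim P_r}[\log P_\theta(x)], \\
\frac{1}{m} \sum_{i=1}^{m} \log P_\theta(x^{(i)})
  &\xrightarrow[m \to \infty]{} \mathbb{E}_{x \sim P_r}[\log P_\theta(x)].
\end{align*}
```

The first term of the KL divergence does not depend on $\theta$, so in the limit maximizing the average log-likelihood and minimizing $KL(P_r \| P_\theta)$ select the same $\theta$.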

Introduction
GANs
Find $g_\theta : \mathcal{Z} \to \mathcal{X}$ s.t.
$Z \in \mathcal{Z}$ is a random variable with a fixed distribution, e.g. Gaussian
$\mathcal{X}$ is the domain being modelled (e.g. images, text, audio)
$g_\theta$ is a network of some kind
the distribution induced by $g_\theta$, i.e. $P_g$, is close to $P_r$

Introduction
GANs are hard to train
Problems
Saturated gradients
Loss is not correlated with convergence
Unstable
Mode collapse
In general, the supports of $P_r$ and $P_\theta$ are unlikely to have a non-negligible intersection

Solutions
Balance the generator and the discriminator. This gives a lower bound on the loss and avoids collapse.
Apply random noise to real samples, which creates an intersection between the supports
Use the −log(D) trick for the generator loss

Introduction
Goodfellow (2017)

Distances
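Kullback-Leibler (KL) divergence
\[ KL(P_r \| P_g) = \int \log\left(\frac{P_r(x)}{P_g(x)}\right) P_r(x)\, d\mu(x) \]
Jensen-Shannon (JS) divergence
\[ JS(P_r, P_g) = \frac{1}{2}\left( KL(P_r \| M) + KL(P_g \| M) \right) \]
where $M = \frac{1}{2}(P_r + P_g)$
Total Variation (TV) distance
\[ \delta(P_r, P_g) = \sup_{A \in \Sigma_{\mathcal{X}}} |P_r(A) - P_g(A)| \]
where $\Sigma_{\mathcal{X}}$ is the collection of measurable subsets of $\mathcal{X}$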
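For distributions on a finite set, the three quantities above reduce to sums. A minimal sketch, assuming the distributions are given as equal-length lists of probabilities (the function names and example vectors are illustrative only):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf     # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

def js(p, q):
    """Jensen-Shannon divergence with the mixture M = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

def tv(p, q):
    """Total variation: for discrete distributions the supremum over events
    is attained at A = {i : p_i > q_i}, which gives half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]          # support disjoint from p
overlapping = [0.25, 0.25, 0.25, 0.25]

print(kl(p, q), js(p, q), tv(p, q))
print(kl(p, overlapping), js(p, overlapping), tv(p, overlapping))
```

With disjoint supports, KL is infinite while JS and TV sit at their maximum values (log 2 and 1), regardless of how far apart the supports are.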

Distances
Earth Mover (EM) distance or Wasserstein-1
\[ W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y) \sim \gamma}\left[ \|x - y\| \right] \]
where $\Pi(P_r, P_g)$ is the set of all couplings of $P_r$ and $P_g$, i.e. all joint distributions on $\mathcal{X}^2$ with marginals $P_r$ and $P_g$.
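In one dimension the infimum over couplings has a simple form for two samples of equal size: an optimal coupling pairs the k-th smallest point of one sample with the k-th smallest point of the other. A small sketch under that assumption (equal sample counts, real-valued data; the function name is made up for this illustration):

```python
import random

def wasserstein1_1d(xs, ys):
    """W1 between two empirical distributions on the real line
    that have the same number of equally weighted samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples on both sides")
    # Sorting both samples realizes an optimal coupling in 1-D.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)

rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [x + 2.5 for x in real]   # same shape, moved by 2.5

print(wasserstein1_1d(real, shifted))
```

Shifting every sample by the same amount moves each unit of mass by that amount, so the printed value is the size of the shift up to floating-point rounding. In higher dimensions no such shortcut exists, which is why the definition above is hard to use directly.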

Distances
EM Illustration

Now do it with high-dimensional dirt piles

Distances
Example
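Let
$Z \sim U[0, 1]$
$P_0$ be the distribution of points $(0, Z) \in \mathbb{R}^2$
$g_\theta(z) = (\theta, z)$
Then
\[ W(P_0, P_\theta) = |\theta| \]
\[ JS(P_0, P_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases} \]
\[ KL(P_0 \| P_\theta) = \begin{cases} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases} \]
\[ \delta(P_0, P_\theta) = \begin{cases} 1 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases} \]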

Among these distances only EM is continuous in θ. The others jump at θ = 0 and give no useful gradient.
Supports that are disjoint, or that intersect only on a set of measure zero, are common between the model family and the real distribution
We would like our loss to have an informative gradient

Summary
The Wasserstein (or EM) loss for neural networks is continuous and
differentiable almost everywhere (under mild regularity assumptions on
the generator). Moreover, convergence in KL implies convergence in TV
and JS, which in turn implies convergence in EM.

Training
Kantorovich-Rubinstein duality
\[ W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)] \]
where $f : \mathcal{X} \to \mathbb{R}$ is 1-Lipschitz.
A function $f : X \to Y$ is K-Lipschitz, for some $K \ge 0$, if
\[ \|f(x_1) - f(x_2)\| \le K \|x_1 - x_2\| \]
for all $x_1, x_2 \in X$
Unlike the definition of EM, this duality provides us with a
tractable formulation
This is a special case of an integral probability metric
(IPM). For example, TV is also an IPM with the
appropriate choice of function family.
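The duality can be checked on the parallel-lines example from the Distances part ($P_0$ on the line x = 0, $P_\theta$ on the line x = θ). Any coupling gives an upper bound on W and any 1-Lipschitz f gives a lower bound; if the two agree, both are optimal. A minimal sketch, where the horizontal coupling and the test function f(x, y) = −sign(θ)·x are choices made for this illustration:

```python
import math
import random

def sample_p(theta, n, rng):
    """Draw n points (theta, z) with z ~ U[0, 1]; theta = 0 gives P_0."""
    return [(theta, rng.random()) for _ in range(n)]

def primal_upper_bound(theta, n=10_000, seed=0):
    # Coupling chosen for illustration: move (0, z) horizontally to (theta, z).
    rng = random.Random(seed)
    zs = [rng.random() for _ in range(n)]
    return sum(math.dist((0.0, z), (theta, z)) for z in zs) / n

def dual_lower_bound(theta, n=10_000, seed=0):
    # f(x, y) = -sign(theta) * x has gradient of norm 1, so it is 1-Lipschitz
    # and therefore feasible in the dual problem.
    sign = 1.0 if theta >= 0 else -1.0

    def f(point):
        return -sign * point[0]

    rng = random.Random(seed)
    real = sample_p(0.0, n, rng)
    fake = sample_p(theta, n, rng)
    return sum(f(p) for p in real) / n - sum(f(p) for p in fake) / n

for theta in (-1.5, 0.0, 0.7):
    print(theta, primal_upper_bound(theta), dual_lower_bound(theta))
```

Both bounds come out as |θ| up to rounding, which matches the value of $W(P_0, P_\theta)$ stated in the example.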

Training
which tells us how to train the generator

Training
Weight clipping keeps the discriminator (called the critic in WGAN) Lipschitz

Training
Weight clipping keeps the critic Lipschitz
The critic can be trained to optimality
Gradients do not saturate
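As a rough sketch of a critic update with weight clipping, under strong simplifying assumptions: a linear critic f_w(x) = w · x with a hand-written gradient, plain gradient ascent where the WGAN paper uses RMSProp, and fixed toy batches standing in for real and generated data. All names and data here are illustrative; the clipping threshold c = 0.01 and five critic steps per generator step are the defaults reported in the WGAN paper.

```python
import random

def critic_value(w, x):
    """Linear critic f_w(x) = w . x (a toy stand-in for a neural network)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def mean_vector(batch):
    n = len(batch)
    return [sum(x[i] for x in batch) / n for i in range(len(batch[0]))]

def critic_step(w, real_batch, fake_batch, lr=0.05, c=0.01):
    # For a linear critic, the gradient of E_r[f_w] - E_g[f_w] with respect
    # to w is simply mean(real) - mean(fake).
    mean_real, mean_fake = mean_vector(real_batch), mean_vector(fake_batch)
    w = [wi + lr * (r - f) for wi, r, f in zip(w, mean_real, mean_fake)]  # ascent
    # Weight clipping: force every weight back into [-c, c].
    return [max(-c, min(c, wi)) for wi in w]

def wasserstein_estimate(w, real_batch, fake_batch):
    real = sum(critic_value(w, x) for x in real_batch) / len(real_batch)
    fake = sum(critic_value(w, x) for x in fake_batch) / len(fake_batch)
    return real - fake

rng = random.Random(0)
real_batch = [(rng.gauss(2.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(64)]
fake_batch = [(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(64)]

w = [0.0, 0.0]
for _ in range(5):   # critic steps per generator step
    w = critic_step(w, real_batch, fake_batch)

print(w, wasserstein_estimate(w, real_batch, fake_batch))
```

The estimate is only proportional to the EM distance, with a scale set by the Lipschitz constant that c allows. In this toy run the weight along the direction separating the two batches ends up exactly at the clipping boundary c.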

Results
64x64x3 image generation on the LSUN-Bedrooms dataset
Compared with a DCGAN trained with the −log(D) trick

Results
JS generator loss
loss is saturated

Results
WGAN generator loss
loss is correlated with image quality

Summary
Benefits
Theoretically sound critic loss
Stable training
Loss correlates with sample quality
No mode collapse observed in the experiments
Problems
Does not work well with momentum-based optimizers, e.g.
Adam
Slower to converge than the standard GAN loss
Requires hyperparameter tuning

Problems with weight clipping
Diminished capacity: many Lipschitz functions cannot be
represented by a weight-clipped network

Exploding or vanishing gradients (without batch normalization),
depending on the clipping threshold c

Weights tend to saturate, i.e. concentrate at −c or c

Unstable with momentum-based optimizers, e.g. Adam

The Kantorovich-Rubinstein duality states that
\[ W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)] \]
where $f : \mathcal{X} \to \mathbb{R}$ is 1-Lipschitz.
A differentiable function is 1-Lipschitz iff its gradients have norm
at most 1 everywhere, i.e. lie in the unit ball
The optimal solution of the dual problem has gradients with
norm 1 almost everywhere, i.e. on the unit sphere

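Enforcing unit-norm gradients almost everywhere is not
tractable
Instead, sample random points in $\mathcal{X}$ and penalize the gradient norm at
those points
Gradient Penalty
\[ L = \underbrace{\mathbb{E}_{x \sim P_g}[D(x)] - \mathbb{E}_{x \sim P_r}[D(x)]}_{\text{critic loss}} + \underbrace{\lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right]}_{\text{gradient penalty}} \]
No batch norm in the critic, because the penalty applies to the gradient
norm per sample and not per batch. Use layer normalization instead
Use Adam
Sample $\hat{x}$ uniformly along straight lines between samples from $P_r$
and $P_g$
The unit sphere is taken under the $\ell_2$ norm
λ = 10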
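The loss above can be sketched without an autodiff library by using a critic whose input gradient is known in closed form. This is an illustration only: the toy critic D(x) = tanh(w · x + b), the function names, and the Gaussian batches are assumptions, and a real implementation would obtain the gradient of a neural critic through automatic differentiation.

```python
import math
import random

LAMBDA = 10.0   # penalty coefficient, as on the slide

def critic(x, w, b):
    """Toy critic D(x) = tanh(w . x + b); a stand-in for a neural network."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)

def critic_grad_norm(x, w, b):
    # d/dx tanh(w . x + b) = (1 - tanh^2(w . x + b)) * w,
    # so the l2 norm of the input gradient is available analytically.
    slope = 1.0 - critic(x, w, b) ** 2
    return abs(slope) * math.sqrt(sum(wi * wi for wi in w))

def interpolate(x_real, x_fake, rng):
    # One epsilon ~ U[0, 1] per (real, fake) pair: a point on the line between them.
    eps = rng.random()
    return [eps * r + (1.0 - eps) * f for r, f in zip(x_real, x_fake)]

def critic_loss(real_batch, fake_batch, w, b, rng):
    """E_g[D] - E_r[D] + lambda * E[(||grad D(x_hat)||_2 - 1)^2], batches of equal size."""
    n = len(real_batch)
    wasserstein_term = (sum(critic(x, w, b) for x in fake_batch)
                        - sum(critic(x, w, b) for x in real_batch)) / n
    penalty = sum(
        (critic_grad_norm(interpolate(r, f, rng), w, b) - 1.0) ** 2
        for r, f in zip(real_batch, fake_batch)
    ) / n
    return wasserstein_term + LAMBDA * penalty

rng = random.Random(0)
real_batch = [[rng.gauss(1.0, 0.2), rng.gauss(1.0, 0.2)] for _ in range(32)]
fake_batch = [[rng.gauss(-1.0, 0.2), rng.gauss(-1.0, 0.2)] for _ in range(32)]

print(critic_loss(real_batch, fake_batch, w=[0.3, -0.2], b=0.0, rng=rng))
```

The penalty is two-sided: it pushes the gradient norm toward 1 from both directions, so a constant critic (zero gradient) pays the full penalty λ rather than none.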

Toy Datasets

Speed on CIFAR-10
but the architecture must be stabilized first, with generator and
discriminator balanced.

Overfitting on MNIST
(a) LSUN (b) MNIST with penalty (left) and clipping (right)

LSUN
No BN and a constant number of filters in the generator,
as in Arjovsky et al. (2017)
4-layer 512-dim ReLU MLP generator, as in Arjovsky et al.
(2017)
No normalization in either the discriminator or generator
Gated multiplicative nonlinearities everywhere, as in van
den Oord et al. (2016)
tanh nonlinearities everywhere
101-layer ResNet generator and discriminator


Language Modeling
Reported as the first general language model trained entirely adversarially,
without a supervised maximum-likelihood loss. Here $\mathcal{X}$ is a simplex
of degree n, and $P_r$ can be thought of as a histogram on the
alphabet.

Summary
Maintains and improves on the benefits of the original WGAN
Fixes the problems with weight clipping: works with Adam, needs less
hyperparameter tuning, supports deeper networks
Enables a richer critic family

Thank You

Appendix
For Further Reading
M. Arjovsky and L. Bottou
Towards principled methods for training generative
adversarial networks.
Under review for ICLR 2017, abs/1701.04862, 2017.
M. Arjovsky, S. Chintala, and L. Bottou
Wasserstein GAN.
abs/1701.07875, 2017.
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville
Improved Training of Wasserstein GANs.
abs/1704.00028, 2017.
