4. Generative Adversarial Networks (GANs)
• Two-player game between the discriminator D and the generator G
$$J_D(D, G) = -\mathbb{E}_{x \sim p_\text{data}}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
$$J_G(G) = \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
D is trained to assign real data a 1 and fake data a 0; G is trained to make D assign generated (fake) data a 1. Here $p_\text{data}$ is the data distribution and $p_z$ is the prior distribution over the latent variable $z$.
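As a quick sanity check (my own sketch; the discriminator, generator, and distributions below are illustrative placeholders, not from the slides), both losses can be estimated from samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Toy discriminator: a fixed logistic model (illustrative only).
    return 1.0 / (1.0 + np.exp(-x))

def G(z):
    # Toy generator: an affine map of the latent prior (illustrative only).
    return 0.5 * z - 1.0

x_real = rng.normal(loc=2.0, scale=1.0, size=10_000)   # samples from p_data
z = rng.normal(size=10_000)                            # samples from the prior p_z

# J_D = -E[log D(x)] - E[log(1 - D(G(z)))]
J_D = -np.mean(np.log(D(x_real))) - np.mean(np.log(1.0 - D(G(z))))
# J_G = E[log(1 - D(G(z)))]  (minimax form)
J_G = np.mean(np.log(1.0 - D(G(z))))

print(J_D, J_G)
```

Minimizing $J_D$ pushes $D(x)$ toward 1 on real data and toward 0 on fake data; minimizing $J_G$ pushes $D(G(z))$ toward 1.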
6. Original Convergence Proof
For a finite data set $\mathbf{X}$, we only have the empirical distribution
$$p_\text{data}(x) = \begin{cases} 1, & x \in \mathbf{X} \\ 0, & \text{otherwise} \end{cases}$$
(so density estimation may be needed), which is hard to optimize. The proof also assumes (as one of its conditions) that the criterion can be easily improved at every step, which is usually not the case.
8. Comparisons of GANs
[Figure: generator loss as a function of $D(G(z))$, from "D is not fooled" to "D is fooled", for the minimax GAN and the non-saturating GAN (Goodfellow [7])]
• The non-saturating GAN has fewer training difficulties at the initial stage, when G can hardly fool D.
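The difference between the two generator losses is easy to see numerically (a sketch of my own): differentiate each loss with respect to $s = D(G(z))$ and evaluate near $s = 0$, where G can hardly fool D:

```python
# Generator-loss gradients w.r.t. the discriminator output s = D(G(z)):
#   minimax:        J_G = log(1 - s)   =>  dJ/ds = -1 / (1 - s)
#   non-saturating: J_G = -log(s)      =>  dJ/ds = -1 / s

def minimax_grad(s):
    return -1.0 / (1.0 - s)

def non_saturating_grad(s):
    return -1.0 / s

# Early in training G can hardly fool D, so s = D(G(z)) is near 0.
s = 0.01
print(abs(minimax_grad(s)))         # ~1.01: the minimax loss saturates
print(abs(non_saturating_grad(s)))  # 100.0: strong learning signal
```

The minimax gradient is tiny where D easily rejects fakes, while the non-saturating gradient is large there, which matches the "less training difficulties at the initial stage" observation.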
11. Kantorovich-Rubinstein duality
• Kantorovich–Rubinstein duality:
$$W(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)],$$
where the supremum is taken over all the 1-Lipschitz functions $f: \mathcal{X} \to \mathbb{R}$.
• Definition: a function $f: \mathbb{R} \to \mathbb{R}$ is Lipschitz continuous if $\exists K \in \mathbb{R}$ s.t. $\forall x_1, x_2 \in \mathbb{R}$, $|f(x_1) - f(x_2)| \le K |x_1 - x_2|$.
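For two point masses in 1-D, the duality can be checked by hand (a sketch; the example distributions are my own, not from the slides): $W(\delta_a, \delta_b) = |a - b|$, and the 1-Lipschitz function $f(x) = \mathrm{sign}(a - b)\,x$ attains the supremum.

```python
import numpy as np

a, b = 3.0, 1.0   # P_r = delta_a, P_theta = delta_b (hypothetical example)

# Any 1-Lipschitz f gives E_{P_r}[f] - E_{P_theta}[f] = f(a) - f(b) <= |a - b|.
# f(x) = sign(a - b) * x is 1-Lipschitz and attains that bound:
f = lambda x: np.sign(a - b) * x
dual_value = f(a) - f(b)

print(dual_value)  # 2.0, equal to |a - b|
```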
12. Wasserstein GAN
• Key: use a NN to estimate the Wasserstein distance (and use it as a critic for G)
$$J_D(D, G) = -\mathbb{E}_{x \sim p_\text{data}}[D(x)] + \mathbb{E}_{z \sim p_z}[D(G(z))]$$
$$J_G(G) = -\mathbb{E}_{z \sim p_z}[D(G(z))]$$
• Original GAN (non-saturating)
$$J_D(D, G) = -\mathbb{E}_{x \sim p_\text{data}}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
$$J_G(G) = -\mathbb{E}_{z \sim p_z}[\log D(G(z))]$$
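A minimal sketch of the WGAN losses on toy samples (the critic and generator here are illustrative placeholders of my own, not from the slides); note there are no logs and the critic output is an unbounded score:

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x):
    # Toy critic: unbounded linear score (no sigmoid, unlike a GAN discriminator).
    return 0.7 * x

def G(z):
    return 0.5 * z - 1.0

x_real = rng.normal(loc=2.0, size=10_000)   # samples from p_data
z = rng.normal(size=10_000)                 # samples from p_z

# J_D = -E[D(x)] + E[D(G(z))]  (critic pushes real scores up, fake scores down)
J_D = -np.mean(critic(x_real)) + np.mean(critic(G(z)))
# J_G = -E[D(G(z))]            (generator pushes fake scores up)
J_G = -np.mean(critic(G(z)))

# -J_D estimates E[D(x)] - E[D(G(z))], a lower bound on W(p_data, p_g)
# once the critic is constrained to be 1-Lipschitz (here ||0.7x||_Lip = 0.7 <= 1).
print(J_D, J_G)
```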
13. Wasserstein GAN
• Problem: such a NN needs to satisfy a Lipschitz constraint
• Global regularization
  • weight clipping (original WGAN)
  • spectral normalization (SNGAN)
• Local regularization
  • gradient penalties (WGAN-GP)
15. Weight Clipping
• Key: clip the weights of the critic into $[-c, c]$
• Depending on $c$, the clipped critic suffers from gradient vanishing or gradient exploding from output to input, which causes training difficulties (Gulrajani et al. [4])
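Weight clipping itself is a one-liner applied after each critic update; this sketch uses hypothetical parameter shapes of my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_weights(params, c=0.01):
    """Project every critic weight into [-c, c], as in the original WGAN."""
    return [np.clip(W, -c, c) for W in params]

# Hypothetical critic parameters after a gradient step:
params = [rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=(4,))]
params = clip_weights(params, c=0.01)

max_abs = max(np.abs(W).max() for W in params)
print(max_abs)  # <= 0.01
```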
16. Spectral Normalization
• Key: constrain the spectral norm of each layer
• For each layer $g: \mathbf{h}_{in} \mapsto \mathbf{h}_{out}$, by definition we have
$$\|g\|_{Lip} = \sup_{\mathbf{h}} \sigma(\nabla g(\mathbf{h})),$$
where
$$\sigma(A) := \max_{\mathbf{h} \neq 0} \frac{\|A\mathbf{h}\|_2}{\|\mathbf{h}\|_2} = \max_{\|\mathbf{h}\|_2 \le 1} \|A\mathbf{h}\|_2$$
is the spectral norm of $A$, i.e. the largest singular value of $A$.
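The identity between the sup definition and the largest singular value can be checked numerically (a sketch of my own with an arbitrary matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

# Largest singular value of A:
sigma = np.linalg.svd(A, compute_uv=False)[0]

# max_{h != 0} ||A h||_2 / ||h||_2, estimated over many random directions:
H = rng.normal(size=(3, 100_000))
ratios = np.linalg.norm(A @ H, axis=0) / np.linalg.norm(H, axis=0)

print(sigma, ratios.max())  # the empirical max approaches sigma from below
```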
17. Spectral Normalization
• For a linear layer $g(\mathbf{h}) = W\mathbf{h}$, $\|g\|_{Lip} = \sup_{\mathbf{h}} \sigma(\nabla g(\mathbf{h})) = \sup_{\mathbf{h}} \sigma(W) = \sigma(W)$
• For typical activation layers $a(\mathbf{h})$:
  • $\|a\|_{Lip} = 1$ for ReLU and LeakyReLU
  • $\|a\|_{Lip} = K$ for other common activation layers (e.g. sigmoid, tanh)
• With the inequality $\|f_1 \circ f_2\|_{Lip} \le \|f_1\|_{Lip} \cdot \|f_2\|_{Lip}$, for a network $f$ of $L$ linear layers interleaved with 1-Lipschitz activations we now have
$$\|f\|_{Lip} \le \|W_1\|_{Lip} \cdot \|a_1\|_{Lip} \cdots \|W_L\|_{Lip} \cdot \|a_L\|_{Lip} = \prod_{l=1}^{L} \sigma(W_l)$$
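The product bound can be verified empirically (a sketch of my own; the tiny network and its weights are arbitrary): no pair of inputs produces a slope exceeding $\sigma(W_1)\,\sigma(W_2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer ReLU network f(x) = W2 @ relu(W1 @ x) (weights are arbitrary).
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(1, 8))
relu = lambda h: np.maximum(h, 0.0)
f = lambda x: W2 @ relu(W1 @ x)

# Lipschitz bound from the slide: sigma(W1) * sigma(W2) (ReLU is 1-Lipschitz).
bound = (np.linalg.svd(W1, compute_uv=False)[0]
         * np.linalg.svd(W2, compute_uv=False)[0])

# Empirically, |f(x1) - f(x2)| / ||x1 - x2|| never exceeds the bound:
X1 = rng.normal(size=(4, 10_000))
X2 = rng.normal(size=(4, 10_000))
slopes = np.abs(f(X1) - f(X2))[0] / np.linalg.norm(X1 - X2, axis=0)

print(slopes.max(), bound)
```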
18. Spectral Normalization
• Spectral normalization: $\bar{W}_{SN}(W) := W / \sigma(W)$ ($W$: weight matrix)
• Now we have $\|f\|_{Lip} \le 1$ everywhere
• $\sigma(W)$ can be approximated quickly with the power iteration method (see the paper) (Miyato et al. [3])
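A minimal power-iteration sketch (my own simplification: SNGAN reuses a single iteration across training steps, while this version just iterates to convergence for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_norm(W, n_iter=200):
    """Approximate sigma(W), the largest singular value, by power iteration."""
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    # With u, v converged to the top singular vectors, u @ W @ v ~= sigma(W).
    return u @ W @ v

W = rng.normal(size=(16, 8))
W_sn = W / spectral_norm(W)   # spectrally normalized weight

top_sv = np.linalg.svd(W_sn, compute_uv=False)[0]
print(top_sv)  # close to 1.0
```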
19. Gradient Penalties
• Key: penalize the critic when it violates the Lipschitz constraint
• Since it is impossible to enforce the penalty everywhere, sample points $\hat{x}$ and punish the critic when the gradient norm gets away from 1, so that the gradient norm stays close to 1:
$$\mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]$$
• Two common sampling approaches for $\hat{x}$, with $\alpha \sim U(0, 1)$:
  • WGAN-GP: $\mathbb{P}_{\hat{x}} = \alpha \mathbb{P}_x + (1 - \alpha) \mathbb{P}_g$ (between the data and model distributions)
  • DRAGAN: $\mathbb{P}_{\hat{x}} = \alpha \mathbb{P}_x + (1 - \alpha) \mathbb{P}_{noise}$ (around the data distribution)
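The penalty is easy to sketch when the critic's gradient is available in closed form (my own toy setup: a linear critic, so the gradient is constant; a real critic would use autodiff):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy critic D(x) = w . x, so grad_x D(x) = w everywhere.
w = np.array([1.5, -2.0])
grad_D = lambda x: np.broadcast_to(w, x.shape)

x_real = rng.normal(size=(256, 2))            # samples from P_x
x_fake = rng.normal(loc=3.0, size=(256, 2))   # samples from P_g

# WGAN-GP sampling: interpolate between real and fake points, alpha ~ U(0, 1).
alpha = rng.uniform(size=(256, 1))
x_hat = alpha * x_real + (1.0 - alpha) * x_fake

# Penalty: E[(||grad D(x_hat)||_2 - 1)^2]
grad_norms = np.linalg.norm(grad_D(x_hat), axis=1)
penalty = np.mean((grad_norms - 1.0) ** 2)

print(penalty)  # (||w|| - 1)^2 = (2.5 - 1)^2 = 2.25 for this linear critic
```

Adding this penalty to the critic loss would push $\|w\|_2$ toward 1, i.e. toward a 1-Lipschitz critic.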
23. Why is WGAN more stable?
Is it the properties of the underlying divergence that is being optimized, or the Lipschitz constraint? Note that, theoretically, only the minimax GAN may suffer from gradient vanishing (Goodfellow [7]).
25. Why is WGAN more stable?
Is it the properties of the underlying divergence that is being optimized, or the Lipschitz constraint?
26. (Recap) Original Convergence Proof
For a finite data set $\mathbf{X}$, we only have the empirical distribution
$$p_\text{data}(x) = \begin{cases} 1, & x \in \mathbf{X} \\ 0, & \text{otherwise} \end{cases}$$
(so density estimation may be needed), which is hard to optimize.
27. From a Distribution Estimation Viewpoint
[Figure: critic landscapes when unregularized, locally regularized, and globally regularized]
• Smoother critics
  • give more stable guidance to the generator
  • alleviate the mode collapse issue
(my thoughts)
28. Open Questions
• Gradient penalties
  • are usually too strong in WGAN-GP
  • may create spurious local optima
  • see improved-improved-WGAN [8]
• Spectral normalization
  • does it impact the optimization procedure?
  • can it be used as a general regularization tool for any NN?
29. References
[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville,
and Yoshua Bengio, “Generative Adversarial Networks,” in Proc. NIPS, 2014.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein Generative Adversarial Networks,” in Proc.
ICML, 2017.
[3] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida, “Spectral Normalization for Generative
Adversarial Networks,” in Proc. ICLR, 2018.
[4] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville, “Improved training of
Wasserstein GANs,” in Proc. NIPS, 2017.
[5] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira, “On Convergence and Stability of GANs,” arXiv
preprint arXiv:1705.07215, 2017.
[6] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, and Ian Goodfellow,
“Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step,” in Proc. ICLR, 2018.
[7] Ian J. Goodfellow, “On distinguishability criteria for estimating generative models,” in Proc. ICLR, Workshop
Track, 2015.
[8] Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, and Liqiang Wang, “Improving the Improved Training of Wasserstein
GANs: A Consistency Term and Its Dual Effect,” in Proc. ICLR, 2018.