NTHU AI Reading Group: Improved Training of Wasserstein GANs

1. NTHU AI Reading Group: Improved Training of Wasserstein GANs (Mark Chang, 2017/6/6)
2. Outline • Wasserstein GANs • Derivation of Kantorovich-Rubinstein Duality • Improved Training of WGANs • Experiments
3. Outline • Wasserstein GANs (Regular GANs, Source of Instability, Earth Mover's Distance, Kantorovich-Rubinstein Duality, Wasserstein GANs, Weight Clipping) • Derivation of Kantorovich-Rubinstein Duality • Improved Training of WGANs • Experiments
4. Regular GANs: the generator network $G(z)$ maps a prior sample $z \sim P_z(z)$ to generated data, and the discriminator network $D(x)$ (with a sigmoid output) separates real data $x \sim P_r(x)$ (label 1) from generated data (label 0). The objective is $\min_G \max_D V(D, G)$ with $V(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$.
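To make the objective concrete, here is a minimal sketch of how $V(D, G)$ turns into discriminator and generator losses; it assumes PyTorch, and the names `D`, `G`, `real`, and `z` are illustrative placeholders rather than anything from the slides.

```python
import torch

def gan_losses(D, G, real, z):
    """Losses for min_G max_D V(D, G); D is assumed to end in a sigmoid."""
    fake = G(z)
    # The discriminator ascends E[log D(x)] + E[log(1 - D(G(z)))],
    # so we minimize the negation of that quantity.
    d_loss = -(torch.log(D(real)).mean() + torch.log(1 - D(fake.detach())).mean())
    # The generator descends E[log(1 - D(G(z)))].
    g_loss = torch.log(1 - D(fake)).mean()
    return d_loss, g_loss
```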
5. Source of Instability: when the real data distribution $P_r(x)$ and the generated data distribution $P_g(x)$ are disjoint, the optimal discriminator $D^*(x)$ separates them perfectly, so $V(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$ saturates and the generator receives a vanishing gradient.
6. Earth Mover's Distance: the cost function of WGAN replaces $V(D, G)$ with the Earth Mover's Distance $EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|$.
7. Earth Mover's Distance (figure: transporting probability mass from $P_r(x)$ to $P_g(x)$)
8. Earth Mover's Distance: a transport plan $\gamma(x, y)$ moves mass from real data points $x$ to generated data points $y$ (e.g. $\gamma(x_1, y_2)$), subject to the marginal constraints $\sum_y \gamma(x, y) = P_r(x)$ and $\sum_x \gamma(x, y) = P_\theta(y)$, so that $EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|$. Photo credit: https://vincentherrmann.github.io/blog/wasserstein/
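As a quick numeric illustration of the definition (not from the slides), the EMD between two small discrete 1-D distributions can be checked with SciPy, whose `wasserstein_distance` computes the same quantity in the 1-D case; the positions and weights below are made up.

```python
import numpy as np
from scipy.stats import wasserstein_distance

positions = np.array([0.0, 1.0, 2.0])
p_r = np.array([0.5, 0.5, 0.0])      # real distribution P_r
p_theta = np.array([0.0, 0.5, 0.5])  # generated distribution P_theta

# Every unit of mass has to be moved one step to the right, so EMD = 1.0
emd = wasserstein_distance(positions, positions, u_weights=p_r, v_weights=p_theta)
print(emd)  # 1.0
```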
9. Kantorovich-Rubinstein Duality: the primal form $EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|$ is highly intractable; the dual form is $EMD(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)]$, where $\|f\|_L \le 1$ is the 1-Lipschitz constraint.
10. Wasserstein GANs: the generator network $g_\theta$ maps a prior sample $z \sim P_z(z)$ to generated data $g_\theta(z)$, and the critic network $f_w$ (no sigmoid output) scores both real data $x \sim P_r(x)$ and generated data $g_\theta(z)$. The objective is $\min_\theta \max_{w \in [-k,k]^l} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(g_\theta(z))]$, where restricting $w$ to $[-k, k]^l$ enforces a k-Lipschitz constraint.
11. Wasserstein GANs: a real function $f(x)$ is k-Lipschitz continuous if for all $x_1, x_2$, $\frac{|f(x_1) - f(x_2)|}{|x_1 - x_2|} \le k$, i.e. the graph of $f$ always stays inside the cone bounded by $g(x) = kx$. Photo credit: https://en.wikipedia.org/wiki/Lipschitz_continuity
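A tiny sanity check of this definition (illustrative only, not from the slides): estimate the smallest $k$ for a given function by sampling random pairs of points and taking the largest observed slope.

```python
import numpy as np

def estimate_lipschitz(f, low=-5.0, high=5.0, n=100_000, seed=0):
    """Approximate max |f(x1) - f(x2)| / |x1 - x2| over random pairs."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(low, high, n)
    x2 = rng.uniform(low, high, n)
    return np.max(np.abs(f(x1) - f(x2)) / np.abs(x1 - x2))

print(estimate_lipschitz(np.sin))   # close to 1: sin is 1-Lipschitz
print(estimate_lipschitz(np.tanh))  # close to 1: tanh is 1-Lipschitz
```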
12. Weight Clipping: $f(x)$ is a multi-layer neural network, so a k-Lipschitz constraint is enforced by clipping its weights to $w \in [-c, c]^l$.
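A minimal sketch of one WGAN critic update followed by weight clipping, assuming PyTorch; `critic`, `generator`, `opt_critic`, `z_dim`, and the clip value are illustrative assumptions, not taken from the slides.

```python
import torch

def critic_step(critic, generator, real, opt_critic, clip=0.01, z_dim=100):
    """One update of f_w for max_w E[f_w(x)] - E[f_w(g_theta(z))], then clipping."""
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = generator(z).detach()          # the generator is not updated here

    # Maximizing the critic objective = minimizing its negation
    loss = -(critic(real).mean() - critic(fake).mean())

    opt_critic.zero_grad()
    loss.backward()
    opt_critic.step()

    # Enforce w in [-c, c]^l by clipping every critic parameter
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-clip, clip)
    return loss.item()
```

In the full WGAN algorithm this critic step is typically repeated several times for every generator update.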
13. Weight Clipping
14. Outline • Wasserstein GANs • Derivation of Kantorovich-Rubinstein Duality (Earth Mover's Distance, Linear Programming, Dual Form) • Improved Training of WGANs • Experiments
15. Derivation of Kantorovich-Rubinstein Duality • Wasserstein GAN and the Kantorovich-Rubinstein Duality: https://vincentherrmann.github.io/blog/wasserstein/ • Optimal Transportation: Continuous and Discrete: http://smat.epfl.ch/~zemel/vt/pdm.pdf • Optimal Transport: Old and New: http://www.springer.com/br/book/9783540710493
16. Earth Mover's Distance: arrange the transport plan and the pairwise distances as $n \times n$ matrices $\Gamma = [\gamma(x_i, y_j)]$ and $D = [\|x_i - y_j\|]$; the transport cost is then the Frobenius inner product, i.e. the sum of all element-wise products, so $EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \langle D, \Gamma \rangle_F$. Photo credit: https://vincentherrmann.github.io/blog/wasserstein/
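A small illustration of the Frobenius inner product as a sum of element-wise products, with made-up 2x2 distance and transport-plan matrices:

```python
import numpy as np

D = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # D[i, j] = ||x_i - y_j||
Gamma = np.array([[0.25, 0.25],
                  [0.00, 0.50]])    # a candidate transport plan gamma(x_i, y_j)

frobenius = np.sum(D * Gamma)       # <D, Gamma>_F: sum of element-wise products
print(frobenius)                    # 0.25
```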
17. Linear Programming: a linear program minimizes an objective function $z = c^T x$ subject to the constraints $Ax = b$ and $x \ge 0$. For the EMD, the objective is $EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \langle D, \Gamma \rangle_F$ and the constraints are $\sum_y \gamma(x, y) = P_r(x)$, $\sum_x \gamma(x, y) = P_\theta(y)$, and $\gamma(x, y) \ge 0$ for all $x, y$.
18. Linear Programming: in the objective function $z = c^T x$, set $c = \mathrm{vec}(D) = (\|x_1 - y_1\|, \|x_1 - y_2\|, \dots, \|x_n - y_n\|)^T$ and $x = \mathrm{vec}(\Gamma) = (\gamma(x_1, y_1), \gamma(x_1, y_2), \dots, \gamma(x_n, y_n))^T$, so that $c^T x = \langle D, \Gamma \rangle_F$.
19. Linear Programming: the marginal constraints $\sum_y \gamma(x, y) = P_r(x)$ and $\sum_x \gamma(x, y) = P_\theta(y)$ become $Ax = b$ with $x = \mathrm{vec}(\Gamma)$, $b = (P_r(x_1), \dots, P_r(x_n), P_\theta(y_1), \dots, P_\theta(y_n))^T$, and $A$ a 0/1 matrix whose first $n$ rows sum the rows of $\Gamma$ and whose last $n$ rows sum its columns.
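Putting slides 17-19 together, here is a minimal sketch that solves the EMD linear program with `scipy.optimize.linprog` (minimize $c^T x$ subject to $Ax = b$, $x \ge 0$); the three-point distributions are made up and match the earlier 1-D example.

```python
import numpy as np
from scipy.optimize import linprog

pts = np.array([0.0, 1.0, 2.0])            # shared support of P_r and P_theta
p_r = np.array([0.5, 0.5, 0.0])
p_theta = np.array([0.0, 0.5, 0.5])
n = len(pts)

# c = vec(D): flattened distance matrix D[i, j] = ||x_i - y_j||
c = np.abs(pts[:, None] - pts[None, :]).ravel()

# A encodes the marginal constraints on x = vec(Gamma):
# the first n rows give the row sums (P_r), the last n rows the column sums (P_theta)
A = np.zeros((2 * n, n * n))
for i in range(n):
    A[i, i * n:(i + 1) * n] = 1.0          # sum_y gamma(x_i, y) = P_r(x_i)
    A[n + i, i::n] = 1.0                   # sum_x gamma(x, y_i) = P_theta(y_i)
b = np.concatenate([p_r, p_theta])

res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))
print(res.fun)                             # 1.0, the same EMD as before
```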
20. Dual Form: the primal problem minimizes $z = c^T x$ subject to $Ax = b$ and $x \ge 0$; the dual problem maximizes $\tilde z = b^T y$ subject to $A^T y \le c$. Weak duality: $\tilde z = y^T b = y^T A x \le c^T x = z$, so $\tilde z$ is a lower bound of $z$. Strong duality: at the optimum, $z = \tilde z$.
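To see weak and strong duality on a concrete instance, the sketch below solves both the primal and the dual of the same toy EMD problem used above with `scipy.optimize.linprog`; `linprog` minimizes, so the dual objective is negated and the dual variables are left free. The numbers are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

pts = np.array([0.0, 1.0, 2.0])
p_r, p_theta = np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.5, 0.5])
n = len(pts)

c = np.abs(pts[:, None] - pts[None, :]).ravel()          # vec(D)
A = np.vstack([np.kron(np.eye(n), np.ones((1, n))),      # row sums -> P_r
               np.kron(np.ones((1, n)), np.eye(n))])     # column sums -> P_theta
b = np.concatenate([p_r, p_theta])

primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))          # min c^T x
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None))      # max b^T y
print(primal.fun, -dual.fun)   # both 1.0: z equals z~ at the optimum
```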
21. Dual Form: writing the dual variables as $y = (f, g)^T$ with $f = (f(x_1), \dots, f(x_n))^T$, $g = (g(x_1), \dots, g(x_n))^T$, and $b = (P_r, P_\theta)^T$, the dual objective $\tilde z = b^T y$ becomes $EMD(P_r, P_\theta) = f^T P_r + g^T P_\theta$.
22. Dual Form: the constraint $A^T y \le c = \mathrm{vec}(D)$ expands row by row into $f(x_i) + g(x_j) \le \|x_i - x_j\|$ for all $i, j$.
23. Dual Form: maximize $EMD(P_r, P_\theta) = f^T P_r + g^T P_\theta$ subject to $f(x_i) + g(x_j) \le \|x_i - x_j\|$ for all $i, j$. Taking $i = j$ gives $f(x_i) + g(x_i) \le \|x_i - x_i\| = 0$, so the maximum is attained with $g(x_i) = -f(x_i)$.
24. Dual Form: substituting $g(x_j) = -f(x_j)$ into $f(x_i) + g(x_j) \le \|x_i - x_j\|$ for all $i, j$ gives $f(x_i) - f(x_j) \le \|x_i - x_j\|$ and $f(x_i) - f(x_j) \ge -\|x_i - x_j\|$, i.e. $-1 \le \frac{f(x_i) - f(x_j)}{\|x_i - x_j\|} \le 1$: the slope of $f$ must lie between -1 and 1, which is exactly the 1-Lipschitz constraint $\|f\|_L \le 1$. Therefore $EMD(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)]$.
25. Outline • Wasserstein GANs • Derivation of Kantorovich-Rubinstein Duality • Improved Training of WGANs (Difficulties with weight constraints, Gradient penalty) • Experiments
26. Difficulties with weight constraints • Capacity underuse: weights attain their maximum or minimum values, so the critic can only learn simple functions • Exploding and vanishing gradients: a clipping parameter that is too large -> exploding gradients; too small -> vanishing gradients
27. Difficulties with weight constraints • Capacity underuse
28. Difficulties with weight constraints • Capacity underuse
29. Difficulties with weight constraints • Exploding and vanishing gradients
30. Gradient penalty: the optimal critic has gradients with norm 1 almost everywhere under $P_r$ and $P_g$: for $x \sim P_r$, $y \sim P_g$, and $x_t = (1 - t)x + ty$, the optimal critic satisfies $\nabla f^*(x_t) = \frac{y - x_t}{\|y - x_t\|}$, hence $\|\nabla f^*(x_t)\| = 1$. The improved loss therefore adds a gradient penalty to the original critic loss: $L = \mathbb{E}_{\tilde x \sim P_g}[f(\tilde x)] - \mathbb{E}_{x \sim P_r}[f(x)] + \lambda \, \mathbb{E}_{x_t \sim P_t}[(\|\nabla_{x_t} f(x_t)\| - 1)^2]$.
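A minimal sketch of the gradient penalty term, assuming PyTorch; `critic`, `real`, `fake`, and the coefficient `lambda_gp` are illustrative names (the coefficient corresponds to $\lambda$ above), not the authors' reference implementation.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """lambda * E[(||grad f(x_t)|| - 1)^2] for x_t sampled between real and fake."""
    # One interpolation coefficient t per sample, broadcast over the data dimensions
    t = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_t = ((1 - t) * real.detach() + t * fake.detach()).requires_grad_(True)

    out = critic(x_t)

    # Gradient of the critic output w.r.t. the interpolates, kept in the graph
    # so the penalty itself can be backpropagated into the critic's weights
    grads = torch.autograd.grad(
        outputs=out, inputs=x_t,
        grad_outputs=torch.ones_like(out),
        create_graph=True, retain_graph=True)[0]

    grad_norm = grads.reshape(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

This term would be added to the critic loss $\mathbb{E}_{\tilde x \sim P_g}[f(\tilde x)] - \mathbb{E}_{x \sim P_r}[f(x)]$ before the backward pass, and no weight clipping is needed.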
31. Gradient penalty
32. Outline • Wasserstein GANs • Derivation of Kantorovich-Rubinstein Duality • Improved Training of WGANs • Experiments (Architecture robustness on LSUN bedrooms, Character-level language modeling)
33. Architecture robustness on LSUN bedrooms
34. Character-level language modeling
35. References • Towards Principled Methods for Training Generative Adversarial Networks: https://arxiv.org/abs/1701.04862 • Wasserstein GAN: https://arxiv.org/abs/1701.07875 • Wasserstein GAN and the Kantorovich-Rubinstein Duality: https://vincentherrmann.github.io/blog/wasserstein/ • Improved Training of Wasserstein GANs: https://arxiv.org/abs/1704.00028
36. About the Speaker: Mark Chang • Email: ckmarkoh at gmail dot com • Blog: https://ckmarkoh.github.io/ • Github: https://github.com/ckmarkoh • Slideshare: http://www.slideshare.net/ckmarkohchang • Youtube: https://www.youtube.com/channel/UCckNPGDL21aznRhl3EijRQw • Research Engineer, Deep Learning Algorithms Research, HTC Research & Healthcare
