What can a statistician expect from GANs?

Maxime Sangnier (Sorbonne Université) at the International Workshop Machine Learning and Artificial Intelligence at Télécom ParisTech


  1. 1. GANs from a statistical point of view Maxime Sangnier International workshop Machine Learning & Artificial Intelligence September 17, 2018 Sorbonne Université, CNRS, LPSM, LIP6, Paris, France Joint work with Gérard Biau (Sorbonne Université, CNRS, LPSM, Paris, France), Benoît Cadre (ENS Rennes, Univ Rennes, CNRS, IRMAR, Rennes, France) and Ugo Tanielian (Sorbonne Université, CNRS, LPSM, Paris, France & Criteo, Paris, France)
  2. 2. Contributors Gérard Biau (Sorbonne Université) Benoît Cadre (ENS Rennes) Ugo Tanielian (Sorbonne Université & Criteo) 1
  3. 3. Generative models
  4. 4. Motivation Generative models aim at generating artificial content. • Images: • merchandising; • painting; • art; • super-resolution and denoising; • text to image. • Movies: • pose to movie; • Audio: • speech synthesis; • music. 2
  5. 5. Merchandising vue.ai 3
  6. 6. Art prisma-ai.com 4
  7. 7. Painting Interactive GAN.1 1 J.-Y. Zhu et al. “Generative Visual Manipulation on the Natural Image Manifold”. In: European Conference on Computer Vision. 2016. 5
  8. 8. Superresolution SuperResolution GAN.2 2 C. Ledig et al. “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network”. In: arXiv:1609.04802 [cs, stat] (2016). 6
  9. 9. Text-to-image Stacked GAN.3 3 H. Zhang et al. “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”. In: arXiv:1612.03242 [cs, stat] (2016). 7
  10. 10. Movies Everybody Dance Now.4 4 C. Chan et al. “Everybody Dance Now”. In: arXiv:1808.07371 [cs] (2018). 8
  11. 11. Speech synthesis WaveNet by DeepMind. 9
  12. 12. Motivation Generative models aim at generating artificial content. • Outstanding image generation and extrapolation5. • And even more I’m not aware of... 5 T. Karras et al. “Progressive Growing of GANs for Improved Quality, Stability, and Variation”. In: International Conference on Learning Representations. 2018. 10
  13. 13. Motivation Generative models aim at generating artificial content. • Outstanding image generation and extrapolation5. • And even more I’m not aware of... Generative models are used for: • exploring unseen realities; • providing many answers to a single question. 5 Karras et al., “Progressive Growing of GANs for Improved Quality, Stability, and Variation”. 10
  14. 14. Generate from data X1, . . . , Xn i.i.d. according to an unknown density p* on E ⊆ R^d. How to sample according to p*? 11
  15. 15. Generate from data X1, . . . , Xn i.i.d. according to an unknown density p* on E ⊆ R^d. How to sample according to p*? Naive approach 1. estimate p* by p̂; 2. sample according to p̂. 11
  16. 16. Generate from data X1, . . . , Xn i.i.d. according to an unknown density p* on E ⊆ R^d. How to sample according to p*? Naive approach 1. estimate p* by p̂; 2. sample according to p̂. Drawbacks • both problems are difficult in themselves; • we cannot define a realistic parametric statistical model; • non-parametric density estimation is inefficient in high dimension; • this approach violates Vapnik’s principle: When solving a problem of interest, do not solve a more general problem as an intermediate step. 11
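To make the naive two-step approach concrete, here is a minimal sketch (my own illustration, not from the talk; it assumes NumPy and SciPy, and a toy Gaussian mixture stands in for the unknown p*): estimate p* with a kernel density estimator, then resample from the estimate — exactly the route the slide argues against in high dimension.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Toy data X_1, ..., X_n drawn i.i.d. from an "unknown" density p*
# (here a two-component Gaussian mixture, for illustration only).
n = 5000
X = np.concatenate([rng.normal(-2.0, 0.5, n // 2),
                    rng.normal(1.5, 1.0, n - n // 2)])

# Step 1: estimate p* by a kernel density estimate p_hat.
p_hat = gaussian_kde(X)

# Step 2: sample new points according to p_hat.
U = p_hat.resample(size=1000).ravel()

print("data mean/std:     ", X.mean(), X.std())
print("generated mean/std:", U.mean(), U.std())
```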
  17. 17. Some generative methods Comparison of generative methods along three criteria: density-free, flexibility, simple sampling. • Autoregressive models (WaveNet6) • Nonlinear independent components analysis (Real NVP7) • Variational autoencoders8 • Boltzmann machines9 • Generative stochastic networks10 • Generative adversarial networks 6 A.v.d. Oord et al. “WaveNet: A Generative Model for Raw Audio”. In: arXiv:1609.03499 [cs] (2016). 7 L. Dinh, J. Sohl-Dickstein, and S. Bengio. “Density estimation using Real NVP”. In: arXiv:1605.08803 [cs, stat] (2016). 8 D.P. Kingma and M. Welling. “Auto-Encoding Variational Bayes”. In: International Conference on Learning Representations. 2013. 9 S.E. Fahlman, G.E. Hinton, and T.J. Sejnowski. “Massively Parallel Architectures for AI: Netl, Thistle, and Boltzmann Machines”. In: Proceedings of the Third AAAI Conference on Artificial Intelligence. 1983. 10 Y. Bengio et al. “Deep Generative Stochastic Networks Trainable by Backprop”. In: International Conference on Machine Learning. 2014. 12
  18. 18. Generative adversarial models
  19. 19. A direct approach Cornerstone: don’t estimate p*. 11 I. Goodfellow et al. “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. 2014. 13
  20. 20. A direct approach Cornerstone: don’t estimate p*. General procedure: • sample U1, . . . , Un i.i.d. thanks to a parametric model; • compare X1, . . . , Xn and U1, . . . , Un and update the model. 11 Goodfellow et al., “Generative Adversarial Nets”. 13
  21. 21. A direct approach Cornerstone: don’t estimate p*. General procedure: • sample U1, . . . , Un i.i.d. thanks to a parametric model; • compare X1, . . . , Xn and U1, . . . , Un and update the model. GANs11 follow this principle. 11 Goodfellow et al., “Generative Adversarial Nets”. 13
  22. 22. Generating a random sample Inverse transform sampling • S: scalar random variable; • F_S: cumulative distribution function of S; • Z ∼ U([0, 1]). • F_S^{-1}(Z) =d S (equality in distribution). 14
  23. 23. Generating a random sample Inverse transform sampling • S: scalar random variable; • F_S: cumulative distribution function of S; • Z ∼ U([0, 1]). • F_S^{-1}(Z) =d S (equality in distribution). Generators • X1, . . . , Xn i.i.d. according to a density p* on E ⊆ R^d, dominated by a known measure µ. • G = {Gθ : R^d′ → E}θ∈Θ, Θ ⊂ R^p: parametric family of generators (d′ ≤ d); • Z1, . . . , Zn random vectors from R^d′ (typically U([0, 1]^d′)); • Ui = Gθ(Zi): generated sample; • P = {pθ}θ∈Θ: associated family of densities such that, by definition, Gθ(Z1) has distribution pθ dµ. 14
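A minimal NumPy sketch of inverse transform sampling (my own example): push uniform noise Z ∼ U([0, 1]) through the inverse cdf of an exponential distribution, which is exactly the role the generator Gθ plays in higher dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Z ~ U([0, 1]): the latent noise.
Z = rng.uniform(size=100_000)

# Target S ~ Exp(lam): F_S(s) = 1 - exp(-lam * s), so F_S^{-1}(z) = -ln(1 - z) / lam.
lam = 2.0
S = -np.log(1.0 - Z) / lam      # F_S^{-1}(Z) has the same distribution as S

print("empirical mean:", S.mean(), " theoretical mean:", 1.0 / lam)
```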
  24. 24. Generating a random sample Remarks • Each pθ is a candidate to represent p*. 15
  25. 25. Generating a random sample Remarks • Each pθ is a candidate to represent p*. • The statistical model P = {pθ}θ∈Θ is just a mathematical tool for the analysis. 15
  26. 26. Generating a random sample Remarks • Each pθ is a candidate to represent p*. • The statistical model P = {pθ}θ∈Θ is just a mathematical tool for the analysis. • It is not assumed that p* belongs to P. 15
  27. 27. Generating a random sample Remarks • Each pθ is a candidate to represent p*. • The statistical model P = {pθ}θ∈Θ is just a mathematical tool for the analysis. • It is not assumed that p* belongs to P. • In GANs: Gθ is a neural network with p weights, stored in θ ∈ R^p. 15
  28. 28. Comparing two samples The next step • The procedure should drive θ such that Gθ(Z1) =d X1 (equality in distribution). • Need to confront Gθ(Z1), . . . , Gθ(Zn) with X1, . . . , Xn in order to update θ. 16
  29. 29. Comparing two samples The next step • The procedure should drive θ such that Gθ(Z1) =d X1 (equality in distribution). • Need to confront Gθ(Z1), . . . , Gθ(Zn) with X1, . . . , Xn in order to update θ. Supervised learning • Both samples have the same distribution as soon as we cannot distinguish them. • This is a classification problem: class Y = 0: Gθ(Z1), . . . , Gθ(Zn); class Y = 1: X1, . . . , Xn. 16
  30. 30. Comparing two samples The next step • The procedure should drive θ such that Gθ(Z1) =d X1 (equality in distribution). • Need to confront Gθ(Z1), . . . , Gθ(Zn) with X1, . . . , Xn in order to update θ. Supervised learning • Both samples have the same distribution as soon as we cannot distinguish them. • This is a classification problem: class Y = 0: Gθ(Z1), . . . , Gθ(Zn); class Y = 1: X1, . . . , Xn. 16
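As an illustration of the classification viewpoint (my own sketch, not the authors' code): label real points Y = 1 and generated points Y = 0, then fit any off-the-shelf classifier as a discriminator; if it cannot do better than 50% accuracy, the two samples are indistinguishable to it. Here a deliberately imperfect "generator" produces Gaussians with the wrong mean and variance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Real sample X_1, ..., X_n and generated sample G_theta(Z_1), ..., G_theta(Z_n)
# (the "generator" is deliberately off: wrong mean and variance).
X_real = rng.normal(0.0, 1.0, size=(n, 1))
X_fake = rng.normal(0.5, 1.5, size=(n, 1))

# Class Y = 1 for real observations, Y = 0 for generated points.
X = np.vstack([X_real, X_fake])
y = np.concatenate([np.ones(n), np.zeros(n)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
D = LogisticRegression().fit(X_tr, y_tr)

# Accuracy well above 0.5 => the discriminator tells the two samples apart.
print("discriminator accuracy:", D.score(X_te, y_te))
```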
  31. 31. Adversarial principle Discriminator • D a family of functions from E to [0, 1]: the discriminators. 17
  32. 32. Adversarial principle Discriminator • D a family of functions from E to [0, 1]: the discriminators. • Choose D ∈ D such that for any x ∈ E, D(x) ≥ 1/2 =⇒ true observation; D(x) < 1/2 =⇒ fake (generated) point. 17
  33. 33. Adversarial principle Discriminator • D a family of functions from E to [0, 1]: the discriminators. • Choose D ∈ D such that for any x ∈ E, D(x) ≥ 1/2 =⇒ true observation; D(x) < 1/2 =⇒ fake (generated) point. • Assume {(X1, 1), . . . , (Xn, 1), (Gθ(Z1), 0), . . . , (Gθ(Zn), 0)} i.i.d. with the same distribution as (X, Y). 17
  34. 34. Adversarial principle Discriminator • D a family of functions from E to [0, 1]: the discriminators. • Choose D ∈ D such that for any x ∈ E, D(x) ≥ 1/2 =⇒ true observation; D(x) < 1/2 =⇒ fake (generated) point. • Assume {(X1, 1), . . . , (Xn, 1), (Gθ(Z1), 0), . . . , (Gθ(Zn), 0)} i.i.d. with the same distribution as (X, Y). • Classification model: Y|X = x ∼ B(D(x)), i.e. P(Y = 1|X = x) = D(x). 17
  35. 35. Adversarial principle Discriminator • D a family of functions from E to [0, 1]: the discriminators. • Choose D ∈ D such that for any x ∈ E, D(x) ≥ 1/2 =⇒ true observation; D(x) < 1/2 =⇒ fake (generated) point. • Assume {(X1, 1), . . . , (Xn, 1), (Gθ(Z1), 0), . . . , (Gθ(Zn), 0)} i.i.d. with the same distribution as (X, Y). • Classification model: Y|X = x ∼ B(D(x)), i.e. P(Y = 1|X = x) = D(x). • Maximum (conditional) likelihood estimation: sup_{D∈D} Π_{i=1}^n D(Xi) × Π_{i=1}^n (1 − D(Gθ(Zi))), or equivalently sup_{D∈D} L̂(θ, D), with L̂(θ, D) = (1/n) [ Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))) ]. 17
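The empirical criterion L̂(θ, D) is just a rescaled Bernoulli log-likelihood of the discriminator on the pooled labelled sample. A small sketch under my own conventions (D maps into (0, 1); the sigmoid discriminator and the toy samples below are placeholders, not the authors' choices):

```python
import numpy as np

def gan_criterion(D, X, GZ, eps=1e-12):
    """Empirical criterion L_hat(theta, D) =
       (1/n) * [sum_i ln D(X_i) + sum_i ln(1 - D(G_theta(Z_i)))]."""
    n = len(X)
    return (np.sum(np.log(D(X) + eps))
            + np.sum(np.log(1.0 - D(GZ) + eps))) / n

# Example with a hypothetical sigmoid discriminator and toy samples.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 1000)                            # real sample
GZ = rng.normal(1.0, 1.0, 1000)                           # generated sample G_theta(Z_i)
D = lambda x: 1.0 / (1.0 + np.exp(-2.0 * (x - 0.5)))      # crude discriminator

print("L_hat(theta, D) =", gan_criterion(D, X, GZ))
```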
  36. 36. Adversarial principle Generator • sup_{D∈D} L̂(θ, D) acts like a divergence between the distributions of Gθ(Z1), . . . , Gθ(Zn) and X1, . . . , Xn. 18
  37. 37. Adversarial principle Generator • sup_{D∈D} L̂(θ, D) acts like a divergence between the distributions of Gθ(Z1), . . . , Gθ(Zn) and X1, . . . , Xn. • Minimum divergence estimation: inf_{θ∈Θ} sup_{D∈D} L̂(θ, D), i.e. inf_{θ∈Θ} sup_{D∈D} Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))). 18
  38. 38. Adversarial principle Generator • sup_{D∈D} L̂(θ, D) acts like a divergence between the distributions of Gθ(Z1), . . . , Gθ(Zn) and X1, . . . , Xn. • Minimum divergence estimation: inf_{θ∈Θ} sup_{D∈D} L̂(θ, D), i.e. inf_{θ∈Θ} sup_{D∈D} Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))). • Adversarial, minimax or zero-sum game. 18
  39. 39. The GAN Zoo Avinash Hindupur’s Github. 19
  40. 40. The GAN Zoo Curbing the discriminator • least squares12: inf_{D∈D} Σ_{i=1}^n (D(Xi) − 1)^2 + Σ_{i=1}^n D(Gθ(Zi))^2, inf_{θ∈Θ} Σ_{i=1}^n (D(Gθ(Zi)) − 1)^2. • asymmetric hinge13: inf_{D∈D} − Σ_{i=1}^n D(Xi) + Σ_{i=1}^n max(0, 1 − D(Gθ(Zi))), inf_{θ∈Θ} − Σ_{i=1}^n D(Gθ(Zi)). 12 X. Mao et al. “Least Squares Generative Adversarial Networks”. In: IEEE International Conference on Computer Vision. 2017. 13 J. Zhao, M. Mathieu, and Y. LeCun. “Energy-based Generative Adversarial Network”. In: International Conference on Learning Representations. 2017. 20
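For concreteness, here is a sketch (my notation, not the authors') of the least-squares criteria above written as plain NumPy functions; the first is minimized over the discriminator, the second over the generator parameter θ.

```python
import numpy as np

def lsgan_discriminator_loss(D_real, D_fake):
    """sum_i (D(X_i) - 1)^2 + sum_i D(G_theta(Z_i))^2, minimized over D."""
    return np.sum((D_real - 1.0) ** 2) + np.sum(D_fake ** 2)

def lsgan_generator_loss(D_fake):
    """sum_i (D(G_theta(Z_i)) - 1)^2, minimized over theta."""
    return np.sum((D_fake - 1.0) ** 2)

# Toy discriminator outputs on real and generated points.
rng = np.random.default_rng(0)
D_real = rng.uniform(0.6, 1.0, 100)   # discriminator scores on X_i
D_fake = rng.uniform(0.0, 0.4, 100)   # discriminator scores on G_theta(Z_i)

print("D loss:", lsgan_discriminator_loss(D_real, D_fake))
print("G loss:", lsgan_generator_loss(D_fake))
```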
  41. 41. The GAN Zoo Metrics as minimax games • Maximum mean discrepancy14 and Wasserstein15: inf_{θ∈Θ} sup_{T∈T} ∫ T p* dµ − ∫ T pθ dµ. • f-divergences16: inf_{θ∈Θ} sup_{T∈T} ∫ T p* dµ − ∫ (f* ◦ T) pθ dµ. With T a prescribed class of functions and f* the convex conjugate of a lower-semicontinuous function f. 14 G.K. Dziugaite, D.M. Roy, and Z. Ghahramani. “Training generative neural networks via Maximum Mean Discrepancy optimization”. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. 2015; Y. Li, K. Swersky, and R. Zemel. “Generative Moment Matching Networks”. In: International Conference on Machine Learning. 2015. 15 M. Arjovsky, S. Chintala, and L. Bottou. “Wasserstein Generative Adversarial Networks”. In: International Conference on Machine Learning. 2017. 16 S. Nowozin, B. Cseke, and R. Tomioka. “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”. In: Neural Information Processing Systems. June 2016. 21
  42. 42. Roadmap • Minimum divergence estimation: uniqueness of minimizers. • Approximation properties: importance of the family of discriminators for the quality of the approximation. • Statistical analysis: consistency and rate of convergence. 22
  43. 43. Minimum divergence estimation
  44. 44. Kullback-Leibler and Jensen divergences Kullback-Leibler • For P ≪ Q probability measures on E: DKL(P ‖ Q) = ∫ ln(dP/dQ) dP. • Properties: DKL(P ‖ Q) ≥ 0 and DKL(P ‖ Q) = 0 ⇐⇒ P = Q. • If p = dP/dµ and q = dQ/dµ: DKL(P ‖ Q) = ∫ p ln(p/q) dµ. • DKL is not symmetric and defined only for P ≪ Q. 23
  45. 45. Kullback-Leibler and Jensen divergences Kullback-Leibler • For P ≪ Q probability measures on E: DKL(P ‖ Q) = ∫ ln(dP/dQ) dP. • Properties: DKL(P ‖ Q) ≥ 0 and DKL(P ‖ Q) = 0 ⇐⇒ P = Q. • If p = dP/dµ and q = dQ/dµ: DKL(P ‖ Q) = ∫ p ln(p/q) dµ. • DKL is not symmetric and defined only for P ≪ Q. 23
  46. 46. Kullback-Leibler and Jensen divergences Jensen-Shannon • For P and Q probability measures on E: DJS(P, Q) = (1/2) DKL(P ‖ (P + Q)/2) + (1/2) DKL(Q ‖ (P + Q)/2). • Property: 0 ≤ DJS(P, Q) ≤ ln 2. • (P, Q) → √DJS(P, Q) is a distance. 24
  47. 47. Kullback-Leibler and Jensen divergences Jensen-Shannon • For P and Q probability measures on E: DJS(P, Q) = (1/2) DKL(P ‖ (P + Q)/2) + (1/2) DKL(Q ‖ (P + Q)/2). • Property: 0 ≤ DJS(P, Q) ≤ ln 2. • (P, Q) → √DJS(P, Q) is a distance. 24
  48. 48. Kullback-Leibler and Jensen divergences Jensen-Shannon • For P and Q probability measures on E: DJS(P, Q) = (1/2) DKL(P ‖ (P + Q)/2) + (1/2) DKL(Q ‖ (P + Q)/2). • Property: 0 ≤ DJS(P, Q) ≤ ln 2. • (P, Q) → √DJS(P, Q) is a distance. 24
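A small numerical sketch (my own, with discrete distributions standing in for densities and µ the counting measure) of the Kullback-Leibler and Jensen-Shannon divergences defined above, including the 0 ≤ DJS ≤ ln 2 bound and the symmetry of DJS:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x p(x) * ln(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def js(p, q):
    """D_JS(P, Q) = 0.5 * D_KL(P || M) + 0.5 * D_KL(Q || M) with M = (P + Q) / 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
r = np.array([0.0, 0.0, 1.0])   # nearly disjoint support

print("JS(p, q) =", js(p, q))                            # small, distributions are close
print("JS(p, r) =", js(p, r), "<= ln 2 =", np.log(2))    # always bounded by ln 2
print("JS symmetric:", np.isclose(js(p, q), js(q, p)))
```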
  49. 49. GAN and Jensen-Shannon divergence GANs • Empirical criterion: L̂(θ, D) = (1/n) [ Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))) ]. • Problem: inf_{θ∈Θ} sup_{D∈D} L̂(θ, D). 25
  50. 50. GAN and Jensen-Shannon divergence GANs • Empirical criterion: L̂(θ, D) = (1/n) [ Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))) ]. • Problem: inf_{θ∈Θ} sup_{D∈D} L̂(θ, D). Ideal GANs • Population version of the criterion: L(θ, D) = ∫ ln(D) p* dµ + ∫ ln(1 − D) pθ dµ. • No constraint: D = D∞, the set of all functions from E to [0, 1]. • Problem: inf_{θ∈Θ} sup_{D∈D∞} L(θ, D). 25
  51. 51. GAN and Jensen-Shannon divergence From GAN to JS divergence • Criterion: sup_{D∈D∞} L(θ, D) = sup_{D∈D∞} ∫ [ln(D) p* + ln(1 − D) pθ] dµ ≤ ∫ sup_{D∈D∞} [ln(D) p* + ln(1 − D) pθ] dµ (the supremum of the integral is bounded by the integral of the pointwise supremum). 26
  52. 52. GAN and Jensen-Shannon divergence From GAN to JS divergence • Criterion: sup_{D∈D∞} L(θ, D) = sup_{D∈D∞} ∫ [ln(D) p* + ln(1 − D) pθ] dµ ≤ ∫ sup_{D∈D∞} [ln(D) p* + ln(1 − D) pθ] dµ (the supremum of the integral is bounded by the integral of the pointwise supremum). • Optimal discriminator: D*θ = p*/(p* + pθ), with the convention 0/0 = 0. 26
  53. 53. GAN and Jensen-Shannon divergence From GAN to JS divergence • Criterion: sup_{D∈D∞} L(θ, D) = sup_{D∈D∞} ∫ [ln(D) p* + ln(1 − D) pθ] dµ ≤ ∫ sup_{D∈D∞} [ln(D) p* + ln(1 − D) pθ] dµ (the supremum of the integral is bounded by the integral of the pointwise supremum). • Optimal discriminator: D*θ = p*/(p* + pθ), with the convention 0/0 = 0. • Optimal criterion: sup_{D∈D∞} L(θ, D) = L(θ, D*θ) = 2 DJS(p*, pθ) − ln 4. 26
  54. 54. GAN and Jensen-Shannon divergence From GAN to JS divergence • Criterion: sup_{D∈D∞} L(θ, D) = sup_{D∈D∞} ∫ [ln(D) p* + ln(1 − D) pθ] dµ ≤ ∫ sup_{D∈D∞} [ln(D) p* + ln(1 − D) pθ] dµ (the supremum of the integral is bounded by the integral of the pointwise supremum). • Optimal discriminator: D*θ = p*/(p* + pθ), with the convention 0/0 = 0. • Optimal criterion: sup_{D∈D∞} L(θ, D) = L(θ, D*θ) = 2 DJS(p*, pθ) − ln 4. • Problem: inf_{θ∈Θ} sup_{D∈D∞} L(θ, D) = inf_{θ∈Θ} L(θ, D*θ) = 2 inf_{θ∈Θ} DJS(p*, pθ) − ln 4. 26
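Continuing the discrete sketch (again my own illustration), one can check numerically that plugging the pointwise optimal discriminator D*θ = p*/(p* + pθ) into the population criterion L(θ, D) recovers 2 DJS(p*, pθ) − ln 4:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Discrete stand-ins for p* and p_theta (mu = counting measure).
p_star  = np.array([0.1, 0.4, 0.5])
p_theta = np.array([0.3, 0.3, 0.4])

# Optimal discriminator D*_theta = p* / (p* + p_theta).
D_star = p_star / (p_star + p_theta)

# Population criterion L(theta, D) = sum ln(D) p* + sum ln(1 - D) p_theta.
L = np.sum(np.log(D_star) * p_star) + np.sum(np.log(1.0 - D_star) * p_theta)

print("L(theta, D*_theta)           =", L)
print("2 * D_JS(p*, p_theta) - ln 4 =", 2.0 * js(p_star, p_theta) - np.log(4.0))
```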
  55. 55. The quest for D*θ Numerical approach • Big n, big D: try to approximate D*θ with arg max_{D∈D} L̂(θ, D). • Close to divergence minimization: sup_{D∈D} L̂(θ, D) ≈ 2 DJS(p*, pθ) − ln 4. 17 Goodfellow et al., “Generative Adversarial Nets”. 27
  56. 56. The quest for D*θ Numerical approach • Big n, big D: try to approximate D*θ with arg max_{D∈D} L̂(θ, D). • Close to divergence minimization: sup_{D∈D} L̂(θ, D) ≈ 2 DJS(p*, pθ) − ln 4. Theorem Let θ ∈ Θ and Aθ = {p* = pθ = 0}. If µ(Aθ) = 0, then {D*θ} = arg max_{D∈D∞} L(θ, D). If µ(Aθ) > 0, then D*θ is unique only on E \ Aθ. Completes Proposition 1 in17. 17 Goodfellow et al., “Generative Adversarial Nets”. 27
  57. 57. Oracle parameter • Oracle parameter regarding the Jensen-Shannon divergence: θ* ∈ arg min_{θ∈Θ} L(θ, D*θ) = arg min_{θ∈Θ} DJS(p*, pθ). • Gθ* is the ideal generator. • If p* ∈ P: p* = pθ*, DJS(p*, pθ*) = 0 and D*θ* = 1/2. • What if p* ∉ P? Existence and uniqueness of θ*? 28
  58. 58. Oracle parameter • Oracle parameter regarding the Jensen-Shannon divergence: θ* ∈ arg min_{θ∈Θ} L(θ, D*θ) = arg min_{θ∈Θ} DJS(p*, pθ). • Gθ* is the ideal generator. • If p* ∈ P: p* = pθ*, DJS(p*, pθ*) = 0 and D*θ* = 1/2. • What if p* ∉ P? Existence and uniqueness of θ*? Theorem Assume that P is a convex and compact set for the JS distance. If p* > 0 µ-almost everywhere, then there exists p̄ ∈ P such that {p̄} = arg min_{p∈P} DJS(p*, p). In addition, if the model P is identifiable, then there exists θ* ∈ Θ such that {θ*} = arg min_{θ∈Θ} L(θ, D*θ). 28
  59. 59. Oracle parameter Existence and uniqueness • Compactness of P and continuity of DJS(p*, ·). • p* > 0 µ-a.e. enables strict convexity of DJS(p*, ·). 29
  60. 60. Oracle parameter Existence and uniqueness • Compactness of P and continuity of DJS(p*, ·). • p* > 0 µ-a.e. enables strict convexity of DJS(p*, ·). Compactness of P with respect to the JS distance 1. Θ compact and P convex. 2. For all x ∈ E, θ ∈ Θ → pθ(x) is continuous. 3. sup_{(θ,θ′)∈Θ^2} |pθ ln pθ′| ∈ L^1(µ). 29
  61. 61. Oracle parameter Existence and uniqueness • Compactness of P and continuity of DJS(p*, ·). • p* > 0 µ-a.e. enables strict convexity of DJS(p*, ·). Compactness of P with respect to the JS distance 1. Θ compact and P convex. 2. For all x ∈ E, θ ∈ Θ → pθ(x) is continuous. 3. sup_{(θ,θ′)∈Θ^2} |pθ ln pθ′| ∈ L^1(µ). Identifiability High-dimensional parametric setting often misspecified =⇒ identifiability not satisfied. 29
  62. 62. Approximation properties
  63. 63. From JS divergence to likelihood GAN ≠ JS divergence • GANs don’t minimize the Jensen-Shannon divergence. • Considering sup_{D∈D∞} L(θ, D) means knowing D*θ = p*/(p* + pθ), thus knowing p*. 30
  64. 64. From JS divergence to likelihood GAN ≠ JS divergence • GANs don’t minimize the Jensen-Shannon divergence. • Considering sup_{D∈D∞} L(θ, D) means knowing D*θ = p*/(p* + pθ), thus knowing p*. Parametrized discriminators • D = {Dα}α∈Λ, Λ ⊂ R^q: parametric family of discriminators. • Likelihood-type problem with two parametric families: inf_{θ∈Θ} sup_{α∈Λ} L(θ, Dα). • Likelihood parameter: θ̄ ∈ arg min_{θ∈Θ} sup_{α∈Λ} L(θ, Dα). • How close is the best candidate pθ̄ to the ideal density pθ*? • How does it depend on the capability of D to approximate D*θ? 30
  65. 65. Approximation result (Hε) There exist ε > 0, m ∈ (0, 1/2) and D ∈ D ∩ L^2(µ) such that m ≤ D ≤ 1 − m and ‖D − D*θ̄‖_2 ≤ ε. 31
  66. 66. Approximation result (Hε) There exist ε > 0, m ∈ (0, 1/2) and D ∈ D ∩ L^2(µ) such that m ≤ D ≤ 1 − m and ‖D − D*θ̄‖_2 ≤ ε. Theorem Assume that, for some M > 0, p* ≤ M and pθ̄ ≤ M. Then, under Assumption (Hε) with ε < 1/(2M), there exists a constant c1 > 0 (depending only upon m and M) such that DJS(p*, pθ̄) − min_{θ∈Θ} DJS(p*, pθ) ≤ c1 ε^2. 31
  67. 67. Approximation result (Hε) There exist ε > 0, m ∈ (0, 1/2) and D ∈ D ∩ L^2(µ) such that m ≤ D ≤ 1 − m and ‖D − D*θ̄‖_2 ≤ ε. Theorem Assume that, for some M > 0, p* ≤ M and pθ̄ ≤ M. Then, under Assumption (Hε) with ε < 1/(2M), there exists a constant c1 > 0 (depending only upon m and M) such that DJS(p*, pθ̄) − min_{θ∈Θ} DJS(p*, pθ) ≤ c1 ε^2. Remarks As soon as the class D becomes richer: • minimizing sup_{α∈Λ} L(θ, Dα) over Θ helps minimize DJS(p*, pθ); • since under some assumptions {pθ*} = arg min_{pθ: θ∈Θ} DJS(p*, pθ), pθ̄ comes closer to pθ*. 31
  68. 68. Statistical analysis
  69. 69. The estimation problem Estimator θ̂ ∈ arg min_{θ∈Θ} sup_{α∈Λ} L̂(θ, α), where L̂(θ, α) = (1/n) [ Σ_{i=1}^n ln(Dα(Xi)) + Σ_{i=1}^n ln(1 − Dα(Gθ(Zi))) ]. 32
  70. 70. The estimation problem Estimator θ̂ ∈ arg min_{θ∈Θ} sup_{α∈Λ} L̂(θ, α), where L̂(θ, α) = (1/n) [ Σ_{i=1}^n ln(Dα(Xi)) + Σ_{i=1}^n ln(1 − Dα(Gθ(Zi))) ]. (Hreg) Regularity conditions of order 1 on the models (Gθ, pθ and Dα). Existence Under (Hreg), θ̂ exists (and so does θ̄). 32
  71. 71. The estimation problem Estimator θ̂ ∈ arg min_{θ∈Θ} sup_{α∈Λ} L̂(θ, α), where L̂(θ, α) = (1/n) [ Σ_{i=1}^n ln(Dα(Xi)) + Σ_{i=1}^n ln(1 − Dα(Gθ(Zi))) ]. (Hreg) Regularity conditions of order 1 on the models (Gθ, pθ and Dα). Existence Under (Hreg), θ̂ exists (and so does θ̄). Questions • How far is DJS(p*, pθ̂) from min_{θ∈Θ} DJS(p*, pθ) = DJS(p*, pθ*)? • Does θ̂ converge towards θ̄ as n → ∞? • What is the asymptotic distribution of θ̂ − θ̄? 32
  72. 72. Non-asymptotic bound on the JS divergence (H′ε) There exist ε > 0, m ∈ (0, 1/2) such that for all θ ∈ Θ, there exists D ∈ D with m ≤ D ≤ 1 − m and ‖D − D*θ‖_2 ≤ ε. 33
  73. 73. Non-asymptotic bound on the JS divergence (H′ε) There exist ε > 0, m ∈ (0, 1/2) such that for all θ ∈ Θ, there exists D ∈ D with m ≤ D ≤ 1 − m and ‖D − D*θ‖_2 ≤ ε. Theorem Assume that, for some M > 0, p* ≤ M and pθ ≤ M for all θ ∈ Θ. Then, under Assumptions (Hreg) and (H′ε) with ε < 1/(2M), there exist two constants c1 > 0 (depending only upon m and M) and c2 such that E[DJS(p*, pθ̂)] − min_{θ∈Θ} DJS(p*, pθ) ≤ c1 ε^2 + c2/√n. 33
  74. 74. Non-asymptotic bound on the JS divergence (H′ε) There exist ε > 0, m ∈ (0, 1/2) such that for all θ ∈ Θ, there exists D ∈ D with m ≤ D ≤ 1 − m and ‖D − D*θ‖_2 ≤ ε. Theorem Assume that, for some M > 0, p* ≤ M and pθ ≤ M for all θ ∈ Θ. Then, under Assumptions (Hreg) and (H′ε) with ε < 1/(2M), there exist two constants c1 > 0 (depending only upon m and M) and c2 such that E[DJS(p*, pθ̂)] − min_{θ∈Θ} DJS(p*, pθ) ≤ c1 ε^2 + c2/√n. Remarks • Under (Hreg), {L̂(θ, α) − L(θ, α)}_{θ∈Θ,α∈Λ} is a sub-Gaussian process for the metric ‖·‖/√n. • Dudley’s inequality: E sup_{θ∈Θ,α∈Λ} |L̂(θ, α) − L(θ, α)| = O(1/√n). • c2 scales as p + q =⇒ loose bound in the usual over-parametrized regime (LSUN, FACES: √n ≈ 1000 ≪ p + q ≈ 1 500 000). 33
  75. 75. Illustration Setting • p*(x) = e^{−x/s} / [s (1 + e^{−x/s})^2], x ∈ R: logistic density. • Gθ and Dα are two fully connected neural networks. • Z ∼ U([0, 1]): scalar noise. • n = 100000 (1/√n is negligible) and 30 replications. 34
  76. 76. Illustration Setting • p*(x) = e^{−x/s} / [s (1 + e^{−x/s})^2], x ∈ R: logistic density. • Gθ and Dα are two fully connected neural networks. • Z ∼ U([0, 1]): scalar noise. • n = 100000 (1/√n is negligible) and 30 replications. 34
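The setting above can be reproduced with a small fully connected GAN. The sketch below is my own (the slides do not specify the software; PyTorch, the network widths, learning rates and number of steps are all assumptions): it trains a scalar generator Gθ and discriminator Dα on logistic data by alternating gradient steps on the criterion L̂(θ, α).

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

def mlp(in_dim, out_dim, hidden=32, depth=3, last=None):
    """Small fully connected network; `last` is an optional output activation."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    if last is not None:
        layers.append(last)
    return nn.Sequential(*layers)

# G_theta: R -> R (scalar noise Z ~ U([0, 1])), D_alpha: R -> (0, 1).
G = mlp(1, 1, depth=3)
D = mlp(1, 1, depth=2, last=nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

s = 1.0                     # scale of the logistic density p*
batch, steps = 256, 5000

def sample_real(n):
    # Inverse transform sampling of the logistic distribution with scale s.
    u = torch.rand(n, 1).clamp(1e-6, 1 - 1e-6)
    return s * torch.log(u / (1.0 - u))

for step in range(steps):
    X = sample_real(batch)
    Z = torch.rand(batch, 1)

    # Discriminator ascent step on L_hat(theta, alpha).
    opt_D.zero_grad()
    loss_D = -(torch.log(D(X) + 1e-8).mean()
               + torch.log(1.0 - D(G(Z).detach()) + 1e-8).mean())
    loss_D.backward()
    opt_D.step()

    # Generator descent step on the same criterion.
    opt_G.zero_grad()
    loss_G = torch.log(1.0 - D(G(Z)) + 1e-8).mean()
    loss_G.backward()
    opt_G.step()

with torch.no_grad():
    fake = G(torch.rand(10000, 1))
print("generated std:", fake.std().item(),
      " target std:", s * math.pi / math.sqrt(3))
```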
  77. 77. Illustration Setting • Generator depth: 3. • Discriminator depth: 2 then 5. 35
  78. 78. Convergence of θ̂ (H′reg) Regularity conditions of order 2 on the models (Gθ, pθ and Dα). Existence Under (H′reg), θ̄ and ᾱ ∈ arg max_{α∈Λ} L(θ̄, α) exist. 36
  79. 79. Convergence of θ̂ (H′reg) Regularity conditions of order 2 on the models (Gθ, pθ and Dα). Existence Under (H′reg), θ̄ and ᾱ ∈ arg max_{α∈Λ} L(θ̄, α) exist. (H1) The pair (θ̄, ᾱ) is unique and belongs to int(Θ) × int(Λ). Theorem Under Assumptions (H′reg) and (H1), θ̂ →a.s. θ̄ and α̂ →a.s. ᾱ. 36
  80. 80. Convergence of θ̂ (H′reg) Regularity conditions of order 2 on the models (Gθ, pθ and Dα). Existence Under (H′reg), θ̄ and ᾱ ∈ arg max_{α∈Λ} L(θ̄, α) exist. (H1) The pair (θ̄, ᾱ) is unique and belongs to int(Θ) × int(Λ). Theorem Under Assumptions (H′reg) and (H1), θ̂ →a.s. θ̄ and α̂ →a.s. ᾱ. Remarks • Convergence of θ̂ comes from sup_{θ∈Θ,α∈Λ} |L̂(θ, α) − L(θ, α)| →a.s. 0. • It does not need uniqueness of ᾱ. • Convergence of α̂ comes from that of θ̂. 36
  81. 81. Illustration Setting • Three models: 1. Laplace: p*(x) = (1/3) e^{−2|x|/3} vs pθ(x) = (1/(√(2π) θ)) e^{−x^2/(2θ^2)}. 2. Claw: p*(x) = pclaw(x) vs pθ(x) = (1/(√(2π) θ)) e^{−x^2/(2θ^2)}. 3. Exponential: p*(x) = e^{−x} 1_{R+}(x) vs pθ(x) = (1/θ) 1_{[0,θ]}(x). • Gθ: generalized inverse of the cdf of pθ. • Z ∼ U([0, 1]): scalar noise. • Dα = pα1/(pα1 + pα0). • n = 10 to 10000 and 200 replications. 37
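As a sketch of the exponential-vs-uniform model (my own code, following the construction on the slide): the generator is the generalized inverse of the cdf of U([0, θ]), Gθ(z) = θz, and the discriminator has the form Dα = pα1/(pα1 + pα0); the exponential family used for the pα's here is my own choice for illustration, not necessarily the one used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# True density p*(x) = exp(-x) on R_+; model p_theta = U([0, theta]).
X = rng.exponential(scale=1.0, size=5000)

def G(theta, z):
    """Generalized inverse of the cdf of U([0, theta]): G_theta(z) = theta * z."""
    return theta * z

def D(alpha, x):
    """Discriminator D_alpha = p_alpha1 / (p_alpha1 + p_alpha0), with p_a(x) =
    (1/a) exp(-x/a) on R_+ (an assumed choice of parametric family)."""
    a1, a0 = alpha
    p1 = np.exp(-x / a1) / a1
    p0 = np.exp(-x / a0) / a0
    return p1 / (p1 + p0)

theta = 3.0
Z = rng.uniform(size=5000)
fake = G(theta, Z)

# Empirical criterion L_hat(theta, alpha) for one (theta, alpha) pair.
alpha = (1.0, 2.0)
L_hat = np.mean(np.log(D(alpha, X))) + np.mean(np.log(1.0 - D(alpha, fake)))
print("L_hat(theta, alpha) =", L_hat)
```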
  82. 82. Illustration [plots] Claw vs Gaussian; Exponential vs Uniform. 38
  83. 83. Central limit theorem (Hloc) Local smoothness conditions around (θ̄, ᾱ) (such that the Hessians are invertible). Theorem Under Assumptions (Hreg), (H1) and (Hloc), √n (θ̂ − θ̄) →d N(0, Σ). 39
  84. 84. Central limit theorem (Hloc) Local smoothness conditions around (θ̄, ᾱ) (such that the Hessians are invertible). Theorem Under Assumptions (Hreg), (H1) and (Hloc), √n (θ̂ − θ̄) →d N(0, Σ). Remark One has ‖Σ‖_2 = O(p^3 q^4), which suggests that θ̂ has a large dispersion around θ̄ in the over-parametrized regime. 39
  85. 85. Illustration [plots] Histograms of √n (θ̂ − θ̄): Claw vs Gaussian; Exponential vs Uniform. 40
  86. 86. Conclusion
  87. 87. Take-home message A first step for understanding GANs • From data to sampling. • The richness of the class of discriminators D controls the gap between GANs and the JS divergence. • The generator parameters θ are asymptotically normal with rate √n. 41
  88. 88. Take-home message A first step for understanding GANs • From data to sampling. • The richness of the class of discriminators D controls the gap between GANs and the JS divergence. • The generator parameters θ are asymptotically normal with rate √n. Future investigations 1. Impact of the latent variable Z (dimension, distribution) and the networks (number of layers in Gθ, dimensionality of Θ) on the performance of GANs (currently it is assumed that p* ≪ µ and pθ ≪ µ, which presupposes information on the supporting manifold of p*). 2. To what extent are Assumptions (Hε) and (H′ε) satisfied for neural nets as discriminators? 3. Over-parametrized regime: convergence of distributions instead of parameters. 41
