Maxime Sangnier (Sorbonne Université) at the International Workshop Machine Learning and Artificial Intelligence at Télécom ParisTech

- 1. GANs from a statistical point of view. Maxime Sangnier (Sorbonne Université, CNRS, LPSM, LIP6, Paris, France). International workshop Machine Learning & Artificial Intelligence, September 17, 2018. Joint work with Gérard Biau (Sorbonne Université, CNRS, LPSM, Paris, France), Benoît Cadre (ENS Rennes, Univ Rennes, CNRS, IRMAR, Rennes, France) and Ugo Tanielian (Sorbonne Université & Criteo, Paris, France).
- 2. Contributors: Gérard Biau (Sorbonne Université), Benoît Cadre (ENS Rennes), Ugo Tanielian (Sorbonne Université & Criteo).
- 4. Motivation: generative models aim at generating artificial content.
  • Images: merchandising; painting; art; super-resolution and denoising; text to image.
  • Movies: pose to movie.
  • Audio: speech synthesis; music.
- 7. Painting: Interactive GAN [J.-Y. Zhu et al. "Generative Visual Manipulation on the Natural Image Manifold". In: European Conference on Computer Vision. 2016].
- 8. Super-resolution: SRGAN [C. Ledig et al. "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network". In: arXiv:1609.04802 [cs, stat] (2016)].
- 9. Text-to-image: StackGAN [H. Zhang et al. "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks". In: arXiv:1612.03242 [cs, stat] (2016)].
- 10. Movies: Everybody Dance Now [C. Chan et al. "Everybody Dance Now". In: arXiv:1808.07371 [cs] (2018)].
- 11. Speech synthesis: WaveNet by DeepMind.
- 13. Motivation: generative models aim at generating artificial content.
  • Outstanding image generation and extrapolation [T. Karras et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation". In: International Conference on Learning Representations. 2018].
  • And even more I'm not aware of...
  Generative models are used for:
  • exploring unseen realities;
  • providing many answers to a single question.
- 16. Generate from data: $X_1, \dots, X_n$ i.i.d. according to an unknown density $p^\star$ on $E \subseteq \mathbb{R}^d$. How to sample according to $p^\star$?
  Naive approach:
  1. estimate $p^\star$ by $\hat{p}$;
  2. sample according to $\hat{p}$.
  Drawbacks:
  • both problems are difficult in themselves;
  • we cannot define a realistic parametric statistical model;
  • non-parametric density estimation is inefficient in high dimension;
  • this approach violates Vapnik's principle: "When solving a problem of interest, do not solve a more general problem as an intermediate step."
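The naive route can be made concrete in one dimension. Below is a minimal sketch of my own (not from the talk), assuming a Gaussian kernel density estimate with a hand-picked bandwidth: sampling from a KDE amounts to picking a data point uniformly and adding kernel noise.

```python
import numpy as np

def kde_sample(data, bandwidth, size, rng):
    """Sample from a Gaussian KDE of `data`: pick a data point uniformly,
    then add Gaussian noise of scale `bandwidth`."""
    idx = rng.integers(0, len(data), size=size)
    return data[idx] + rng.normal(scale=bandwidth, size=size)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=1000)            # stand-in for the unknown p*: N(2, 1)
fresh = kde_sample(x, bandwidth=0.3, size=5000, rng=rng)
print(abs(fresh.mean() - 2.0) < 0.2)          # the new sample mimics p*
```

In one dimension this works; in high dimension the bandwidth choice and the curse of dimensionality make the estimation step hopeless, which is exactly the drawback listed above.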
- 17. Some generative methods, compared along three axes: density-free, flexibility, simple sampling.
  • Autoregressive models (WaveNet [A. v. d. Oord et al. "WaveNet: A Generative Model for Raw Audio". In: arXiv:1609.03499 [cs] (2016)]).
  • Nonlinear independent components analysis (Real NVP [L. Dinh, J. Sohl-Dickstein, and S. Bengio. "Density estimation using Real NVP". In: arXiv:1605.08803 [cs, stat] (2016)]).
  • Variational autoencoders [D.P. Kingma and M. Welling. "Auto-Encoding Variational Bayes". In: International Conference on Learning Representations. 2013].
  • Boltzmann machines [S.E. Fahlman, G.E. Hinton, and T.J. Sejnowski. "Massively Parallel Architectures for AI: Netl, Thistle, and Boltzmann Machines". In: Proceedings of the Third AAAI Conference on Artificial Intelligence. 1983].
  • Generative stochastic networks [Y. Bengio et al. "Deep Generative Stochastic Networks Trainable by Backprop". In: International Conference on Machine Learning. 2014].
  • Generative adversarial networks.
- 21. A direct approach. Cornerstone: don't estimate $p^\star$.
  General procedure:
  • sample $U_1, \dots, U_n$ i.i.d. thanks to a parametric model;
  • compare $X_1, \dots, X_n$ and $U_1, \dots, U_n$ and update the model.
  GANs [I. Goodfellow et al. "Generative Adversarial Nets". In: Advances in Neural Information Processing Systems. 2014] follow this principle.
- 23. Generating a random sample
  Inverse transform sampling:
  • $S$: scalar random variable; $F_S$: cumulative distribution function of $S$; $Z \sim \mathcal{U}([0, 1])$.
  • Then $F_S^{-1}(Z) \stackrel{d}{=} S$.
  Generators:
  • $X_1, \dots, X_n$ i.i.d. according to a density $p^\star$ on $E \subseteq \mathbb{R}^d$, dominated by a known measure $\mu$.
  • $\mathcal{G} = \{G_\theta : \mathbb{R}^{d'} \to E\}_{\theta \in \Theta}$, $\Theta \subset \mathbb{R}^p$: parametric family of generators ($d' \ll d$).
  • $Z_1, \dots, Z_n$: random vectors from $\mathbb{R}^{d'}$ (typically $\mathcal{U}([0, 1]^{d'})$).
  • $U_i = G_\theta(Z_i)$: generated sample.
  • $\mathcal{P} = \{p_\theta\}_{\theta \in \Theta}$: associated family of densities, with by definition $G_\theta(Z_1) \sim p_\theta \, d\mu$.
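Inverse transform sampling is easy to check numerically. A short sketch with the Exp(1) distribution (my choice of example, not from the talk):

```python
import numpy as np

# Inverse transform sampling: if Z ~ U([0,1]) and F_S is the cdf of S,
# then F_S^{-1}(Z) has the same distribution as S.
# Example: Exp(1) has F(s) = 1 - exp(-s), hence F^{-1}(z) = -ln(1 - z).
rng = np.random.default_rng(0)
z = rng.uniform(size=100_000)
s = -np.log(1.0 - z)                 # an Exp(1) sample built from uniform noise
# Exp(1) has mean 1 and variance 1:
print(abs(s.mean() - 1.0) < 0.02, abs(s.var() - 1.0) < 0.05)
```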
- 27. Generating a random sample: remarks
  • Each $p_\theta$ is a candidate to represent $p^\star$.
  • The statistical model $\mathcal{P} = \{p_\theta\}_{\theta \in \Theta}$ is just a mathematical tool for the analysis.
  • It is not assumed that $p^\star$ belongs to $\mathcal{P}$.
  • In GANs: $G_\theta$ is a neural network with $p$ weights, stored in $\theta \in \mathbb{R}^p$.
- 30. Comparing two samples
  The next step:
  • The procedure should drive $\theta$ such that $G_\theta(Z_1) \stackrel{d}{=} X_1$.
  • We need to confront $G_\theta(Z_1), \dots, G_\theta(Z_n)$ with $X_1, \dots, X_n$ in order to update $\theta$.
  Supervised learning:
  • Both samples have the same distribution as soon as we cannot distinguish them.
  • This is a classification problem: class $Y = 0$: $G_\theta(Z_1), \dots, G_\theta(Z_n)$; class $Y = 1$: $X_1, \dots, X_n$.
- 35. Adversarial principle: discriminator
  • $\mathcal{D}$: a family of functions from $E$ to $[0, 1]$, the discriminators.
  • Choose $D \in \mathcal{D}$ such that for any $x \in E$:
    $D(x) \ge 1/2 \implies$ true observation, (1)
    $D(x) < 1/2 \implies$ fake (generated) point. (2)
  • Assume $\{(X_1, 1), \dots, (X_n, 1), (G_\theta(Z_1), 0), \dots, (G_\theta(Z_n), 0)\}$ i.i.d. with the same distribution as $(X, Y)$.
  • Classification model: $Y \mid X = x \sim \mathcal{B}(D(x))$, i.e. $\mathbb{P}(Y = 1 \mid X = x) = D(x)$.
  • Maximum (conditional) likelihood estimation:
    $\sup_{D \in \mathcal{D}} \prod_{i=1}^n D(X_i) \times \prod_{i=1}^n (1 - D(G_\theta(Z_i)))$, or equivalently $\sup_{D \in \mathcal{D}} \hat{L}(\theta, D)$, with
    $\hat{L}(\theta, D) = \frac{1}{n} \left[ \sum_{i=1}^n \ln(D(X_i)) + \sum_{i=1}^n \ln(1 - D(G_\theta(Z_i))) \right]$.
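The empirical criterion $\hat{L}(\theta, D)$ is straightforward to evaluate. A minimal sketch (toy samples and a vectorized discriminator are my own illustration):

```python
import numpy as np

def L_hat(D, x_real, x_fake):
    """Empirical GAN criterion
    L_hat = (1/n) [ sum ln D(X_i) + sum ln(1 - D(G_theta(Z_i))) ]."""
    return (np.log(D(x_real)).sum() + np.log(1.0 - D(x_fake)).sum()) / len(x_real)

x_real = np.array([0.1, 0.4, 0.9])                 # hypothetical true observations
x_fake = np.array([0.2, 0.5, 0.7])                 # hypothetical generated points
blind = lambda x: np.full_like(x, 0.5)             # a discriminator that cannot tell
# A blind discriminator D = 1/2 yields 2 ln(1/2) = -ln 4:
print(np.isclose(L_hat(blind, x_real, x_fake), -np.log(4.0)))
```

The value $-\ln 4$ for an undiscriminating $D$ is precisely the constant that reappears later in the link with the Jensen-Shannon divergence.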
- 38. Adversarial principle: generator
  • $\sup_{D \in \mathcal{D}} \hat{L}(\theta, D)$ acts like a divergence between the distributions of $G_\theta(Z_1), \dots, G_\theta(Z_n)$ and $X_1, \dots, X_n$.
  • Minimum divergence estimation: $\inf_{\theta \in \Theta} \sup_{D \in \mathcal{D}} \hat{L}(\theta, D)$, i.e.
    $\inf_{\theta \in \Theta} \sup_{D \in \mathcal{D}} \sum_{i=1}^n \ln(D(X_i)) + \sum_{i=1}^n \ln(1 - D(G_\theta(Z_i)))$.
  • This is an adversarial, minimax, or zero-sum game.
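The minimax game can be run end to end on a toy 1-D model. This is a sketch under assumed specifics (none from the talk): a location generator $G_\theta(z) = \theta + z$, a logistic discriminator $D_\alpha(x) = \sigma(a x + b)$, and hand-derived full-batch gradients, alternating ascent on $(a, b)$ and descent on $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=2000)        # real data: unknown mean 2

def sig(u):
    return 1.0 / (1.0 + np.exp(-u))

theta, a, b = 0.0, 0.0, 0.0
lr = 0.1
for _ in range(3000):
    g = theta + rng.normal(size=2000)     # generated sample G_theta(Z)
    dx, dg = sig(a * x + b), sig(a * g + b)
    # gradient ascent on the discriminator (maximize L_hat):
    a += lr * ((1 - dx) @ x - dg @ g) / 2000
    b += lr * ((1 - dx).sum() - dg.sum()) / 2000
    # gradient descent on the generator (minimize L_hat):
    # dL_hat/dtheta = -(a/n) sum D(G_theta(Z_i))
    theta += lr * a * dg.mean()
print(theta)                               # drifts toward the data mean (about 2)
```

At equilibrium the generator matches the data mean and the discriminator collapses to $D \equiv 1/2$, the "cannot distinguish" situation of the classification framing.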
- 39. The GAN Zoo: Avinash Hindupur's GitHub.
- 40. The GAN Zoo: curbing the discriminator
  • Least squares [X. Mao et al. "Least Squares Generative Adversarial Networks". In: IEEE International Conference on Computer Vision. 2017]:
    $\inf_{D \in \mathcal{D}} \sum_{i=1}^n (D(X_i) - 1)^2 + \sum_{i=1}^n D(G_\theta(Z_i))^2$, $\quad \inf_{\theta \in \Theta} \sum_{i=1}^n (D(G_\theta(Z_i)) - 1)^2$.
  • Asymmetric hinge [J. Zhao, M. Mathieu, and Y. LeCun. "Energy-based Generative Adversarial Network". In: International Conference on Learning Representations. 2017]:
    $\inf_{D \in \mathcal{D}} -\sum_{i=1}^n D(X_i) + \sum_{i=1}^n \max(0, 1 - D(G_\theta(Z_i)))$, $\quad \inf_{\theta \in \Theta} -\sum_{i=1}^n D(G_\theta(Z_i))$.
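The least-squares objectives are simple enough to write down directly; a sketch (the toy inputs are my own):

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # Least-squares discriminator loss: sum (D(X_i) - 1)^2 + sum D(G_theta(Z_i))^2
    return ((d_real - 1.0) ** 2).sum() + (d_fake ** 2).sum()

def lsgan_g_loss(d_fake):
    # Least-squares generator loss: sum (D(G_theta(Z_i)) - 1)^2
    return ((d_fake - 1.0) ** 2).sum()

# A perfect discriminator (1 on real, 0 on fake) pays zero loss,
# and the generator then pays the maximal penalty:
d_real, d_fake = np.ones(4), np.zeros(4)
print(lsgan_d_loss(d_real, d_fake))   # 0.0
print(lsgan_g_loss(d_fake))           # 4.0
```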
- 41. The GAN Zoo: metrics as minimax games
  • Maximum mean discrepancy [G.K. Dziugaite, D.M. Roy, and Z. Ghahramani. "Training generative neural networks via Maximum Mean Discrepancy optimization". In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. 2015; Y. Li, K. Swersky, and R. Zemel. "Generative Moment Matching Networks". In: International Conference on Machine Learning. 2015] and Wasserstein [M. Arjovsky, S. Chintala, and L. Bottou. "Wasserstein Generative Adversarial Networks". In: International Conference on Machine Learning. 2017]:
    $\inf_{\theta \in \Theta} \sup_{T \in \mathcal{T}} \int T p^\star \, d\mu - \int T p_\theta \, d\mu$.
  • f-divergences [S. Nowozin, B. Cseke, and R. Tomioka. "f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization". In: Neural Information Processing Systems. 2016]:
    $\inf_{\theta \in \Theta} \sup_{T \in \mathcal{T}} \int T p^\star \, d\mu - \int (f^\star \circ T) p_\theta \, d\mu$,
  with $\mathcal{T}$ a prescribed class of functions and $f^\star$ the convex conjugate of a lower-semicontinuous convex function $f$.
- 42. Roadmap
  • Minimum divergence estimation: uniqueness of minimizers.
  • Approximation properties: importance of the family of discriminators for the quality of the approximation.
  • Statistical analysis: consistency and rate of convergence.
- 44. Kullback-Leibler and Jensen divergences: Kullback-Leibler
  • For $P \ll Q$ probability measures on $E$: $D_{KL}(P \,\|\, Q) = \int \ln\!\left(\frac{dP}{dQ}\right) dP$.
  • Properties: $D_{KL}(P \,\|\, Q) \ge 0$ and $D_{KL}(P \,\|\, Q) = 0 \iff P = Q$.
  • If $p = \frac{dP}{d\mu}$ and $q = \frac{dQ}{d\mu}$: $D_{KL}(P \,\|\, Q) = \int p \ln\frac{p}{q} \, d\mu$.
  • $D_{KL}$ is not symmetric and is defined only for $P \ll Q$.
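These properties are easy to observe on discrete densities; a sketch (the two toy densities are my own):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence sum p ln(p/q) for discrete densities,
    with the convention 0 * ln(0/q) = 0; requires P << Q (q = 0 implies p = 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, p))              # 0.0: D_KL(P||P) = 0
print(kl(p, q) >= 0)         # True: D_KL >= 0
print(kl(p, q) == kl(q, p))  # False: D_KL is not symmetric
```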
- 46. Kullback-Leibler and Jensen divergences: Jensen-Shannon
  • For $P$ and $Q$ probability measures on $E$: $D_{JS}(P, Q) = \frac{1}{2} D_{KL}\!\left(P \,\middle\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\middle\|\, \frac{P+Q}{2}\right)$.
  • Property: $0 \le D_{JS}(P, Q) \le \ln 2$.
  • $(P, Q) \mapsto \sqrt{D_{JS}(P, Q)}$ is a distance.
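The bounds $0 \le D_{JS} \le \ln 2$ can be checked at the two extremes, identical distributions and disjoint supports; a self-contained discrete sketch:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: (1/2) D_KL(P||M) + (1/2) D_KL(Q||M), M = (P+Q)/2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(js(p, p))   # 0.0: identical distributions hit the lower bound
print(js(p, q))   # ln 2: disjoint supports hit the upper bound
```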
- 50. GAN and Jensen-Shannon divergence
  GANs:
  • Empirical criterion: $\hat{L}(\theta, D) = \frac{1}{n} \left[ \sum_{i=1}^n \ln(D(X_i)) + \sum_{i=1}^n \ln(1 - D(G_\theta(Z_i))) \right]$.
  • Problem: $\inf_{\theta \in \Theta} \sup_{D \in \mathcal{D}} \hat{L}(\theta, D)$.
  Ideal GANs:
  • Population version of the criterion: $L(\theta, D) = \int \ln(D) p^\star \, d\mu + \int \ln(1 - D) p_\theta \, d\mu$.
  • No constraint: $\mathcal{D}_\infty$, the set of all functions from $E$ to $[0, 1]$.
  • Problem: $\inf_{\theta \in \Theta} \sup_{D \in \mathcal{D}_\infty} L(\theta, D)$.
- 54. GAN and Jensen-Shannon divergence: from GAN to JS divergence
  • Criterion: $\sup_{D \in \mathcal{D}_\infty} L(\theta, D) = \sup_{D \in \mathcal{D}_\infty} \int [\ln(D) p^\star + \ln(1 - D) p_\theta] \, d\mu \le \int \sup_{t \in [0,1]} [\ln(t) p^\star + \ln(1 - t) p_\theta] \, d\mu$.
  • Optimal discriminator: $D^\star_\theta = \frac{p^\star}{p^\star + p_\theta}$, with the convention $0/0 = 0$.
  • Optimal criterion: $\sup_{D \in \mathcal{D}_\infty} L(\theta, D) = L(\theta, D^\star_\theta) = 2 D_{JS}(p^\star, p_\theta) - \ln 4$.
  • Problem: $\inf_{\theta \in \Theta} \sup_{D \in \mathcal{D}_\infty} L(\theta, D) = \inf_{\theta \in \Theta} L(\theta, D^\star_\theta) = 2 \inf_{\theta \in \Theta} D_{JS}(p^\star, p_\theta) - \ln 4$.
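The identity $L(\theta, D^\star_\theta) = 2 D_{JS}(p^\star, p_\theta) - \ln 4$ can be verified numerically on discrete densities; a sketch (the two toy densities are my own):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_star = np.array([0.5, 0.3, 0.2])
p_theta = np.array([0.2, 0.3, 0.5])

# Plug the optimal discriminator D* = p*/(p* + p_theta) into
# L(theta, D) = sum ln(D) p* + sum ln(1 - D) p_theta (discrete mu):
d_opt = p_star / (p_star + p_theta)
L = np.sum(np.log(d_opt) * p_star) + np.sum(np.log(1.0 - d_opt) * p_theta)

m = 0.5 * (p_star + p_theta)
js = 0.5 * kl(p_star, m) + 0.5 * kl(p_theta, m)
print(abs(L - (2 * js - np.log(4.0))) < 1e-12)   # True: L(theta, D*) = 2 JS - ln 4
```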
- 56. The quest for $D^\star_\theta$
  Numerical approach:
  • Big $n$, big $\mathcal{D}$: try to approximate $D^\star_\theta$ with $\arg\max_{D \in \mathcal{D}} \hat{L}(\theta, D)$.
  • Close to divergence minimization: $\sup_{D \in \mathcal{D}} \hat{L}(\theta, D) \approx 2 D_{JS}(p^\star, p_\theta) - \ln 4$.
  Theorem. Let $\theta \in \Theta$ and $A_\theta = \{p^\star = p_\theta = 0\}$. If $\mu(A_\theta) = 0$, then $\{D^\star_\theta\} = \arg\max_{D \in \mathcal{D}_\infty} L(\theta, D)$. If $\mu(A_\theta) > 0$, then $D^\star_\theta$ is unique only on $E \setminus A_\theta$.
  This completes Proposition 1 in Goodfellow et al., "Generative Adversarial Nets".
- 58. Oracle parameter
  • Oracle parameter with respect to the Jensen-Shannon divergence: $\theta^\star \in \arg\min_{\theta \in \Theta} L(\theta, D^\star_\theta) = \arg\min_{\theta \in \Theta} D_{JS}(p^\star, p_\theta)$.
  • $G_{\theta^\star}$ is the ideal generator.
  • If $p^\star \in \mathcal{P}$: $p^\star = p_{\theta^\star}$, $D_{JS}(p^\star, p_{\theta^\star}) = 0$ and $D^\star_{\theta^\star} = \frac{1}{2}$.
  • What if $p^\star \notin \mathcal{P}$? Existence and uniqueness of $\theta^\star$?
  Theorem. Assume that $\mathcal{P}$ is a convex and compact set for the JS distance. If $p^\star > 0$ $\mu$-almost everywhere, then there exists $\bar{p} \in \mathcal{P}$ such that $\{\bar{p}\} = \arg\min_{p \in \mathcal{P}} D_{JS}(p^\star, p)$. In addition, if the model $\mathcal{P}$ is identifiable, then there exists $\theta^\star \in \Theta$ such that $\{\theta^\star\} = \arg\min_{\theta \in \Theta} L(\theta, D^\star_\theta)$.
- 60. Oracle parameter
  Existence and uniqueness:
  • Compactness of $\mathcal{P}$ and continuity of $D_{JS}(p^\star, \cdot)$.
  • $p^\star > 0$ $\mu$-a.e. enables strict convexity of $D_{JS}(p^\star, \cdot)$.
  Compactness of $\mathcal{P}$ with respect to the JS distance:
  1. $\Theta$ compact and $\mathcal{P}$ convex.
  2. For all $x \in E$, $\theta \in \Theta \mapsto p_\theta(x)$ is continuous.
  3. $\sup_{(\theta, \theta') \in \Theta^2} |p_\theta \ln p_{\theta'}| \in L^1(\mu)$.
  Identifiability: the high-dimensional parametric setting is often misspecified $\implies$ identifiability is not satisfied.
- 64. From JS divergence to likelihood
  GAN ≠ JS divergence:
  • GANs don't minimize the Jensen-Shannon divergence.
  • Considering $\sup_{D \in \mathcal{D}_\infty} L(\theta, D)$ means knowing $D^\star_\theta = \frac{p^\star}{p^\star + p_\theta}$, thus knowing $p^\star$.
  Parametrized discriminators:
  • $\mathcal{D} = \{D_\alpha\}_{\alpha \in \Lambda}$, $\Lambda \subset \mathbb{R}^q$: parametric family of discriminators.
  • Likelihood-type problem with two parametric families: $\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} L(\theta, D_\alpha)$.
  • Likelihood parameter: $\bar{\theta} \in \arg\min_{\theta \in \Theta} \sup_{\alpha \in \Lambda} L(\theta, D_\alpha)$.
  • How close is the best candidate $p_{\bar\theta}$ to the ideal density $p_{\theta^\star}$?
  • How does this depend on the capability of $\mathcal{D}$ to approximate $D^\star_\theta$?
- 67. Approximation result
  (Hε) There exist $\varepsilon > 0$, $m \in (0, 1/2)$ and $D \in \mathcal{D} \cap L^2(\mu)$ such that $m \le D \le 1 - m$ and $\|D - D^\star_{\bar\theta}\|_2 \le \varepsilon$.
  Theorem. Assume that, for some $M > 0$, $p^\star \le M$ and $p_{\bar\theta} \le M$. Then, under Assumption (Hε) with $\varepsilon < 1/(2M)$, there exists a constant $c_1 > 0$ (depending only upon $m$ and $M$) such that
  $D_{JS}(p^\star, p_{\bar\theta}) - \min_{\theta \in \Theta} D_{JS}(p^\star, p_\theta) \le c_1 \varepsilon^2$.
  Remarks: as soon as the class $\mathcal{D}$ becomes richer,
  • minimizing $\sup_{\alpha \in \Lambda} L(\theta, D_\alpha)$ over $\Theta$ helps minimizing $D_{JS}(p^\star, p_\theta)$;
  • since, under some assumptions, $\{p_{\theta^\star}\} = \arg\min_{p_\theta : \theta \in \Theta} D_{JS}(p^\star, p_\theta)$, $p_{\bar\theta}$ comes closer to $p_{\theta^\star}$.
- 71. The estimation problem
  Estimator: $\hat{\theta} \in \arg\min_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \hat{L}(\theta, \alpha)$, where
  $\hat{L}(\theta, \alpha) = \frac{1}{n} \left[ \sum_{i=1}^n \ln(D_\alpha(X_i)) + \sum_{i=1}^n \ln(1 - D_\alpha(G_\theta(Z_i))) \right]$.
  (Hreg) Regularity conditions of order 1 on the models ($G_\theta$, $p_\theta$ and $D_\alpha$).
  Existence: under (Hreg), $\hat{\theta}$ exists (and so does $\bar{\theta}$).
  Questions:
  • How far is $D_{JS}(p^\star, p_{\hat\theta})$ from $\min_{\theta \in \Theta} D_{JS}(p^\star, p_\theta) = D_{JS}(p^\star, p_{\theta^\star})$?
  • Does $\hat{\theta}$ converge towards $\bar{\theta}$ as $n \to \infty$?
  • What is the asymptotic distribution of $\hat{\theta} - \bar{\theta}$?
- 74. Non-asymptotic bound on the JS divergence
  (H′ε) There exist $\varepsilon > 0$ and $m \in (0, 1/2)$ such that, for all $\theta \in \Theta$, there exists $D \in \mathcal{D}$ with $m \le D \le 1 - m$ and $\|D - D^\star_\theta\|_2 \le \varepsilon$.
  Theorem. Assume that, for some $M > 0$, $p^\star \le M$ and $p_\theta \le M$ for all $\theta \in \Theta$. Then, under Assumptions (Hreg) and (H′ε) with $\varepsilon < 1/(2M)$, there exist two constants $c_1 > 0$ (depending only upon $m$ and $M$) and $c_2$ such that
  $\mathbb{E}\, D_{JS}(p^\star, p_{\hat\theta}) - \min_{\theta \in \Theta} D_{JS}(p^\star, p_\theta) \le c_1 \varepsilon^2 + c_2 \frac{1}{\sqrt{n}}$.
  Remarks:
  • Under (Hreg), $\{\hat{L}(\theta, \alpha) - L(\theta, \alpha)\}_{\theta \in \Theta, \alpha \in \Lambda}$ is a subgaussian process for $\|\cdot\| / \sqrt{n}$.
  • Dudley's inequality: $\mathbb{E} \sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat{L}(\theta, \alpha) - L(\theta, \alpha)| = O(1/\sqrt{n})$.
  • $c_2$ scales as $p + q$ $\implies$ loose bound in the usual over-parametrized regime (LSUN, FACES: $\sqrt{n} \approx 1000 \ll p + q \approx 1\,500\,000$).
- 75. Illustration
  Setting:
  • $p^\star(x) = \frac{e^{-x/s}}{s (1 + e^{-x/s})^2}$, $x \in \mathbb{R}$: logistic density.
  • $G_\theta$ and $D_\alpha$ are two fully connected neural networks.
  • $Z \sim \mathcal{U}([0, 1])$: scalar noise.
  • $n = 100\,000$ (so $1/\sqrt{n}$ is negligible) and 30 replications.
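For this logistic target the generator of the earlier slides has a closed form, since the cdf is $F(x) = 1/(1 + e^{-x/s})$, so the quantile function is $G(z) = s \ln(z/(1-z))$. A sketch with $s = 1$ (the slide does not specify $s$):

```python
import numpy as np

rng = np.random.default_rng(0)
s = 1.0
z = rng.uniform(size=100_000)       # Z ~ U([0,1]), scalar noise
x = s * np.log(z / (1.0 - z))       # quantile function of the logistic(0, s) law
# Logistic(0, s) has mean 0 and variance (pi * s)^2 / 3:
print(abs(x.mean()) < 0.05, abs(x.var() - (np.pi * s) ** 2 / 3) < 0.1)
```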
- 77. Illustration. Setting: generator depth 3; discriminator depth 2, then 5.
- 80. Convergence of $\hat{\theta}$
  (H′reg) Regularity conditions of order 2 on the models ($G_\theta$, $p_\theta$ and $D_\alpha$).
  Existence: under (H′reg), $\bar{\theta}$ and $\bar{\alpha} \in \arg\max_{\alpha \in \Lambda} L(\bar{\theta}, \alpha)$ exist.
  (H1) The pair $(\bar{\theta}, \bar{\alpha})$ is unique and belongs to $\mathrm{int}(\Theta) \times \mathrm{int}(\Lambda)$.
  Theorem. Under Assumptions (H′reg) and (H1), $\hat{\theta} \xrightarrow{a.s.} \bar{\theta}$ and $\hat{\alpha} \xrightarrow{a.s.} \bar{\alpha}$.
  Remarks:
  • Convergence of $\hat{\theta}$ comes from $\sup_{\theta \in \Theta, \alpha \in \Lambda} |\hat{L}(\theta, \alpha) - L(\theta, \alpha)| \xrightarrow{a.s.} 0$.
  • It does not need uniqueness of $\bar{\alpha}$.
  • Convergence of $\hat{\alpha}$ comes from that of $\hat{\theta}$.
- 81. Illustration
  Setting: three models.
  1. Laplace: $p^\star(x) = \frac{1}{3} e^{-2|x|/3}$ vs $p_\theta(x) = \frac{1}{\sqrt{2\pi}\,\theta} e^{-\frac{x^2}{2\theta^2}}$.
  2. Claw: $p^\star(x) = p_{\mathrm{claw}}(x)$ vs $p_\theta(x) = \frac{1}{\sqrt{2\pi}\,\theta} e^{-\frac{x^2}{2\theta^2}}$.
  3. Exponential: $p^\star(x) = e^{-x} \mathbf{1}_{\mathbb{R}_+}(x)$ vs $p_\theta(x) = \frac{1}{\theta} \mathbf{1}_{[0,\theta]}(x)$.
  • $G_\theta$: generalized inverse of the cdf of $p_\theta$.
  • $Z \sim \mathcal{U}([0, 1])$: scalar noise.
  • $D_\alpha = \frac{p_{\alpha_1}}{p_{\alpha_1} + p_{\alpha_0}}$.
  • $n = 10$ to $10\,000$ and 200 replications.
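The discriminator family $D_\alpha = p_{\alpha_1} / (p_{\alpha_1} + p_{\alpha_0})$ can be sketched with the Gaussian model of cases 1-2, taking the two scales $\alpha_0, \alpha_1$ as parameters (my own minimal instantiation):

```python
import numpy as np

def p_gauss(x, theta):
    # The model density p_theta(x) = exp(-x^2 / (2 theta^2)) / (sqrt(2 pi) theta)
    return np.exp(-x ** 2 / (2 * theta ** 2)) / (np.sqrt(2 * np.pi) * theta)

def D(x, a0, a1):
    # Discriminator D_alpha = p_{alpha_1} / (p_{alpha_1} + p_{alpha_0})
    return p_gauss(x, a1) / (p_gauss(x, a1) + p_gauss(x, a0))

x = np.linspace(-3, 3, 101)
print(np.all((D(x, 1.0, 2.0) >= 0) & (D(x, 1.0, 2.0) <= 1)))  # a valid discriminator
print(np.allclose(D(x, 1.5, 1.5), 0.5))                        # alpha_0 = alpha_1 => D = 1/2
```

When the two parameters coincide, the discriminator is blind ($D \equiv 1/2$), which is exactly the equilibrium situation $D^\star_{\theta^\star} = 1/2$ of the well-specified case.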
- 82. Illustration: Claw vs Gaussian; Exponential vs Uniform.
- 84. Central limit theorem
  (Hloc) Local smoothness conditions around $(\bar{\theta}, \bar{\alpha})$ (such that the Hessians are invertible).
  Theorem. Under Assumptions (H′reg), (H1) and (Hloc), $\sqrt{n}(\hat{\theta} - \bar{\theta}) \xrightarrow{d} \mathcal{N}(0, \Sigma)$.
  Remark: one has $\|\Sigma\|_2 = O(p^3 q^4)$, which suggests that $\hat{\theta}$ has a large dispersion around $\bar{\theta}$ in the over-parametrized regime.
- 85. Illustration: histograms of $\sqrt{n}(\hat{\theta} - \bar{\theta})$. Claw vs Gaussian; Exponential vs Uniform.
- 86. Conclusion
- 88. Take-home message
  A first step towards understanding GANs:
  • From data to sampling.
  • The richness of the class of discriminators $\mathcal{D}$ controls the gap between GANs and the JS divergence.
  • The generator parameters $\theta$ are asymptotically normal with rate $\sqrt{n}$.
  Future investigations:
  1. Impact of the latent variable $Z$ (dimension, distribution) and of the networks (number of layers in $G_\theta$, dimensionality of $\Theta$) on the performance of GANs (currently it is assumed that $p^\star \ll \mu$ and $p_\theta \ll \mu$, i.e. information on the supporting manifold of $p^\star$).
  2. To what extent are Assumptions (Hε) and (H′ε) satisfied for neural nets as discriminators?
  3. Over-parametrized regime: convergence of distributions instead of parameters.