What can a statistician expect from GANs?

GANs from a statistical point of view
Maxime Sangnier
International workshop Machine Learning & Artificial Intelligence
September 17, 2018
Sorbonne Université, CNRS, LPSM, LIP6, Paris, France
Joint work with Gérard Biau1, Benoît Cadre2 and Ugo Tanielian1,3
1 Sorbonne Université, CNRS, LPSM, Paris, France
2 ENS Rennes, Univ Rennes, CNRS, IRMAR, Rennes, France
3 Criteo, Paris, France
Contributors
Gérard Biau (Sorbonne Université) Benoît Cadre (ENS Rennes)
Ugo Tanielian (Sorbonne Université & Criteo) 1
Generative models
Motivation
Generative models aim at generating artificial content.
• Images:
• merchandising;
• painting;
• art;
• super-resolution and denoising;
• text to image.
• Movies:
• pose to movie;
• Audio:
• speech synthesis;
• music.
2
Merchandising
vue.ai
3
Art
prisma-ai.com
4
Painting
Interactive GAN.1
1
J.-Y. Zhu et al. “Generative Visual Manipulation on the Natural Image Manifold”. In: European
Conference on Computer Vision. 2016.
5
Super-resolution
Super-resolution GAN.2
2
C. Ledig et al. “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial
Network”. In: arXiv:1609.04802 [cs, stat] (2016).
6
Text-to-image
Stacked GAN.3
3
H. Zhang et al. “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative
Adversarial Networks”. In: arXiv:1612.03242 [cs, stat] (2016).
7
Movies
Everybody Dance Now.4
4
C. Chan et al. “Everybody Dance Now”. In: arXiv:1808.07371 [cs] (2018).
8
Speech synthesis
WaveNet by DeepMind.
9
Motivation
Generative models aim at generating artificial content.
• Outstanding image generation and extrapolation5.
• And even more I’m not aware of. . .
Generative models are used for:
• exploring unseen realities;
• providing many answers to a single question.
5 T. Karras et al. “Progressive Growing of GANs for Improved Quality, Stability, and Variation”. In: International Conference on Learning Representations. 2018.
10
Generate from data
X1, . . . , Xn i.i.d. according to an unknown density p⋆ on E ⊆ Rd.
How to sample according to p⋆?
Naive approach
1. estimate p⋆ by ˆp;
2. sample according to ˆp.
Drawbacks
• both problems are difficult in themselves;
• we cannot define a realistic parametric statistical model;
• non-parametric density estimation is inefficient in high dimension;
• this approach violates Vapnik’s principle:
When solving a problem of interest, do not solve a more general problem as an intermediate step.
11
Some generative methods
Methods compared with respect to three criteria: density-free, flexibility, simple sampling.
• Autoregressive models (WaveNet6)
• Nonlinear independent components analysis (Real NVP7)
• Variational autoencoders8
• Boltzmann machines9
• Generative stochastic networks10
• Generative adversarial networks
6 A.v.d. Oord et al. “WaveNet: A Generative Model for Raw Audio”. In: arXiv:1609.03499 [cs] (2016).
7 L. Dinh, J. Sohl-Dickstein, and S. Bengio. “Density estimation using Real NVP”. In: arXiv:1605.08803 [cs, stat] (2016).
8 D.P. Kingma and M. Welling. “Auto-Encoding Variational Bayes”. In: International Conference on Learning Representations. 2013.
9 S.E. Fahlman, G.E. Hinton, and T.J. Sejnowski. “Massively Parallel Architectures for AI: Netl, Thistle, and Boltzmann Machines”. In: Proceedings of the Third AAAI Conference on Artificial Intelligence. 1983.
10 Y. Bengio et al. “Deep Generative Stochastic Networks Trainable by Backprop”. In: International Conference on Machine Learning. 2014.
12
Generative adversarial models
A direct approach
Cornerstone: don’t estimate p⋆.
General procedure:
• sample U1, . . . , Un i.i.d. thanks to a parametric model;
• compare X1, . . . , Xn and U1, . . . , Un and update the model.
GANs11 follow this principle.
11 I. Goodfellow et al. “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. 2014.
13
Generating a random sample
Inverse transform sampling
• S: scalar random variable;
• FS: cumulative distribution function of S;
• Z ∼ U([0, 1]);
• F_S^{-1}(Z) =_d S (equality in distribution).
Generators
• X1, . . . , Xn i.i.d. according to a density p⋆ on E ⊆ Rd, dominated by a known measure µ.
• G = {Gθ : Rd′ → E}θ∈Θ, Θ ⊂ Rp: parametric family of generators (d′ ≤ d);
• Z1, . . . , Zn: random vectors from Rd′ (typically U([0, 1]d′));
• Ui = Gθ(Zi): generated sample;
• P = {pθ}θ∈Θ: associated family of densities, defined by Gθ(Z1) ∼ pθ dµ.
(A numerical sketch of both ideas follows.)
14
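To make the push-forward idea concrete, here is a minimal numpy sketch of inverse transform sampling and of a one-parameter generator. The exponential example and the function names (G, etc.) are illustrative choices for this sketch, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Inverse transform sampling: if Z ~ U([0, 1]) and F_S is the cdf of S,
# then F_S^{-1}(Z) has the same distribution as S.
# Illustrative choice: S ~ Exp(1), so F_S^{-1}(z) = -ln(1 - z).
z = rng.uniform(size=n)
s = -np.log(1.0 - z)
print("empirical mean of F^{-1}(Z):", s.mean())   # should be close to 1

# A "generator" in the GAN sense is the same idea with a parametric map G_theta:
# feed low-dimensional noise Z through G_theta and use G_theta(Z) as the sample.
def G(theta, z):
    """Toy one-parameter generator: inverse cdf of Exp(theta)."""
    return -np.log(1.0 - z) / theta

u = G(2.0, rng.uniform(size=n))   # generated sample U_i = G_theta(Z_i)
print("empirical mean of G_2(Z):", u.mean())      # close to 1/2 for Exp(2)
```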
Generating a random sample
Remarks
• Each pθ is a candidate to represent p⋆.
• The statistical model P = {pθ}θ∈Θ is just a mathematical tool for the analysis.
• It is not assumed that p⋆ belongs to P.
• In GANs: Gθ is a neural network with p weights, stored in θ ∈ Rp.
15
Comparing two samples
The next step
• The procedure should drive θ such that Gθ(Z1) =_d X1.
• Need to confront Gθ(Z1), . . . , Gθ(Zn) with X1, . . . , Xn in order to update θ.
Supervised learning
• Both samples have the same distribution as soon as we cannot distinguish them.
• This is a classification problem:
Class Y = 0: Gθ(Z1), . . . , Gθ(Zn)
Class Y = 1: X1, . . . , Xn
16
Adversarial principle
Discriminator
• D a family of functions from E to [0, 1]: the discriminators.
• Choose D ∈ D such that for any x ∈ E,
D(x) ≥ 1/2 =⇒ true observation, (1)
D(x) < 1/2 =⇒ fake (generated) point. (2)
• Assume {(X1, 1), . . . , (Xn, 1), (Gθ(Z1), 0), . . . , (Gθ(Zn), 0)} i.i.d. with same distribution as (X, Y).
• Classification model: Y|X = x ∼ B(D(x)), i.e. P(Y = 1|X = x) = D(x).
• Maximum (conditional) likelihood estimation:
sup_{D∈D} Π_{i=1}^n D(Xi) × Π_{i=1}^n (1 − D(Gθ(Zi)))   or equivalently   sup_{D∈D} ˆL(θ, D),
with
ˆL(θ, D) = (1/n) [ Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))) ].
(An empirical evaluation of ˆL is sketched below.)
17
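As a concrete illustration of the criterion above, the following numpy sketch evaluates ˆL(θ, D) for fixed samples and a hand-made sigmoid discriminator. The data-generating choices and the function empirical_criterion are assumptions made for this example only.

```python
import numpy as np

def empirical_criterion(D, x_real, x_fake):
    """Empirical GAN criterion
    L_hat(theta, D) = (1/n) * [ sum_i ln D(X_i) + sum_i ln(1 - D(G_theta(Z_i))) ],
    evaluated for a fixed discriminator D and fixed samples."""
    return (np.log(D(x_real)).sum() + np.log(1.0 - D(x_fake)).sum()) / len(x_real)

# Toy check (illustrative, not the talk's experiment): real data N(0, 1),
# fake data N(1, 1), and a hand-made sigmoid discriminator.
rng = np.random.default_rng(1)
x_real = rng.normal(0.0, 1.0, size=1000)
x_fake = rng.normal(1.0, 1.0, size=1000)   # stands in for G_theta(Z_i)

def D(x):
    # Sigmoid of a linear score; D(x) is close to 1 near the real mode (x = 0).
    return 1.0 / (1.0 + np.exp(x - 0.5))

print(empirical_criterion(D, x_real, x_fake))
```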
Adversarial principle
Generator
• sup_{D∈D} ˆL(θ, D) acts like a divergence between the distributions of Gθ(Z1), . . . , Gθ(Zn) and X1, . . . , Xn.
• Minimum divergence estimation:
inf_{θ∈Θ} sup_{D∈D} ˆL(θ, D),
or
inf_{θ∈Θ} sup_{D∈D} Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))).
• Adversarial, minimax or zero-sum game (a minimal training-loop sketch follows).
18
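In practice, this inf-sup problem is attacked by alternating stochastic-gradient updates on the generator and discriminator parameters. The following PyTorch sketch shows the generic training loop on one-dimensional toy data; the architectures, learning rates, data distribution and the non-saturating generator loss are standard but illustrative assumptions, not the experimental setup of this work.

```python
import torch
from torch import nn

torch.manual_seed(0)
n, latent_dim, steps = 512, 1, 2000

# Toy "real" data (assumption for illustration): a 1-D Gaussian N(3, 1).
def sample_real(n):
    return 3.0 + torch.randn(n, 1)

# Generator G_theta: R^{d'} -> R, and discriminator D_alpha: R -> [0, 1].
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(steps):
    x = sample_real(n)
    z = torch.rand(n, latent_dim)          # latent noise Z ~ U([0, 1]^{d'})
    fake = G(z)

    # Discriminator step: maximize L_hat(theta, D), i.e. minimize the
    # binary cross-entropy of the real-vs-fake classification problem.
    opt_D.zero_grad()
    loss_D = bce(D(x), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
    loss_D.backward()
    opt_D.step()

    # Generator step: decrease ln(1 - D(G(z))), here via the usual
    # "non-saturating" surrogate, i.e. maximize ln D(G(z)).
    opt_G.zero_grad()
    loss_G = bce(D(G(z)), torch.ones(n, 1))
    loss_G.backward()
    opt_G.step()

print("mean of generated sample:", G(torch.rand(2000, latent_dim)).mean().item())
```

The .detach() call is what implements the alternation: the discriminator step treats the generated points as fixed data, while the generator step differentiates through G.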
The GAN Zoo
Avinash Hindupur’s Github. 19
The GAN Zoo
Curbing the discriminator
• least squares12:
inf_{D∈D} Σ_{i=1}^n (D(Xi) − 1)² + Σ_{i=1}^n D(Gθ(Zi))²,   inf_{θ∈Θ} Σ_{i=1}^n (D(Gθ(Zi)) − 1)².
• asymmetric hinge13:
inf_{D∈D} −Σ_{i=1}^n D(Xi) + Σ_{i=1}^n max(0, 1 − D(Gθ(Zi))),   inf_{θ∈Θ} −Σ_{i=1}^n D(Gθ(Zi)).
12 X. Mao et al. “Least Squares Generative Adversarial Networks”. In: IEEE International Conference on Computer Vision. 2017.
13 J. Zhao, M. Mathieu, and Y. LeCun. “Energy-based Generative Adversarial Network”. In: International Conference on Learning Representations. 2017.
20
The GAN Zoo
Metrics as minimax games
• Maximum mean discrepancy14 and Wasserstein15:
inf_{θ∈Θ} sup_{T∈T} ∫ T p⋆ dµ − ∫ T pθ dµ.
• f-divergences16:
inf_{θ∈Θ} sup_{T∈T} ∫ T p⋆ dµ − ∫ (f⋆ ◦ T) pθ dµ,
with T a prescribed class of functions and f⋆ the convex conjugate of a lower-semicontinuous function f.
(An empirical sketch of this type of criterion follows.)
14 G.K. Dziugaite, D.M. Roy, and Z. Ghahramani. “Training generative neural networks via Maximum Mean Discrepancy optimization”. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. 2015; Y. Li, K. Swersky, and R. Zemel. “Generative Moment Matching Networks”. In: International Conference on Machine Learning. 2015.
15 M. Arjovsky, S. Chintala, and L. Bottou. “Wasserstein Generative Adversarial Networks”. In: International Conference on Machine Learning. 2017.
16 S. Nowozin, B. Cseke, and R. Tomioka. “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”. In: Neural Information Processing Systems. June 2016.
21
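For intuition, here is a hedged numpy sketch of the empirical counterpart of the IPM-type criterion sup_{T∈T} ∫ T p⋆ dµ − ∫ T pθ dµ. The finite class of test functions below is a crude stand-in chosen for this sketch only, not the 1-Lipschitz ball of the Wasserstein GAN nor the RKHS unit ball of MMD.

```python
import numpy as np

rng = np.random.default_rng(2)
x_real = rng.normal(0.0, 1.0, size=5000)   # sample from p_star (toy choice)
x_fake = rng.normal(0.5, 1.2, size=5000)   # sample from p_theta (toy choice)

# Empirical version of  sup_{T in T}  E[T(X)] - E[T(G_theta(Z))],
# with T a small, hand-picked class of bounded test functions.
test_class = [np.tanh, np.sin,
              lambda x: np.clip(x, -1.0, 1.0),
              lambda x: np.tanh(2 * x - 1)]

def ipm(x, u, functions):
    return max(np.mean(T(x)) - np.mean(T(u)) for T in functions)

print("empirical IPM-type discrepancy:", ipm(x_real, x_fake, test_class))
```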
Roadmap
• Minimum divergence estimation: uniqueness of minimizers.
• Approximation properties: impact of the family of discriminators on the quality of the approximation.
• Statistical analysis: consistency and rate of convergence.
22
Minimum divergence estimation
Kullback-Leibler and Jensen divergences
Kullback-Leibler
• For P ≪ Q probability measures on E:
DKL(P‖Q) = ∫ ln(dP/dQ) dP.
• Properties:
DKL(P‖Q) ≥ 0,   DKL(P‖Q) = 0 ⇐⇒ P = Q.
• If p = dP/dµ and q = dQ/dµ:
DKL(P‖Q) = ∫ p ln(p/q) dµ.
• DKL is not symmetric and defined only for P ≪ Q.
23
Kullback-Leibler and Jensen divergences
Jensen-Shannon
• For P and Q probability measures on E:
DJS(P, Q) = (1/2) DKL(P ‖ (P + Q)/2) + (1/2) DKL(Q ‖ (P + Q)/2).
• Property:
0 ≤ DJS(P, Q) ≤ ln 2.
• (P, Q) → √DJS(P, Q) is a distance.
(A numerical sketch of DKL and DJS follows.)
24
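Both divergences are easy to evaluate numerically for one-dimensional densities. The following numpy sketch uses an illustrative pair of Gaussian densities and a simple Riemann sum, and checks in particular that 0 ≤ DJS ≤ ln 2; the grid and the densities are assumptions of this sketch.

```python
import numpy as np

# Numerical KL and JS divergences between two densities on a common grid
# (Riemann-sum integration; purely illustrative).
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

def kl(p, q):
    mask = p > 0                      # convention 0 * ln(0/q) = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)               # N(0, 1)
q = np.exp(-0.5 * (x - 2)**2 / 4) / np.sqrt(8 * np.pi)     # N(2, 4)

print("D_KL(p || q) =", kl(p, q))
print("D_JS(p, q)   =", js(p, q), "<= ln 2 =", np.log(2))
```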
GAN and Jensen-Shannon divergence
GANs
• Empirical criterion:
ˆL(θ, D) = (1/n) [ Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))) ].
• Problem:
inf_{θ∈Θ} sup_{D∈D} ˆL(θ, D).
Ideal GANs
• Population version of the criterion:
L(θ, D) = ∫ ln(D) p⋆ dµ + ∫ ln(1 − D) pθ dµ.
• No constraint: D = D∞, the set of all functions from E to [0, 1].
• Problem:
inf_{θ∈Θ} sup_{D∈D∞} L(θ, D).
25
GAN and Jensen-Shannon divergence
From GAN to JS divergence
• Criterion:
sup_{D∈D∞} L(θ, D) = sup_{D∈D∞} ∫ [ln(D) p⋆ + ln(1 − D) pθ] dµ ≤ ∫ sup_{D∈D∞} [ln(D) p⋆ + ln(1 − D) pθ] dµ.
• Optimal discriminator:
D⋆θ = p⋆ / (p⋆ + pθ),
with the convention 0/0 = 0.
• Optimal criterion:
sup_{D∈D∞} L(θ, D) = L(θ, D⋆θ) = 2 DJS(p⋆, pθ) − ln 4.
• Problem:
inf_{θ∈Θ} sup_{D∈D∞} L(θ, D) = inf_{θ∈Θ} L(θ, D⋆θ) = 2 inf_{θ∈Θ} DJS(p⋆, pθ) − ln 4.
(A numerical check of these identities is sketched below.)
26
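These identities can be checked numerically: the numpy sketch below evaluates L(θ, D) on a grid for the optimal discriminator p⋆/(p⋆ + pθ) and compares it with 2 DJS(p⋆, pθ) − ln 4. The two Gaussian densities and the grid are illustrative assumptions of this sketch.

```python
import numpy as np

x = np.linspace(-12, 12, 40001)
dx = x[1] - x[0]

p_star = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)              # true density (toy)
p_theta = np.exp(-0.5 * (x - 1)**2) / np.sqrt(2 * np.pi)       # candidate density (toy)

def L(D):
    """Population criterion L(theta, D) = int ln(D) p_star dmu + int ln(1-D) p_theta dmu."""
    return np.sum(np.log(D) * p_star + np.log(1.0 - D) * p_theta) * dx

def kl(p, q):
    mask = p > 0                      # convention 0 * ln(0/q) = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

D_opt = p_star / (p_star + p_theta)          # optimal discriminator p*/(p* + p_theta)
print("L(theta, D_opt)          :", L(D_opt))
print("2 D_JS(p*, p_theta) - ln4:", 2 * js(p_star, p_theta) - np.log(4))  # should match
print("a sub-optimal D == 1/2   :", L(np.full_like(x, 0.5)))              # equals -ln 4, lower
```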
The quest for D⋆θ
Numerical approach
• Big n, big D: try to approximate D⋆θ with arg max_{D∈D} ˆL(θ, D).
• Close to divergence minimization: sup_{D∈D} ˆL(θ, D) ≈ 2 DJS(p⋆, pθ) − ln 4.
Theorem
Let θ ∈ Θ and Aθ = {p⋆ = pθ = 0}.
If µ(Aθ) = 0, then {D⋆θ} = arg max_{D∈D∞} L(θ, D).
If µ(Aθ) > 0, then D⋆θ is unique only on E \ Aθ.
Completes Proposition 1 in17.
17 Goodfellow et al., “Generative Adversarial Nets”.
27
Oracle parameter
• Oracle parameter regarding the Jensen-Shannon divergence:
θ⋆ ∈ arg min_{θ∈Θ} L(θ, D⋆θ) = arg min_{θ∈Θ} DJS(p⋆, pθ).
• Gθ⋆ is the ideal generator.
• If p⋆ ∈ P,
p⋆ = pθ⋆,   DJS(p⋆, pθ⋆) = 0,   D⋆θ⋆ = 1/2.
• What if p⋆ ∉ P? Existence and uniqueness of θ⋆?
Theorem
Assume that P is a convex and compact set for the JS distance.
If p⋆ > 0 µ-almost everywhere, then there exists ¯p ∈ P such that
{¯p} = arg min_{p∈P} DJS(p⋆, p).
In addition, if the model P is identifiable, then there exists θ⋆ ∈ Θ such that
{θ⋆} = arg min_{θ∈Θ} L(θ, D⋆θ).
28
Oracle parameter
Existence and uniqueness
• Compactness of P and continuity of DJS(p⋆, ·).
• p⋆ > 0 µ-a.e. enables strict convexity of DJS(p⋆, ·).
Compactness of P with respect to the JS distance
1. Θ compact and P convex.
2. For all x ∈ E, the map θ ∈ Θ ↦ pθ(x) is continuous.
3. sup_{(θ,θ′)∈Θ²} |pθ ln pθ′| ∈ L¹(µ).
Identifiability
A high-dimensional parametric setting is often misspecified =⇒ identifiability is not satisfied.
29
Approximation properties
From JS divergence to likelihood
GAN ≠ JS divergence
• GANs don’t minimize the Jensen-Shannon divergence.
• Considering sup_{D∈D∞} L(θ, D) means knowing D⋆θ = p⋆/(p⋆ + pθ), thus knowing p⋆.
Parametrized discriminators
• D = {Dα}α∈Λ, Λ ⊂ Rq: parametric family of discriminators.
• Likelihood-type problem with two parametric families:
inf_{θ∈Θ} sup_{α∈Λ} L(θ, Dα).
• Likelihood parameter:
¯θ ∈ arg min_{θ∈Θ} sup_{α∈Λ} L(θ, Dα).
• How close is the best candidate p¯θ to the ideal density pθ⋆?
• How does it depend on the capability of D to approximate the optimal discriminators D⋆θ?
30
Approximation result
(Hε) There exist ε > 0, m ∈ (0, 1/2) and D ∈ D ∩ L²(µ) such that
m ≤ D ≤ 1 − m and ‖D − D⋆¯θ‖2 ≤ ε.
Theorem
Assume that, for some M > 0, p⋆ ≤ M and p¯θ ≤ M.
Then, under Assumption (Hε) with ε < 1/(2M), there exists a constant c1 > 0 (depending only upon m and M) such that
DJS(p⋆, p¯θ) − min_{θ∈Θ} DJS(p⋆, pθ) ≤ c1 ε².
Remarks
As soon as the class D becomes richer:
• minimizing sup_{α∈Λ} L(θ, Dα) over Θ helps to minimize DJS(p⋆, pθ);
• since under some assumptions {pθ⋆} = arg min_{pθ: θ∈Θ} DJS(p⋆, pθ), p¯θ comes closer to pθ⋆.
31
Statistical analysis
The estimation problem
Estimator
ˆθ ∈ arg min_{θ∈Θ} sup_{α∈Λ} ˆL(θ, α),
where
ˆL(θ, α) = (1/n) [ Σ_{i=1}^n ln(Dα(Xi)) + Σ_{i=1}^n ln(1 − Dα(Gθ(Zi))) ].
(Hreg) Regularity conditions of order 1 on the models (Gθ, pθ and Dα).
Existence
Under (Hreg), ˆθ exists (and so does ¯θ).
Questions
• How far is DJS(p⋆, pˆθ) from min_{θ∈Θ} DJS(p⋆, pθ) = DJS(p⋆, pθ⋆)?
• Does ˆθ converge towards ¯θ as n → ∞?
• What is the asymptotic distribution of ˆθ − ¯θ?
32
Non-asymptotic bound on the JS divergence
(H′ε) There exist ε > 0 and m ∈ (0, 1/2) such that for all θ ∈ Θ, there exists
D ∈ D with m ≤ D ≤ 1 − m and ‖D − D⋆θ‖2 ≤ ε.
Theorem
Assume that, for some M > 0, p⋆ ≤ M and pθ ≤ M for all θ ∈ Θ.
Then, under Assumptions (Hreg) and (H′ε) with ε < 1/(2M), there exist two constants c1 > 0 (depending only upon m and M) and c2 such that
E[DJS(p⋆, pˆθ)] − min_{θ∈Θ} DJS(p⋆, pθ) ≤ c1 ε² + c2 / √n.
Remarks
• Under (Hreg), {ˆL(θ, α) − L(θ, α)}θ∈Θ,α∈Λ is a subgaussian process with respect to ‖·‖/√n.
• Dudley’s inequality: E sup_{θ∈Θ,α∈Λ} |ˆL(θ, α) − L(θ, α)| = O(1/√n).
• c2 scales as p + q =⇒ loose bound in the usual over-parametrized regime (LSUN, FACES: √n ≈ 1000 ≪ p + q ≈ 1500000).
33
Illustration
Setting
• p⋆(x) = e^{−x/s} / (s (1 + e^{−x/s})²), x ∈ R: logistic density.
• Gθ and Dα are two fully connected neural networks.
• Z ∼ U([0, 1]): scalar noise.
• n = 100000 (so that 1/√n is negligible) and 30 replications.
34
Illustration
Setting
• Generator depth: 3.
• Discriminator depth: 2 then 5.
35
Convergence of ˆθ
(Hreg) Regularity conditions of order 2 on the models (Gθ, pθ and Dα).
Existence
Under (Hreg), ¯θ and ¯α ∈ arg max_{α∈Λ} L(¯θ, α) exist.
(H1) The pair (¯θ, ¯α) is unique and belongs to int(Θ) × int(Λ).
Theorem
Under Assumptions (Hreg) and (H1),
ˆθ → ¯θ a.s. and ˆα → ¯α a.s.
Remarks
• Convergence of ˆθ comes from sup_{θ∈Θ,α∈Λ} |ˆL(θ, α) − L(θ, α)| → 0 a.s.
• It does not need uniqueness of ¯α.
• Convergence of ˆα comes from that of ˆθ.
36
Illustration
Setting
• Three models:
1. Laplace: p⋆(x) = (1/3) e^{−2|x|/3} vs pθ(x) = (1/(√(2π) θ)) e^{−x²/(2θ²)}.
2. Claw: p⋆(x) = pclaw(x) vs pθ(x) = (1/(√(2π) θ)) e^{−x²/(2θ²)}.
3. Exponential: p⋆(x) = e^{−x} 1R+(x) vs pθ(x) = (1/θ) 1[0,θ](x).
• Gθ: generalized inverse of the cdf of pθ.
• Z ∼ U([0, 1]): scalar noise.
• Dα = pα1 / (pα1 + pα0).
• n = 10 to 10000 and 200 replications.
(A grid-search sketch of the estimation problem for model 3 is given below.)
37
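To give an idea of how ˆθ can be computed in such toy models, here is a numpy sketch that solves arg min_θ max_α ˆL(θ, α) by brute-force grid search for the Exponential vs Uniform model. The exponential family used inside Dα, the grids and the sample size are assumptions of this sketch; they are not the choices made in the talk's experiments.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n = 2000
x_real = rng.exponential(1.0, size=n)       # X_i ~ Exp(1), the p_star of model 3
z = rng.uniform(size=n)                     # scalar noise Z ~ U([0, 1])

# Generator of model 3: G_theta(z) = theta * z, the generalized inverse
# of the cdf of U([0, theta]).
def G(theta, z):
    return theta * z

# Discriminator family D_alpha = p_{alpha1} / (p_{alpha1} + p_{alpha0}).
# Assumption of this sketch: p_alpha are exponential densities.
def D(alpha0, alpha1, x):
    num = alpha1 * np.exp(-alpha1 * x)
    return num / (num + alpha0 * np.exp(-alpha0 * x))

def L_hat(theta, alpha0, alpha1):
    fake = G(theta, z)
    return (np.log(D(alpha0, alpha1, x_real)).sum()
            + np.log(1.0 - D(alpha0, alpha1, fake)).sum()) / n

theta_grid = np.linspace(0.5, 5.0, 46)
alpha_grid = np.linspace(0.2, 5.0, 15)

# theta_hat in argmin_theta max_alpha L_hat(theta, alpha), by grid search.
values = [max(L_hat(t, a0, a1) for a0, a1 in product(alpha_grid, alpha_grid))
          for t in theta_grid]
theta_hat = theta_grid[int(np.argmin(values))]
print("theta_hat =", theta_hat)
```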
Illustration
Claw vs Gaussian Exponential vs Uniform
38
Central limit theorem
(Hloc) Local smoothness conditions around (¯θ, ¯α) (such that the Hessians are invertible).
Theorem
Under Assumptions (Hreg), (H1) and (Hloc),
√n (ˆθ − ¯θ) →d N(0, Σ).
Remark
One has ‖Σ‖2 = O(p³ q⁴), which suggests that ˆθ has a large dispersion around ¯θ in the over-parametrized regime.
39
Illustration
Histograms of √n (ˆθ − ¯θ):
Claw vs Gaussian    Exponential vs Uniform
40
Conclusion
Take-home message
A first step for understanding GANs
• From data to sampling.
• The richness of the class of discriminators D controls the gap between GANs and the JS divergence.
• The generator parameters θ are asymptotically normal with rate √n.
Future investigations
1. Impact of the latent variable Z (dimension, distribution) and of the networks (number of layers in Gθ, dimensionality of Θ) on the performance of GANs (currently it is assumed that p⋆ ≪ µ and pθ ≪ µ, which presupposes information on the supporting manifold of p⋆).
2. To what extent are Assumptions (Hε) and (H′ε) satisfied for neural nets as discriminators?
3. Over-parametrized regime: convergence of distributions instead of parameters.
41