InfoGAN is a method for learning disentangled and interpretable representations with generative adversarial networks (GANs). It adds an objective that maximizes the mutual information between a small subset of latent codes and the generated images, so that the latent codes correspond to interpretable factors of variation in the image domain. The paper presents results on the MNIST, faces, chairs, SVHN, and CelebA datasets, where the latent codes discover meaningful, interpretable factors such as digit identity, azimuth, and lighting conditions without any supervision.
2. Introduction
In ordinary GANs, the latent vector z is used in an arbitrary, entangled way: its individual dimensions carry no intrinsic meaning.
It would be desirable to have a more meaningful representation that relates the latent input to properties of the outputs.
3. Disentangled Representation
We wish to disentangle the representation of the output images in the input latent vectors.
That is, we wish to make the values of the latent vector c correspond to features in the generated images.
5. Entropy
Entropy can be intuitively understood as the amount of information that a random variable contains.
The concept is borrowed from statistical thermodynamics and is directly analogous to the physical measure of the randomness of particle states.
H(X) = −Σ_{i=1..n} P(x_i) log_b P(x_i)
H(X|Y) = −Σ_{i,j} p(x_i, y_j) log [ p(x_i, y_j) / p(y_j) ]
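As a quick sanity check, here is a minimal numpy sketch of both quantities for a small, made-up discrete joint distribution (the values below are illustrative only, not from the paper):

```python
import numpy as np

# Illustrative joint distribution p(x, y) over two binary variables (made-up values).
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)                       # marginal P(x_i)
p_y = p_xy.sum(axis=0)                       # marginal p(y_j)

# H(X) = -sum_i P(x_i) log P(x_i)            (natural log, i.e. b = e)
H_X = -np.sum(p_x * np.log(p_x))

# H(X|Y) = -sum_{i,j} p(x_i, y_j) log[ p(x_i, y_j) / p(y_j) ]
H_X_given_Y = -np.sum(p_xy * np.log(p_xy / p_y))

print(H_X, H_X_given_Y)                      # conditioning never increases entropy
```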
6. Mutual Information: Definition
In information theory, the mutual information I(X;Y) between X and Y measures the "amount of information" learned about the random variable X from knowledge of the random variable Y.
The mutual information can be expressed as the difference of two entropy terms:
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
7. Mutual Information: Interpretation
If X and Y are independent, then I(X;Y) = 0, because knowing one variable reveals nothing about the other.
If X and Y are related by a deterministic, invertible function, I(X;Y) is at its maximum.
I(X;Y) is the reduction of uncertainty in X when Y is observed, and vice versa.
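To make the two extreme cases concrete, a small numpy sketch evaluating I(X;Y) = H(X) − H(X|Y) on toy distributions (the helper name and the example joints are ours, natural-log units):

```python
import numpy as np

def mutual_info(p_xy):
    """I(X;Y) = H(X) - H(X|Y) for a discrete joint distribution p(x, y), in nats."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    H_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
    ratio = p_xy / p_y                        # p(x_i, y_j) / p(y_j), broadcast over columns
    mask = p_xy > 0                           # treat 0 * log(0) terms as 0
    H_x_given_y = -np.sum(p_xy[mask] * np.log(ratio[mask]))
    return H_x - H_x_given_y

# Independent X and Y: the joint factorizes, so I(X;Y) = 0.
print(mutual_info(np.outer([0.5, 0.5], [0.7, 0.3])))     # ~0.0

# Deterministic, invertible relation (Y = X): I(X;Y) = H(X), its maximum.
print(mutual_info(np.diag([0.5, 0.5])))                   # ~log(2) ≈ 0.693
```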
8. Mutual Information: Implications
Formulation of a cost function using mutual information: an information-regularized minimax game.
min_G max_D V_I(D, G) = V(D, G) − λ I(c; G(z, c))
z: noise vector representing incompressible, unstructured noise.
c: latent code representing meaningful, structured information.
10. Variational Mutual Information Maximization
In practice, I(c; G(z, c)) cannot be maximized directly because this requires access to the posterior distribution P(c|x), which is intractable.
We can instead obtain a lower bound for I(c; G(z, c)) by using an auxiliary distribution Q(c|x) to approximate P(c|x).
This technique is known as Variational Mutual Information Maximization; the derivation is sketched on the next slide.
11. Variational Mutual Information Maximization
The mutual information is first decomposed into its entropy components.
The auxiliary distribution Q is introduced so that a KL-divergence term between P(c|x) and Q(c|x) appears.
Because the KL divergence is always non-negative, dropping that term yields a lower bound, as written out below.
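The derivation from the InfoGAN paper, written out here for reference (the final step uses the lemma mentioned on the next slide):

```latex
\begin{align*}
I(c;\, G(z,c))
  &= H(c) - H(c \mid G(z,c)) \\
  &= \mathbb{E}_{x \sim G(z,c)}\big[\, \mathbb{E}_{c' \sim P(c \mid x)}[\log P(c' \mid x)] \,\big] + H(c) \\
  &= \mathbb{E}_{x \sim G(z,c)}\big[\, \underbrace{D_{\mathrm{KL}}\big(P(\cdot \mid x)\,\|\,Q(\cdot \mid x)\big)}_{\geq\, 0}
     + \mathbb{E}_{c' \sim P(c \mid x)}[\log Q(c' \mid x)] \,\big] + H(c) \\
  &\geq \mathbb{E}_{x \sim G(z,c)}\big[\, \mathbb{E}_{c' \sim P(c \mid x)}[\log Q(c' \mid x)] \,\big] + H(c) \\
  &= \mathbb{E}_{c \sim P(c),\, x \sim G(z,c)}[\log Q(c \mid x)] + H(c) \;=\; L_I(G, Q)
\end{align*}
```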
12. Variational Mutual Information Maximization
Although H(c) could also be optimized, it is treated as a constant for simplicity. This is done by drawing c from a fixed distribution.
The final equality above, which replaces the expectation over the unknown posterior P(c|x) with an expectation over the fixed prior P(c), is proven by a lemma in the appendix of the paper and holds under suitable regularity conditions.
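Holding H(c) constant simply means sampling c from the same fixed prior at every iteration. A minimal numpy sketch, assuming the MNIST-style setup of one 10-way categorical code, two continuous codes on [-1, 1], and 62 noise dimensions (the helper name and the Gaussian choice for z are ours):

```python
import numpy as np

def sample_latent(batch_size, n_cat=10, n_cont=2, noise_dim=62, rng=np.random):
    """Draw (z, c) from fixed priors so that H(c) stays constant during training."""
    z = rng.normal(size=(batch_size, noise_dim))                 # incompressible noise (Gaussian assumed here)
    cat = rng.randint(n_cat, size=batch_size)                    # uniform categorical code
    c_disc = np.eye(n_cat)[cat]                                  # one-hot encoding of the discrete code
    c_cont = rng.uniform(-1.0, 1.0, size=(batch_size, n_cont))   # uniform continuous codes
    return z, c_disc, c_cont
```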
13. Variational Mutual Information Maximization
Using the lower bound derived above, the information-regularized minimax game becomes:
min_{G,Q} max_D V_InfoGAN(D, G, Q) = V(D, G) − λ L_I(G, Q)
L_I: variational lower bound on the mutual information
λ: weighting hyperparameter
14. Practical Implementation
In practice, we use a neural network to represent Q.
The KL divergence vanishes when Q converges to P, so the bound becomes tight.
Q is just D with an extra fully connected layer; it outputs an estimate of the latent code(s) c (a sketch of this shared architecture follows below).
L_I(G, Q) has been observed to converge faster than the GAN objective.
InfoGAN thus comes essentially for free on top of a GAN.
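A minimal PyTorch sketch of that sharing, assuming a fully connected trunk and illustrative layer sizes (the paper's networks are convolutional; the class and attribute names here are ours):

```python
import torch
import torch.nn as nn

class DiscriminatorWithQ(nn.Module):
    """D and Q share a trunk; each adds its own small head."""
    def __init__(self, img_dim=784, hidden=1024, n_cat=10, n_cont=2):
        super().__init__()
        self.trunk = nn.Sequential(              # shared body (convolutional in the paper)
            nn.Linear(img_dim, hidden),
            nn.LeakyReLU(0.1),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(0.1),
        )
        self.d_head = nn.Linear(hidden, 1)       # real/fake logit for the GAN game
        self.q_head = nn.Sequential(             # extra FC layer(s) parameterizing Q(c|x)
            nn.Linear(hidden, 128),
            nn.LeakyReLU(0.1),
            nn.Linear(128, n_cat + 2 * n_cont),  # categorical logits + mean/log-std per continuous code
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.d_head(h), self.q_head(h)
```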
16. Additional explanation
Cross-entropy loss is used between the estimate of c and the actual c (some implementations use MSE instead, especially for continuous codes).
For discrete codes, the Q outputs are softmax activations; for continuous codes, they may be sigmoid, tanh, or linear.
The original implementation is more involved: rather than a point estimate of the latent code, Q outputs the parameters of its distribution (mean and standard deviation), and the corresponding log-likelihood is maximized, as sketched below.
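A hedged sketch of that loss term, continuing the DiscriminatorWithQ layout above (the split of q_out, the factored-Gaussian treatment of the continuous codes, and the λ weighting follow the paper's description, but exact details vary across implementations):

```python
import math
import torch
import torch.nn.functional as F

def info_loss(q_out, c_cat_target, c_cont_target, n_cat=10, n_cont=2, lam=1.0):
    """Negative of the variational lower bound L_I (up to the constant H(c))."""
    logits = q_out[:, :n_cat]                        # categorical logits of Q(c|x)
    mu = q_out[:, n_cat:n_cat + n_cont]              # mean of the Gaussian part
    log_std = q_out[:, n_cat + n_cont:]              # log standard deviation of the Gaussian part

    # Discrete code: cross entropy = -E[log Q(c|x)]; target is the class index of the true code.
    loss_disc = F.cross_entropy(logits, c_cat_target)

    # Continuous codes: negative log-likelihood of the true codes under a factored Gaussian.
    var = torch.exp(2.0 * log_std)
    nll = 0.5 * (math.log(2.0 * math.pi) + 2.0 * log_std + (c_cont_target - mu) ** 2 / var)
    loss_cont = nll.sum(dim=1).mean()

    return lam * (loss_disc + loss_cont)
```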
17. Analysis
The information lower bound L_I does not increase when training without the information regularization term (i.e., with an ordinary GAN).
With the regularization, L_I quickly rises to its maximum, which is about 2.3 in this case; this matches H(c) = ln 10 ≈ 2.30 for a uniform 10-way categorical code.
18. Results: MNIST
The discrete latent code is highly correlated with the identity of the generated digit.
It can be used to classify MNIST with a 5% error rate, even though InfoGAN is trained without any labels.
A continuous latent code has captured the angle (rotation) of the digits.
We confirm the validity of the latent codes by extending their range beyond the values seen during training.
The ordinary GAN, in contrast, has learned no such interpretable structure: there is no control over which latent dimension learns what.
19. Results: Faces
Multiple continuous latent codes with values ranging between -1 and 1 are used.
The four latent codes shown appear to have captured azimuth, elevation, lighting, and width.
There is smooth interpolation within and even beyond that range.
Moreover, the other details of the face change as well, producing a much more natural image: it is not a simple case where only the target feature changes while the other factors remain unnaturally constant.
23. Conclusion
InfoGAN can learn and interpret salient features of the data without any labels or supervision.
It discovers salient latent factors of variation automatically and exposes them through latent codes that carry that information.