Paper Summary of Disentangling by Factorising (Factor-VAE)
1. Paper Summary:
Disentangling by Factorising
Jun-sik Choi
Department of Brain and Cognitive Engineering,
Korea University
November 26, 2019
2. Overview of paper [2]
To enhance disentangled representation learning, Factor-VAE is proposed.
Factor-VAE enhances disentanglement by encouraging the distribution of representations to be factorial (independent across the dimensions).
Factor-VAE provides a better trade-off between disentanglement and reconstruction quality than β-VAE [1].
Also, a new disentanglement metric is proposed.
3. Unsupervised Disentangled Representation
Disentangled Representation
a representation where a change in one dimension corresponds
to a change in one factor of variation, while being relatively
invariant to changes in other factors. [3]
Why do disentangled representations matter? [4]
Data can be represented in a more interpretable and semantic manner.
Learned disentangled representations are more transferable.
Why learn disentangled representations in an unsupervised manner?
1. Humans are able to learn factors of variation unsupervised.
2. Labels are costly as obtaining them requires a human in the
loop.
3. Labels assigned by humans might be inconsistent or leave out
the factors that are difficult for humans to identify.
4. Factor-VAE
Goal
Obtain a better trade-off between disentanglement and reconstruction, the poor trade-off being one drawback of β-VAE [1].
How?
Factor-VAE augments the VAE objective with a penalty that
encourages the marginal distribution of representations to be
factorial without substantially affecting the quality of
reconstructions.
The penalty is the KL divergence between the marginal distribution of representations and the product of its marginals, estimated and optimized with a discriminator network following the divergence minimisation view of GANs.
5. Trade-off between Disentanglement and Reconstruction in
beta-VAE I
Notations and assumptions
- Observations: $x^{(i)} \in \mathcal{X}$, $i = 1, \dots, N$
- Underlying generative factors: $f = (f_1, \dots, f_K)$
- Latent code that models $f$: $z \in \mathbb{R}^d$
- Prior $p(z) = \mathcal{N}(0, I)$, decoder: $p_\theta(x|z)$, encoder: $q_\theta(z|x)$
Disentanglement of Representation
- Variational posterior for an observation:
$q_\theta(z|x) = \prod_{j=1}^{d} \mathcal{N}\!\left(z_j \mid \mu_j(x), \sigma_j^2(x)\right)$
can be seen as the distribution of representation corresponding
to the data point x.
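To make the factorized Gaussian posterior concrete, here is a minimal sketch of such an encoder in PyTorch; the architecture and names (`Encoder`, `hidden_dim`) are illustrative assumptions rather than the paper's exact network.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a (flattened) observation x to the parameters of a factorized Gaussian q(z|x)."""
    def __init__(self, input_dim, latent_dim=10, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),  # outputs mu and log-variance
        )

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        return mu, logvar

def sample_z(mu, logvar):
    """Reparameterized sample from q(z|x) = prod_j N(z_j | mu_j(x), sigma_j^2(x))."""
    std = (0.5 * logvar).exp()
    return mu + std * torch.randn_like(std)
```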
6. Trade-off between Disentanglement and Reconstruction in
beta-VAE II
- Marginal posterior and disentanglement:
$q(z) = \mathbb{E}_{p_{data}(x)}[q(z|x)] = \frac{1}{N}\sum_{i=1}^{N} q\!\left(z \mid x^{(i)}\right)$
A disentangled representation would have each $z_j$ correspond to precisely one underlying factor $f_k$, so we want $q(z)$ to be independently factorized:
$q(z) = \prod_{j=1}^{d} q(z_j)$
7. Trade-off between Disentanglement and Reconstruction in
beta-VAE III
Further decomposition of the β-VAE objective
- The β-VAE objective:
$\frac{1}{N}\sum_{i=1}^{N}\left[\mathbb{E}_{q(z|x^{(i)})}\left[\log p\!\left(x^{(i)} \mid z\right)\right] - \beta\, KL\!\left(q\!\left(z \mid x^{(i)}\right) \,\|\, p(z)\right)\right]$
is a lower bound of $\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(x^{(i)}\right)$ for $\beta \geq 1$, where
$\mathbb{E}_{q(z|x^{(i)})}\left[\log p\!\left(x^{(i)} \mid z\right)\right]$ : negative reconstruction error,
$KL\!\left(q\!\left(z \mid x^{(i)}\right) \,\|\, p(z)\right)$ : complexity penalty.
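As a concrete reference, a minimal sketch of this objective turned into a per-batch loss (its negation, to be minimised), assuming a factorized Gaussian encoder and a Bernoulli decoder; the names (`beta_vae_loss`, `recon_logits`) are illustrative.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, recon_logits, mu, logvar, beta=4.0):
    """Negative beta-VAE objective for a batch, averaged over examples.

    The reconstruction term E_q(z|x)[log p(x|z)] is approximated with a single
    z sample (already decoded into recon_logits); the KL term has a closed form
    for a factorized Gaussian posterior and a standard normal prior.
    """
    # Bernoulli log-likelihood of x under the decoder, summed over pixels.
    recon_ll = -F.binary_cross_entropy_with_logits(
        recon_logits, x, reduction="none").sum(dim=-1)
    # KL(N(mu, sigma^2) || N(0, I)) = 0.5 * sum_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
    return (-recon_ll + beta * kl).mean()
```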
8. Trade-off between Disentanglement and Reconstruction in
beta-VAE IV
- The KL term can be further decomposed as:
$\mathbb{E}_{p_{data}(x)}[KL(q(z|x) \,\|\, p(z))] = I_q(x; z) + KL(q(z) \,\|\, p(z))$
Proof:
$\mathbb{E}_{p_{data}(x)}[KL(q(z|x) \,\|\, p(z))]$
$= \mathbb{E}_{p_{data}(x)}\mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p(z)}\right]$
$= \mathbb{E}_{p_{data}(x)}\mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{q(z)} \cdot \frac{q(z)}{p(z)}\right]$
$= \mathbb{E}_{p_{data}(x)}\mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{q(z)} + \log \frac{q(z)}{p(z)}\right]$
$= \mathbb{E}_{p_{data}(x)}[KL(q(z|x) \,\|\, q(z))] + \mathbb{E}_{q(x,z)}\left[\log \frac{q(z)}{p(z)}\right]$
$= I_q(x; z) + \mathbb{E}_{q(z)}\left[\log \frac{q(z)}{p(z)}\right]$
$= I_q(x; z) + KL(q(z) \,\|\, p(z))$
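A quick numerical sanity check of this decomposition on a toy discrete model; the distributions below are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

# Toy discrete model: x takes 2 values, z takes 3 values.
p_x = np.array([0.3, 0.7])                       # p_data(x)
q_z_given_x = np.array([[0.7, 0.2, 0.1],         # q(z|x=0)
                        [0.1, 0.3, 0.6]])        # q(z|x=1)
p_z = np.array([1/3, 1/3, 1/3])                  # prior p(z)

kl = lambda p, q: np.sum(p * np.log(p / q))

# Left-hand side: E_{p_data(x)}[ KL(q(z|x) || p(z)) ]
lhs = np.sum(p_x * np.array([kl(q_z_given_x[i], p_z) for i in range(2)]))

# Right-hand side: I_q(x; z) + KL(q(z) || p(z))
q_z = p_x @ q_z_given_x                          # marginal q(z)
mi = np.sum(p_x[:, None] * q_z_given_x * np.log(q_z_given_x / q_z))
rhs = mi + kl(q_z, p_z)

print(lhs, rhs)   # the two values agree up to floating-point error
```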
9. Trade-off between Disentanglement and Reconstruction in
beta-VAE V
$\mathbb{E}_{p_{data}(x)}[KL(q(z|x) \,\|\, p(z))] = I_q(x; z) + KL(q(z) \,\|\, p(z))$
- When increasing the complexity penalty by setting β > 1, both $KL(q(z) \,\|\, p(z))$ and $I_q(x; z)$ are penalized.
- Penalizing $KL(q(z) \,\|\, p(z))$ pushes $q(z)$ towards the factorial prior $p(z)$.
- Penalizing $I_q(x; z)$ reduces the amount of information about x stored in z, which leads to poor reconstruction.
10. Total Correlation Penalty I
Factor-VAE objective
$\frac{1}{N}\sum_{i=1}^{N}\left[\mathbb{E}_{q(z|x^{(i)})}\left[\log p\!\left(x^{(i)} \mid z\right)\right] - KL\!\left(q\!\left(z \mid x^{(i)}\right) \,\|\, p(z)\right)\right] - \gamma\, KL(q(z) \,\|\, \bar{q}(z))$
where $\bar{q}(z) := \prod_{j=1}^{d} q(z_j)$. The first two terms form a lower bound on the marginal log likelihood $\mathbb{E}_{p_{data}(x)}[\log p(x)]$, and the last term directly encourages independence in the code distribution.
Total correlation [5]: $KL(q(z) \,\|\, \bar{q}(z))$
A popular measure of dependence for multiple random variables.
As both $q(z)$ and $\bar{q}(z)$ are intractable, an alternative approach for optimizing the total correlation is required.
(Definition: see slide 29, Total Correlation.)
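A minimal sketch of this objective as a training loss, assuming the total-correlation term is estimated from a discriminator's logits as described on slide 12; `d_logits_on_q` and the other names are illustrative assumptions.

```python
import torch

def factor_vae_loss(recon_ll, mu, logvar, d_logits_on_q, gamma):
    """Negative Factor-VAE objective for a batch (to be minimised).

    recon_ll:       one-sample estimate of E_q(z|x)[log p(x|z)] per example
    mu, logvar:     parameters of the factorized Gaussian q(z|x)
    d_logits_on_q:  discriminator logits for z ~ q(z); with a sigmoid output,
                    the logit equals log D(z) - log(1 - D(z)), i.e. the
                    density-ratio estimate of log q(z)/q_bar(z)
    gamma:          weight of the total-correlation penalty
    """
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
    tc_estimate = d_logits_on_q.mean()   # approximates KL(q(z) || q_bar(z))
    return (-recon_ll + kl).mean() + gamma * tc_estimate
```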
11. Total Correlation Penalty II
Alternative way to sample from $\bar{q}(z)$ when optimizing the total correlation:
1. Sample from $q\!\left(z \mid x^{(i)}\right)$ with uniformly sampled $x^{(i)}$ to obtain samples from $q(z)$.
2. Generate d such samples from $q(z)$ and ignore all but one dimension of each sample.
Or, more efficiently:
1. Sample a batch from $q(z)$.
2. Randomly permute across the batch for each latent dimension.
As long as the batch is large enough, the distribution of these samples will closely approximate $\bar{q}(z)$; a sketch of this permutation trick is given below.
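A minimal sketch of the batch-permutation trick, assuming z is a (batch_size, d) tensor of samples from $q(z)$; the function name follows the paper's `permute_dims` terminology.

```python
import torch

def permute_dims(z):
    """Approximate samples from q_bar(z) = prod_j q(z_j).

    For each latent dimension, independently permute the batch indices.
    This breaks dependencies between dimensions while preserving each
    marginal q(z_j), so the result approximates q_bar(z) for large batches.
    """
    assert z.dim() == 2  # (batch_size, d)
    batch_size, d = z.size()
    permuted = torch.empty_like(z)
    for j in range(d):
        idx = torch.randperm(batch_size, device=z.device)
        permuted[:, j] = z[idx, j]
    return permuted
```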
12. Total Correlation Penalty III
Minimization of the KL divergence
By training a classifier (discriminator), the density ratio that arises in the KL term is approximated (density-ratio trick [6]):
$TC(z) = KL(q(z) \,\|\, \bar{q}(z)) = \mathbb{E}_{q(z)}\left[\log \frac{q(z)}{\bar{q}(z)}\right] \approx \mathbb{E}_{q(z)}\left[\log \frac{D(z)}{1 - D(z)}\right]$
The discriminator and the VAE are trained jointly.
The discriminator is trained to classify between samples from $q(z)$ and $\bar{q}(z)$, so $D(z)$ approximates the probability that z is a sample from $q(z)$ rather than from $\bar{q}(z)$.
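A minimal sketch of the discriminator update under this density-ratio trick, assuming the `permute_dims` helper above and a single-logit MLP discriminator; the architecture and optimiser settings are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 10  # illustrative

# Discriminator: maps z to one logit; sigmoid(logit) = D(z).
discriminator = nn.Sequential(
    nn.Linear(latent_dim, 1000), nn.LeakyReLU(0.2),
    nn.Linear(1000, 1000), nn.LeakyReLU(0.2),
    nn.Linear(1000, 1),
)
d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def discriminator_step(z, z_other_batch):
    """One update of D: classify samples from q(z) vs. permuted samples from q_bar(z)."""
    z_perm = permute_dims(z_other_batch).detach()
    logits_q = discriminator(z.detach())
    logits_qbar = discriminator(z_perm)
    # Label q(z) samples as 1 and q_bar(z) samples as 0.
    d_loss = (F.binary_cross_entropy_with_logits(logits_q, torch.ones_like(logits_q))
              + F.binary_cross_entropy_with_logits(logits_qbar, torch.zeros_like(logits_qbar)))
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()
    return d_loss.item()

# For the VAE update, the TC estimate is discriminator(z).mean(),
# since the logit equals log D(z) - log(1 - D(z)).
```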
15. Metric for Disentanglement I
Disentanglement metric proposed in [1]
Weaknesses
1. The metric is sensitive to the hyperparameters of the linear classifier's optimisation.
2. A learned factor can be captured by a linear combination of several dimensions, so using a linear classifier is inappropriate.
3. The metric has a failure mode: when only K − 1 factors out of K factors are disentangled, the classifier can still give 100% accuracy.
16. Metric for Disentanglement II
Proposed metric for disentanglement
1. Choose a factor k and generate data with this factor fixed, but all other factors varying randomly.
2. Obtain their representations.
3. Normalize each dimension by its empirical standard deviation $s_d$ over the full data (or a large enough random subset).
4. Take the empirical variance $\mathrm{Var}_l\!\left[z_d^{(l)}/s_d\right]$ in each dimension of the normalized representations.
5. The target index k and the index of the dimension with the lowest variance are fed to the majority-vote classifier (a code sketch follows slide 17).
If the representation is perfectly disentangled, the variance of the dimension corresponding to the fixed factor will be 0.
17. Metric for Disentanglement III
As representations are normalized, $\arg\min_d \mathrm{Var}_l\!\left[z_d^{(l)}/s_d\right]$ is invariant to rescaling of the representations in each dimension.
Majority-vote classification¹
1. For each set of L samples, one vote $(a_i, b_i)$ with $a_i \in \{1, \dots, D\}$, $b_i \in \{1, \dots, K\}$ is obtained.
2. Given M votes $(a_i, b_i)_{i=1}^{M}$, the voting matrix $V_{jk} = \sum_{i=1}^{M} \mathbb{I}(a_i = j, b_i = k)$ is obtained.
3. Then, the majority-vote classifier is defined to be $C(j) = \arg\max_k V_{jk}$.
4. In other words, C(j) is the index of the generative factor that most often yields the lowest variance in latent dimension j.
5. The metric is the accuracy of the classifier, $\frac{\sum_{j=1}^{D} V_{jC(j)}}{\sum_j \sum_k V_{jk}}$.
Note that for the majority-vote classifier there are no optimisation hyperparameters to tune, and the resulting classifier is a deterministic function of the training data.
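A minimal sketch of the whole metric (slides 16-17), assuming hypothetical helpers `encode(x)` returning mean representations and `sample_fixed_factor(k, L)` returning a batch with factor k fixed; the names and default counts are illustrative, not the authors' reference code.

```python
import numpy as np

def disentanglement_metric(encode, sample_fixed_factor, num_factors,
                           x_full, num_votes=800, L=100):
    """Factor-VAE-style disentanglement metric (sketch).

    encode(x) -> (n, d) array of representations
    sample_fixed_factor(k, L) -> L observations with factor k fixed, others random
    x_full -> a large batch used to estimate the per-dimension scales s_d
    """
    s = encode(x_full).std(axis=0)                 # empirical std s_d per dimension
    d = s.shape[0]
    votes = np.zeros((d, num_factors), dtype=int)  # voting matrix V[j, k]
    for _ in range(num_votes):
        k = np.random.randint(num_factors)         # index of the fixed factor
        z = encode(sample_fixed_factor(k, L)) / s  # normalized representations
        j = int(np.argmin(z.var(axis=0)))          # dimension with the lowest variance
        votes[j, k] += 1                           # one vote (a_i, b_i) = (j, k)
    # Majority-vote classifier C(j) = argmax_k V[j, k]; the metric is its accuracy.
    return votes[np.arange(d), votes.argmax(axis=1)].sum() / votes.sum()
```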
18. Metric for Disentanglement IV
Comparison between metrics ([1, 2])
1. The new disentanglement metric of [2] is much less sensitive to hyperparameters than the old metric of [1].
2. The old metric is very sensitive to the number of training iterations, and it keeps improving with more iterations.
¹ Please refer to the code [Link] for more details.
19. Experiments I
Datasets
Datasets with known generative factors
1. 2D Shapes dataset [7] with n = 737,280, dim = 64 × 64
   f_k: shape (3), scale (6), orientation (40), x-position (32), y-position (32)
2. 3D Shapes dataset [8] with n = 480,000, dim = 64 × 64 × 3
   f_k: shape (4), scale (8), orientation (15), floor color (10), wall color (10), object color (10)
Datasets with unknown generative factors
1. 3D Faces dataset [9] with n = 239,840, dim = 64 × 64 × 3
2. 3D Chairs dataset [10] with n = 86,366, dim = 64 × 64 × 3
3. CelebA dataset (cropped) [11] with n = 202,599, dim = 64 × 64 × 3
27. Conclusion
This work introduces Factor-VAE, a novel method for learning disentangled representations.
A new disentanglement metric is proposed.
Limitations
Low total correlation is necessary but not sufficient for disentangling independent factors of variation. (If all but one of the latent dimensions were to collapse to the prior, the TC would be 0, yet the representation would not be disentangled.)
The proposed metric requires generating samples with one factor held fixed, which is not always possible (e.g. when the training set does not cover all combinations of factors).
The metric is also unsuitable for data with non-independent factors of variation.
28. References
[1] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," ICLR, vol. 2, no. 5, p. 6, 2017.
[2] H. Kim and A. Mnih, "Disentangling by factorising," arXiv preprint arXiv:1802.05983, 2018.
[3] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[4] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, "Building machines that learn and think like people," Behavioral and Brain Sciences, vol. 40, 2017.
[5] S. Watanabe, "Information theoretical analysis of multivariate correlation," IBM Journal of Research and Development, vol. 4, no. 1, pp. 66–82, 1960.
29. Total Correlation
Definition
For given n random variables $\{X_1, X_2, \dots, X_n\}$, the total correlation is defined as the KL divergence from the joint distribution $p(X_1, \dots, X_n)$ to the independent distribution $p(X_1)p(X_2)\cdots p(X_n)$:
$TC(X_1, X_2, \dots, X_n) \equiv D_{KL}\!\left[p(X_1, \dots, X_n) \,\|\, p(X_1)\,p(X_2)\cdots p(X_n)\right]$
$TC(X_1, X_2, \dots, X_n) = \sum_{i=1}^{n} H(X_i) - H(X_1, X_2, \dots, X_n)$
= the amount of information shared among the variables in the set.
A near-zero TC indicates that the variables in the group are essentially statistically independent.
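A quick numerical illustration of this identity on a toy pair of binary variables; the joint distribution below is an arbitrary illustrative choice.

```python
import numpy as np

# Toy joint distribution over two binary variables X1, X2.
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
p1 = p_joint.sum(axis=1)          # marginal p(X1)
p2 = p_joint.sum(axis=0)          # marginal p(X2)

entropy = lambda p: -np.sum(p * np.log(p))

# TC via the KL definition: KL(p(X1, X2) || p(X1) p(X2))
tc_kl = np.sum(p_joint * np.log(p_joint / np.outer(p1, p2)))
# TC via entropies: H(X1) + H(X2) - H(X1, X2)
tc_h = entropy(p1) + entropy(p2) - entropy(p_joint)

print(tc_kl, tc_h)   # both give the same value (about 0.19 nats)
```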