Disentangled representation is a holy grail of representation learning: it factorizes data into human-understandable factors in an unsupervised way, which helps us move toward interpretable machine learning.
2. Overview
1 Background Knowledge
  Information Quantities
  Rate-Distortion Theory and Information Bottleneck
  Variational Inference
2 Build up Frameworks for Disentangled Representations
3 Isolating Sources of Disentanglement
  ELBO Surgery
  Evaluate Disentanglement
  Experiments
4 Conclusion
3. Before we start ...
What is disentanglement?
Disentangled Representation = Factorized + Interpretable
Reuse and generalize knowledge
Extrapolate beyond training data distribution
Questions to be answered in this series of discussions:
(Part I) Why is the VAE the main framework for realizing disentanglement? [Chen, 2018]
(Part II) Why is there a trade-off between reconstruction and disentanglement? [Alemi, 2018]
(Part II) Is disentanglement a task or a principle? [Achille, 2018]
5. Quick Review
Consider a beta decay process in which we observe N electron spins:

↑↑↓↓↑ ... ↑

The number of possible states of N spins, a fraction p of which point up, is

\[
\frac{N!}{(pN)!\,((1-p)N)!} \sim \frac{N^N}{(pN)^{pN}\,((1-p)N)^{(1-p)N}} = \frac{1}{p^{pN}(1-p)^{(1-p)N}} = 2^{NS}
\]

where S is the Shannon entropy per spin:

\[
S = -p \log p - (1-p)\log(1-p)
\]

The number of bits of information one gains by actually observing such a state is NS.
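As a quick numerical check (my addition, not from the slides), the following Python sketch compares \(\log_2\) of the exact binomial count against \(N \cdot S\); the helper names are mine.

```python
import math

def log2_binom(N, k):
    """log2 of the binomial coefficient C(N, k), via log-gamma to avoid overflow."""
    return (math.lgamma(N + 1) - math.lgamma(k + 1)
            - math.lgamma(N - k + 1)) / math.log(2)

def entropy_per_spin(p):
    """Shannon entropy per spin, in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

N, p = 100_000, 0.3
print(log2_binom(N, int(p * N)) / N)  # ~0.8812 bits per spin
print(entropy_per_spin(p))            # 0.8813 bits per spin
```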
6. Quick Review
In the general case,

\[
\frac{N!}{\prod_{i=1}^{k}(p_i N)!} \sim \frac{N^N}{\prod_{i=1}^{k}(p_i N)^{p_i N}} = 2^{NS}
\]

such that

\[
S = -\sum_i p_i \log p_i
\]
7. Quick Review
We have a theory that predicts a probability distribution Q for the final state; however, the correct probability distribution is P. After observing N decays, we will see outcome i approximately \(p_i N\) times, so the probability of that observation under the theory is

\[
P = \left(\prod_i q_i^{p_i N}\right) \frac{N!}{\prod_j (p_j N)!}
\]

We already calculated \(N!/\prod_j (p_j N)! \sim 2^{-N \sum_i p_i \log p_i}\), so

\[
P \sim 2^{-N \sum_i p_i (\log p_i - \log q_i)}
\]

The quantity in the exponent is called the relative entropy, or Kullback-Leibler divergence:

\[
D_{KL}(p\,\|\,q) = \sum_i p_i (\log p_i - \log q_i)
\]
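A minimal Monte Carlo sketch (my addition) of this decay rate: for data drawn from P, the average per-sample log-likelihood ratio between P and Q converges to \(D_{KL}(p\|q)\), so the data's probability under Q shrinks as \(2^{-N D_{KL}}\). The toy distributions are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # true distribution P
q = np.array([0.4, 0.4, 0.2])  # the theory's prediction Q

dkl = np.sum(p * (np.log2(p) - np.log2(q)))  # D_KL(p || q), in bits

N = 100_000
samples = rng.choice(len(p), size=N, p=p)
empirical = np.mean(np.log2(p[samples]) - np.log2(q[samples]))
print(dkl, empirical)  # the two values should nearly agree
```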
8. Quick Review
Quantities:
- Entropy: \(S_x = -\sum_x p(x) \log p(x)\), the information we do not yet know.
- Relative entropy (KL divergence): \(D_{KL}(p(x)\,\|\,q(x)) = \sum_x p(x)\,[\log p(x) - \log q(x)]\)
- Mutual information:
  \(I(x; y) = S_x - S_{x|y} = S_y - S_{y|x} = S_{x,y} - S_{x|y} - S_{y|x}\)
  \(I(x; y) = D_{KL}\big(p(x, y)\,\|\,p(x)p(y)\big)\)
  - Symmetric between x and y
  - Extreme cases: independence (I = 0) and deterministic relation
- Relations:
  - Chain rule: \(p(x, y) = p(x|y)\,p(y)\)
  - Bayes' rule: \(p(y|x) = \dfrac{p(x|y)\,p(y)}{p(x)}\)
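A small sketch (my addition) computing these quantities for a discrete joint distribution given as a 2-D array; the toy joint and the function names are mine.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    mask = p > 0
    return np.sum(p[mask] * (np.log2(p[mask]) - np.log2(q[mask])))

def mutual_information(pxy):
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginals
    # I(x; y) = S_x + S_y - S_{x,y}
    return entropy(px) + entropy(py) - entropy(pxy.ravel())

pxy = np.array([[0.30, 0.10],
                [0.10, 0.50]])                  # a toy joint p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
print(mutual_information(pxy))                               # ~0.2565 bits
print(kl_divergence(pxy.ravel(), np.outer(px, py).ravel()))  # same value
```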
11. Rate-Distortion Theory
What makes a good encoding? Low rate, low distortion:

\[
\min_{p(\tilde{x}|x)} I(X; \tilde{X}) \quad \text{s.t.} \quad d(X, \tilde{X}) < D
\]

Theorem (Rate-Distortion, Shannon and Kolmogorov)
Define R(D) as the minimum achievable rate under the distortion constraint D:

\[
R(D) = \min_{p(\tilde{x}|x)\,:\,d(x,\tilde{x}) < D} I(X; \tilde{X})
\]

Then an encoding that achieves this rate is

\[
p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)}\, e^{-\beta d(x, \tilde{x})}
\]
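The fixed-point form of this encoder suggests an alternating scheme, known as the Blahut-Arimoto algorithm. A minimal sketch (my addition, with a toy binary source and Hamming distortion) alternates the optimal encoder \(p(\tilde{x}|x) \propto p(\tilde{x})\,e^{-\beta d(x,\tilde{x})}\) with the induced marginal \(p(\tilde{x}) = \sum_x p(x)\,p(\tilde{x}|x)\):

```python
import numpy as np

def blahut_arimoto(px, d, beta, iters=200):
    """px: source distribution [n]; d: distortion matrix [n, m]."""
    q = np.full(d.shape[1], 1.0 / d.shape[1])    # initial marginal p(xhat)
    for _ in range(iters):
        enc = q[None, :] * np.exp(-beta * d)     # unnormalized p(xhat|x)
        enc /= enc.sum(axis=1, keepdims=True)    # normalize by Z(x, beta)
        q = px @ enc                             # induced marginal p(xhat)
    rate = np.sum(px[:, None] * enc * np.log2(enc / q[None, :]))
    distortion = np.sum(px[:, None] * enc * d)
    return rate, distortion

px = np.array([0.5, 0.5])
d = 1.0 - np.eye(2)                  # Hamming distortion, binary source
for beta in (0.5, 2.0, 8.0):         # sweeping beta traces out R(D)
    print(beta, blahut_arimoto(px, d, beta))
```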
13. Information System
Sending a signal from Alice to Bob: \(X \to \tilde{X}\), where Y is the relevant information about X.
14. Information Bottleneck Theory
What makes a good encoding? Low rate, high relevance:

\[
\min_{p(\tilde{x}|x)} I(\tilde{X}; X) \quad \text{s.t.} \quad I(\tilde{X}; Y) > L
\]

Theorem (Information Bottleneck, Tishby, Pereira, and Bialek)
Define R(L) as the minimum achievable rate while preserving at least L bits of relevant mutual information:

\[
R(L) = \min_{p(\tilde{x}|x)\,:\,I(\tilde{X}; Y) \ge L} I(X; \tilde{X})
\]

Then an encoding that achieves this rate is

\[
p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)}\, e^{-\beta D_{KL}[p(y|x)\,\|\,p(y|\tilde{x})]}
\]
15. Comparison
Rate-Distortion. What makes a good encoding? Low rate, low distortion:

\[
p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)}\, e^{-\beta d(x, \tilde{x})}
\]

Information Bottleneck. What makes a good code? Low rate, high relevance:

\[
p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)}\, e^{-\beta D_{KL}[p(y|x)\,\|\,p(y|\tilde{x})]}
\]
16. Structure of the Solution
On the structure of the solution:

\[
\mathcal{L}[p(\tilde{x}|x)] = I(X; \tilde{X}) - \beta I(\tilde{X}; Y)
\]

The Lagrange multiplier β acts as the trade-off parameter between the complexity of the representation and the preserved relevant information:
- \(I(\tilde{X}; Y)\) is the measure of performance
- \(I(X; \tilde{X})\) is the regularization term

A self-consistent iteration for this Lagrangian is sketched below.
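A minimal sketch (my addition) of the self-consistent IB iteration on a toy discrete problem: alternate the encoder \(p(t|x) \propto p(t)\,e^{-\beta D_{KL}[p(y|x)\,\|\,p(y|t)]}\) with the induced marginal p(t) and decoder p(y|t). All function names and the toy p(y|x) are mine.

```python
import numpy as np

def kl_rows(a, b, eps=1e-12):
    """Row-wise KL divergence between stacks of distributions (in nats)."""
    return np.sum(a * (np.log(a + eps) - np.log(b + eps)), axis=-1)

def ib_iterate(px, py_x, n_t, beta, iters=300, seed=0):
    """px: p(x) [n]; py_x: p(y|x) [n, k]; n_t: number of codewords t."""
    rng = np.random.default_rng(seed)
    pt_x = rng.dirichlet(np.ones(n_t), size=len(px))  # random initial encoder
    for _ in range(iters):
        pt = px @ pt_x + 1e-12                        # marginal p(t)
        # decoder p(y|t) = sum_x p(y|x) p(x) p(t|x) / p(t)
        py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]
        d = kl_rows(py_x[:, None, :], py_t[None, :, :])  # KL[p(y|x)||p(y|t)]
        pt_x = pt[None, :] * np.exp(-beta * d)           # encoder update
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    return pt_x

px = np.full(4, 0.25)
py_x = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
print(ib_iterate(px, py_x, n_t=2, beta=5.0).round(2))  # two soft clusters
```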
18. Inference under Posterior
- X: observation, input data
- Z: latent variable, representation, embedding

\[
\underbrace{p(z|x)}_{\text{posterior}} = \frac{\overbrace{p(x|z)}^{\text{likelihood}}\;\overbrace{p(z)}^{\text{prior}}}{\underbrace{p(x)}_{\text{evidence}}}
\]

Because the evidence p(x) is intractable, there are two parallel ways to solve the inference problem:
- MCMC
- Variational Inference
19. Variational Inference
Propose a simpler, tractable distribution q(z|x) to approximate the posterior p(z|x):

\[
D_{KL}\big(q(z|x)\,\|\,p(z|x)\big) = \mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)\,p(x)}{p(x, z)}\right] = \mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p(x, z)}\right] + \log p(x)
\]

Rearranging the two sides, we get

\[
\log p(x) = D_{KL}\big(q(z|x)\,\|\,p(z|x)\big) + \mathbb{E}_{q(z|x)}\left[\log \frac{p(x, z)}{q(z|x)}\right]
\]
20. Reduce KL Divergence to ELBO
Because the KL divergence is non-negative, the identity

\[
\log p(x) = D_{KL}\big(q(z|x)\,\|\,p(z|x)\big) + \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\big[q(z|x)\,\|\,p(z)\big]
\]

gives the Evidence Lower Bound, ELBO (per sample):

\[
\log p(x_n) \ge \mathbb{E}_{q(z|x_n)}\left[\log \frac{p(x_n, z)}{q(z|x_n)}\right] = \mathcal{L}_{\text{ELBO}}
\]

\[
\mathcal{L}_{\text{ELBO}} = \underbrace{\mathbb{E}_{q(z|x_n)}[\log p(x_n|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}\big(q(z|x_n)\,\|\,p(z)\big)}_{\text{regularization}}
\]
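A per-sample ELBO sketch (my addition) for the common Gaussian-encoder, Bernoulli-decoder VAE, using the closed-form KL between a diagonal Gaussian and the standard normal prior; assumes PyTorch and that the decoder outputs logits.

```python
import torch
import torch.nn.functional as F

def elbo(x, x_logits, mu, logvar):
    """x: targets in [0, 1]; x_logits: decoder outputs; mu, logvar: q(z|x)."""
    # Reconstruction term E_q[log p(x|z)], one Monte Carlo sample of z
    recon = -F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").sum(dim=-1)
    # Regularization term KL[q(z|x) || N(0, I)], closed form for diagonal Gaussians
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    return recon - kl  # maximize this, i.e. minimize its negative
```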
21. Implement ELBO using VAE
Figure: ELBO Structure in VAE
22. Implement ELBO using VAE
Figure: Reparameterization Trick for Back-propagation
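A minimal sketch (my addition) of the trick shown in the figure: rewrite sampling as \(z = \mu + \sigma \odot \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\), so the sample is differentiable with respect to the encoder outputs.

```python
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)  # sigma = exp(logvar / 2)
    eps = torch.randn_like(std)    # the noise carries all the stochasticity
    return mu + std * eps          # differentiable w.r.t. mu and logvar
```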
23. Build up a Framework for Disentanglement
24. Build up β-VAE framework
Suppose the data-generation process is affected by two types of factors:

\[
p(x|z) \approx p(x|v, w)
\]

where v are conditionally independent factors and w are conditionally dependent factors.
We maximize the likelihood of the observed data over the whole latent distribution; the aim of disentanglement is also to ensure the inferred latents capture the generative factors v in an independent manner:

\[
\max_\theta \; \mathbb{E}_{p_\theta(z)}[p_\theta(x|z)] \quad \text{s.t.} \quad D_{KL}\big(q(z|x)\,\|\,p(z)\big) < \epsilon
\]
25. β-VAE
The objective function of β-VAE [Higgins, 2017] is

\[
\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta\, D_{KL}\big[q(z|x)\,\|\,p(z)\big]
\]

Understanding the effect of β (see the sketch after this list):
- Reconstruction quality is a poor indicator of learnt disentanglement
- Good disentanglement often leads to blurry reconstructions
- Disentangling restricts the capacity of the latent channel
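A sketch of the β-VAE loss (my addition), identical to the ELBO above except for the β weight on the KL term; assumes PyTorch and the same Gaussian-encoder, Bernoulli-decoder setup.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_logits, mu, logvar, beta=4.0):
    recon = F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").sum(dim=-1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    # beta > 1 squeezes the latent channel harder: typically more
    # disentangled factors, at the cost of blurrier reconstructions
    return (recon + beta * kl).mean()
```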
28. ELBO Surgery
We conjecture two criteria may be important:
- Mutual information between the data variable and the latent variable
- Independence of the latent variables
29. ELBO Surgery
To further understand the ELBO, we rewrite it in terms of the average encoding distribution [Hoffman, 2017]. Identify each training example with a unique index n ∈ {1, 2, ..., N} and define p(n) = 1/N, q(z|n) = q(z|x_n), and q(z, n) = q(z|n)p(n) = (1/N) q(z|n). The marginal (aggregated) distribution is

\[
q(z) = \mathbb{E}_{p(n)}[q(z|n)] = \sum_n q(z|n)\,p(n)
\]
33. ELBO TC-Decomposition
The modified ELBO decomposes as

\[
\underbrace{\mathbb{E}_{q(z,n)}[\log p(n|z)]}_{\text{reconstruction}} \;-\; \alpha \underbrace{I_q(z; n)}_{\text{index-code MI}} \;-\; \beta \underbrace{D_{KL}\big[q(z)\,\|\,\textstyle\prod_j q(z_j)\big]}_{\text{total correlation}} \;-\; \gamma \underbrace{\textstyle\sum_j D_{KL}\big[q(z_j)\,\|\,p(z_j)\big]}_{\text{dim-wise KL}}
\]

- Index-code MI: \(D_{KL}\big[q(z, n)\,\|\,q(z)p(n)\big] = I_q(z; n)\)
  - Dropping the penalty improves disentanglement
  - Keeping this term improves disentanglement according to the IB
  - Dataset dependent
- Total correlation: \(D_{KL}\big[q(z)\,\|\,\prod_j q(z_j)\big]\)
  - A heavier penalty on this term induces disentanglement
  - TC forces the model to find statistically independent factors
- Dim-wise KL: \(\sum_j D_{KL}\big[q(z_j)\,\|\,p(z_j)\big]\)
  - Prevents each latent dimension from deviating from its prior
34. Minibatch Sampling: Stochastic Estimation of log q(z)
Evaluating the density q(z) exactly requires a pass over the whole dataset, and a randomly chosen n would make q(z|n) close to zero. Inspired by importance sampling, for a given minibatch {n_1, n_2, ..., n_M} we can use an estimator that re-utilizes the batch:

\[
\mathbb{E}_{q(z|x)}[\log q(z)] = \mathbb{E}_{q(z|x)}\Big[\log \mathbb{E}_{n' \sim p(n')}\big[q(z|n')\big]\Big] \approx \frac{1}{M} \sum_{i=1}^{M} \log \frac{1}{MN} \sum_{j=1}^{M} q\big(z(n_i)\,|\,n_j\big)
\]

where z(n_i) is a sample from q(z|n_i). Treat q(z) as a mixture distribution in which the data index n indicates the mixture component.
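A sketch (my addition) of this minibatch-weighted estimator for diagonal-Gaussian encoders, assuming PyTorch: build the M x M matrix of \(\log q(z(n_i)|n_j)\) by broadcasting, then apply logsumexp with the 1/(MN) weight. The function names are mine.

```python
import math
import torch

def gaussian_log_density(z, mu, logvar):
    """Elementwise log N(z; mu, diag(exp(logvar)))."""
    c = math.log(2 * math.pi)
    return -0.5 * (c + logvar + (z - mu).pow(2) / logvar.exp())

def log_qz_estimate(z, mu, logvar, dataset_size):
    """z, mu, logvar: [M, D] for one minibatch; returns log q(z_i), shape [M]."""
    M = z.shape[0]
    # [M, M, D] matrix of per-dimension log q(z_i | n_j) via broadcasting
    log_q = gaussian_log_density(
        z.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))
    # log q(z_i) ~= logsumexp_j [sum_d log q(z_i,d | n_j)] - log(M * N)
    return torch.logsumexp(log_q.sum(dim=2), dim=1) - math.log(M * dataset_size)
```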
35. Special case: β-TCVAE
With minibatch sampling, it becomes possible to assign different weights (α, β, γ) to the terms:

\[
\mathcal{L}_{\beta\text{-TC}} = \mathbb{E}_{q(z|n)p(n)}[\log p(n|z)] - \alpha\, I_q(z; n) - \beta\, D_{KL}\big(q(z)\,\|\,\textstyle\prod_j q(z_j)\big) - \gamma \textstyle\sum_j D_{KL}\big(q(z_j)\,\|\,p(z_j)\big)
\]

The proposed β-TCVAE uses α = γ = 1 and treats β as the hyper-parameter.
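A sketch (my addition) of the resulting regularizer, combining the three decomposed terms with the minibatch estimator above (the Gaussian log-density helper is repeated so the block stands alone); this is the simple minibatch-weighted-sampling variant, to be added to the reconstruction loss.

```python
import math
import torch

def gaussian_log_density(z, mu, logvar):
    c = math.log(2 * math.pi)
    return -0.5 * (c + logvar + (z - mu).pow(2) / logvar.exp())

def beta_tcvae_penalty(z, mu, logvar, dataset_size,
                       alpha=1.0, beta=6.0, gamma=1.0):
    """z, mu, logvar: [M, D]. Returns alpha*MI + beta*TC + gamma*dim-wise KL."""
    M = z.shape[0]
    log_q = gaussian_log_density(z.unsqueeze(1), mu.unsqueeze(0),
                                 logvar.unsqueeze(0))          # [M, M, D]
    log_qz_n = gaussian_log_density(z, mu, logvar).sum(dim=1)  # log q(z|n)
    log_qz = torch.logsumexp(log_q.sum(dim=2), dim=1) - math.log(M * dataset_size)
    # log prod_j q(z_j): estimate each latent dimension's marginal separately
    log_prod_qzj = (torch.logsumexp(log_q, dim=1)
                    - math.log(M * dataset_size)).sum(dim=1)
    log_pz = gaussian_log_density(z, torch.zeros_like(z),
                                  torch.zeros_like(z)).sum(dim=1)
    mi = (log_qz_n - log_qz).mean()        # index-code mutual information
    tc = (log_qz - log_prod_qzj).mean()    # total correlation
    dwkl = (log_prod_qzj - log_pz).mean()  # dim-wise KL
    return alpha * mi + beta * tc + gamma * dwkl
```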
38. Evaluate Disentanglement: Mutual Information Gap
Suppose we have some ground-truth factors \(\{v_k\}_{k=1}^{K}\). Define the joint distribution

\[
q(z_j, v_k) = \sum_{n=1}^{N} p(v_k)\, p(n|v_k)\, q(z_j|n)
\]

so that

\[
I_n(z_j; v_k) = \mathbb{E}_{q(z_j, v_k)}\Big[\log \sum_n q(z_j|n)\, p(n|v_k)\Big] + S(z_j)
\]
39. Feature Factor and Latent Variable
Figure: Correlation between factors and latent space
40. Evaluate Disentanglement: Mutual Information Gap
Mutual Information Gap:

\[
\text{MIG} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{S(v_k)} \Big( I(z_{j^{(k)}}; v_k) - \max_{j \ne j^{(k)}} I(z_j; v_k) \Big), \qquad j^{(k)} = \arg\max_j I(z_j; v_k)
\]

where \(0 \le I(z_j; v_k) = S(v_k) - S(v_k|z_j) \le S(v_k)\) naturally serves as the normalization condition. Benefits of this metric:
- Axis-alignment
- Compactness of representation
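A sketch (my addition) of the MIG computation, assuming the mutual-information matrix I(z_j; v_k) and the factor entropies S(v_k) have already been estimated; the toy table is mine.

```python
import numpy as np

def mutual_information_gap(mi, factor_entropies):
    """mi: [num_latents, num_factors] matrix of I(z_j; v_k);
    factor_entropies: [num_factors] vector of S(v_k)."""
    sorted_mi = np.sort(mi, axis=0)[::-1]   # sort latents per factor, descending
    gap = sorted_mi[0] - sorted_mi[1]       # top latent minus runner-up
    return np.mean(gap / factor_entropies)  # normalize by S(v_k), average over k

mi = np.array([[0.9, 0.1],
               [0.2, 0.7],
               [0.1, 0.1]])                 # toy I(z_j; v_k) table
print(mutual_information_gap(mi, np.array([1.0, 1.0])))  # 0.65
```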
52. Conclusion
In this paper:
- The regularization term in the ELBO contains various factors which naturally encourage disentangling
- Total correlation (independence of the latent variables) is the major factor forcing the model to learn statistically independent representations
- A new information-theoretic metric (MIG) quantifies disentanglement
53. References
Chen, T. Q., et al. Isolating Sources of Disentanglement in Variational Autoencoders. NIPS 2018.
Tishby, Naftali, et al. The Information Bottleneck Method. arXiv:physics/0004057, 2000.
Hoffman, Matthew D., and Matthew J. Johnson. ELBO Surgery: Yet Another Way to Carve Up the Variational Evidence Lower Bound. NIPS 2016.
Higgins, Irina, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017.
Burgess, Christopher P., et al. Understanding Disentangling in beta-VAE. arXiv:1804.03599, 2018.
Achille, Alessandro, et al. Emergence of Invariance and Disentanglement in Deep Representations. JMLR 2018.
Alemi, Alexander, et al. Fixing a Broken ELBO. ICML 2018.