Toward Disentanglement through Understand ELBO

Toward Disentanglement
through Understanding ELBO (Part I)
kv
Viscovery: Algorithm Team
kelispinor@gmail.com
February 17, 2019
kv (Viscovery) ELBO February 17, 2019 1 / 53

Overview
1 Background Knowledge
Information Quantities
Rate Distortion Theory and Information Bottleneck
Variational Inference
2 Build up Frameworks for Disentangle Representations
3 Isolating Sources of Disentanglement
ELBO Surgery
Evaluate Disentanglement
Experiments
4 Conclusion

Before we start ...
What is disentanglement?
Disentangled Representation = Fatorized + Interpretable
Reuse and generalize knowledge
Extrapolate beyond training data distribution
Questions will be answered in series of discussion
(Part I) Why VAE is the main framework to realize disentanglement?
[Chen, 2018]
(Part II) Why there is a trade-oﬀ between reconstruction and
disentanglement? [Alemi, 2018]
(Part II) Is disentanglement a task or principal? [Achille, 2018]

Background Knowledge

Quick Review
Consider beta decay process, we observe N electrons
↑↑↓↓↑ ... ↑
Number of the possible states of N spins
N!
(pN)!((1 − p)N)!
∼
NN
(pN)pN((1 − p)N)(1−p)N
=
1
ppN(1 − p)(1−p)N
= 2NS
where S is called Shannon Entropy per spin
S = −p log p − (1 − p) log(1 − p)
Number of bits of information one gains in actually observe such state is
NS. [?]

Quick Review
In general case:
N!
(p1N)p1N(p2N)p2N...(pkN)pk N
∼
NN
k
i=1(pi N)pi N
= 2NS
Such that
S = −
i
pi log pi

Quick Review
We have a theory that predicts a probability distribution Q for the ﬁnal
state, however, the correct probability distribution is P, then after
observing N decays, we will see outcome i approximately pi N times.
P =
N
i
qpi N
i
N!
j (pj N)!
We already calculated N!
j (pj N)! ∼ 2−N i pi log pi ,
so
P ∼ 2−N i pi (log pi −log qi )
The quantity, we called, relative entropy or Kullback-Liebler divergence
DKL(p||q) =
i
pi (log pi − log qi )

Quick Review
Quantities:
Entropy: Sx = − x p(x) log p(x), information we don’t know.
Relative Entropy, KL-Divergence:
DKL(p(x)||q(x)) = x p(x) log p(x) − log q(x)
Mutual Information:
I(x; y) = Sx − Sx|y = Sy − Sy|x = Sx,y − Sx|y − Sy|x
I(x; y) = DKL p(x, y)||p(x)p(y)
Symmetric between x and y
Extreme Cases: Independence, Deterministic Relation
Relation:
Chain rule: p(x, y) = p(x|y)p(y)
Bayesian: p(y|x) = p(x|y)p(y)
p(x)

Information Theory

Information System
Sending a signal from Alice to Bob: X → ˜X

Rate-Distortion Theory
What makes good encoding? Low rate, low distortion.
min
p(˜x)|p(x)
I(X; ˜X) s.t d(X, ˜X) < D
Theorem (Rate Distortion, Shannon and Kolmogorov)
Deﬁne the function as the minimum achievable rate under distortion
constraint D.
R(D) = min
p(˜x|x) s.t d(x,˜x)<D
I(X; ˜X)
Then an encoding that achieves this rate is
p(˜x|x) =
p(˜x)
Z(x, β)
e−βd(x,˜x)

Rate-Distortion: RD-Curve
Figure: Trade-oﬀ between transmission rate and distortion

Information System
Sending a signal from Alice to Bob: X → ˜X, where Y the relevant
information about X.

Information Bottleneck Theory
What makes good encoding? Low rate, high relevance.
min
p(˜x)|p(x)
I( ˜X; X) s.t I( ˜X, Y ) > L
Theorem (Information Bottleneck, Tishby, Pereira, and Bialek)
Deﬁne the function as the minimum achievable rate while preserving L bits
of mutual information.
R(L) = min
p(˜x|x) s.t I(˜x;y)≥L
I(X; ˜X)
Then an encoding that achieves this rate is
p(˜x|x) =
p(˜x)
Z(x, β)
e−βDKL[p(y|x)||p(y|˜x)]

Comparison
What makes good encoding?: Low
Rate, Low Distortion
p(˜x|x) =
p(˜x)
Z(x, β)
e−βd(x,˜x)
What makes good code?: Low
Rate, High Relevance
p(˜x|x) =
p(˜x)
Z(x, β)
e−βDKL[p(y|x)||p(y|˜x)]

Structure of the Solution
On the structure of solution
L[p(˜x|x)] = I(X; ˜X) − βI( ˜X; Y )
The Lagrangian multiplier operates as the trade-oﬀ parameter between
complexity of representation and preserved relevant information.
I( ˜X; Y ) is the measure of performance
I(X; ˜X) as the regularization term

Inference

Inference under Posterior
X: observation, input data
Z: latent variable, representation, embedding
p(z|x)
posterior
=
likelihood of z
p(x|z)
prior
p(z)
p(x)
evidence
Due to p(x) is intractable, there are two parallel ways to solve
MCMC

Propose a simpler, tractable distribution q(z) to approximate posterior
p(z|x)
DKL(q(z|x)||p(z|x)) = Eq(z|x) log(
q(z|x)p(x)
p(x, z)
)
= Eq(z|x) log
q(z|x)
p(x, z)
+ log p(x)
Swap the left and right hand-sides, we get
log p(x) = DKL(q(z|x)||p(z|x)) + Eq(z|x) log
p(x, z)
q(z|x)

Reduce KL Divergence to ELBO
Due to the positivity of divergence
log p(x) = DKL(q(z|x)||p(z|x)) + Eq(z|x)[log p(x|z)] − DKL[q(z|x)||p(z)]
Evidence Lower Bound, ELBO (per sample)
log p(x) ≤ Eq(z|x)
p(xn, z)
q(z|xn)
ELBO
LELBO = Eq(z|x) log p(xn|z)
reconstruction
− DKL q(z|xn)||p(z)
regularization

Implement ELBO using VAE
Figure: ELBO Structure in VAE

Implement ELBO using VAE
Figure: Parameterization Trick for Back-propagation

Build up Framework for Disentanglement

Build up β-VAE framework
Suppose the data generation process are aﬀected by two type of factors
p(x|z) ≈ p(x|v, w)
where v is conditionally independent; w is conditionally dependent factor.
Maximization data likelihood of observed data over the whole latent
distribution. Also, the aim of disentanglement is to ensure the inferred
latent capture the generative factors v in an independent manner.
max
θ
Epθ(z)[pθ(x|z)] s.t DKL(q(z|x)||p(z)) <

β-VAE
The objective function of β-VAE[Higgins, 2017] goes to
L = Eq(z|x)[log p(x|z)] − β DKL[q(z|x)||p(z)]
Understanding eﬀect of β
Reconstruction quality is the poor indicator of learnt disentanglement
Good disentanglement often lead to blurry reconstruction
Disentangled representation lacks capability of latent channel

Disentanglement of β-VAE

ELBO Surgery

ELBO Surgery
Conjecture two criteria may be important
MI between data variable and latent variable
Independence of latent variable

ELBO Surgery
To further understand ELBO, we use the average encoding distribution as
the expression. [Hoﬀman, 2017] Identify each training example with
unique index {1, 2, 3, ...N}
Deﬁne q(z|n) = q(z|xn), q(z, n) = p(n)q(z|n) = 1
N q(z|n) where
p(n) = 1
N . The marginal distribution q(z) = Ep(n)[q(z|n)] and
q(z) = n q(z|n)p(n).

TC Decomposition: Sources of Disentanglement in ELBO
Ep(n) DKL[q(z|n)||p(z)] (regularization term)
= Ep(n) Eq(z|n) log q(z|n) − log p(z)
+ log q(z) − log q(z) + log
j
q(zj ) − log
j
q(zj )

TC Decomposition: Sources of Disentanglement in ELBO
Ep(n) DKL[q(z|n)||p(z)] (regularization term)
= Ep(n) Eq(z|n) log q(z|n) − log p(z)
+ log q(z) − log q(z) + log
j
q(zj ) − log
j
q(zj )
= DKL[q(z, n)||q(z)p(n))]
(1) Index-Code MI,Iq(z;n)
+ DKL[q(z)||
j
q(zj )]
(2) Total Correlation
+
j
DKL[q(zj )||p(zj )]
(3) Dim-wise KL
Index-Code MI: Mutual information between data and latent variable
Total Correlation: Generalization of MI between latent variables
Dim-wise KL: Marginal KL for prior distribution
Note that β-VAE penalizes three terms evenly.

ELBO TC-Decomposition
Modiﬁed ELBO =
Reconstruction
Eq(z,n)[log p(n|z)] +
− α Iq(z; n)
Index-Code MI
−β DKL[q(z)||Πj q(zj )]
Total Correlation
−γ
j
DKL[q(zj )||p(zj )]
Dim-wise KL

ELBO TC-Decomposition
Modiﬁed ELBO =
Reconstruction
Eq(z,n)[log p(n|z)] +
− α Iq(z; n)
Index-Code MI
−β DKL[q(z)||Πj q(zj )]
Total Correlation
−γ
j
DKL[q(zj )||p(zj )]
Dim-wise KL
Index-Code MI: DKL[q(z, n)||q(z)p(n))] = Iq(z; n)
Drop the penalty to improve disentanglement
Keep this term to improve disentanglement according to IB
Dataset dependent
Total Correlation: DKL[q(z)||Πj q(zj )]
Heavier penalty on this term induces disentanglement
TC forces model to ﬁnd statistically independent factors
Dim-wise KL: j DKL[q(zj )||p(zj )]
Prevent latent space deviating from corresponding prior

Minibatch Sampling: Stochastic Estimation of log q(z)
The evaluation of density q(z) requires sampling the whole dataset.
Random chosen n will lead to q(z|n) close to zero. Inspired by importance
sampling, for given batch of samples {n1, n2, ...nM}, we can use the
estimator re-utilize the batch
Eq(z|x)[log q(z)] = Eq(z|x) log En ∼p(n )[q(z|n )]
≈
1
M
M
i=1
log
1
MN
M
j=1
q(z(ni )|nj )
where z(ni ) is a sample from q(z|ni ). Treat q(z) as mixture of
distribution, where the data index n indicates the mixture components.

Special case: β-TCVAE
With MBS, it is available to assign diﬀerent weights (α, β, γ) to terms
Lβ−TC = Eq(z|n)p(n)[log p(n|z)]
− αIq(z; n) − β DKL(q(z)||
j
q(zj )) − γ
j
DKL(q(zj )||p(zj ))
Proposed β-TCVAE uses α = γ = 1, and set β as hyper-parameter.

Pseudo Code: VAE
Using Tensorﬂow Probability
latent_prior = make_mixture_prior() # p(z)
approx_posterior = encoder(features) # q(z|x)
# z ~ q(z|x)
approx_posterior_sample = approx_posterior.sample()
# p(x|z)
decoder_likelihood = decoder(approx_posterior_sample)
# log(p(x|z))
rate = decoder_likelihood.log_prob(features)
log_qz_x = approx_posterior.log_prob(approx_posterior_sample
log_pz = latent_prior.log_prob(approx_posterior_sample)
kl_div = log_qz_x - log_pz # D_kl(q(z|x) || p(z))
elbo = tf.reduce_sum(rate - kl_div)

Pseudo Code: β-TCVAE
Lβ−TC = Eq(z|n)p(n)[log p(n|z)]
− αIq(z; n) − β DKL(q(z)||
j
q(zj )) − γ
j
DKL(q(zj )||p(zj ))
log_qz = tf.logsumexp(tf.reduce_sum(log_qz_x, 1), 0)
-tf.log(M * N)
log_qz_factorized =
tf.reduce_sum(tf.logsumexp(log_qz_x)-tf.log(M * N), 1)
Iq = log_qz_x - log_qz
TC = log_qz - log_qz_factorized
Dim_kl = log_qz_factorized - latent_prior
modified_elbo = rate - Iq - TC - Dim_kl

Evaluate Disentanglement: Mutual Information Gap
Suppose we have some groundtruth factor {vk}K
k=1 Deﬁne joint
distribution q(zj , vk) = N
n=1 p(vk)p(n|vk)q(zj |n)
In(zj ; vk) = Eq(zj ,vk ) log q(zj |n)p(n|vk) + S(zj )

Feature Factor and Latent Variable
Figure: Correlation between factors and latent space

Evaluate Disentanglement: Mutual Information Gap
Mutual Information Gap
MIG =
1
K
K
k=1
1
S(vk)
I(zj(k) ; vk) − maxj=j(k) I(zj ; vk)
where j(k)
= arg max
j
I(zj ; vk)
where 0 ≤ I(zj ; vk) = S(vk) − S(vk|zj ) ≤ S(vk) naturally serve as the
normalization condition. Beneﬁts of this metric
Axis-alignment
Compactness of representation

Experiments and Conclusion

Dataset
Figure: Specs of Datasets

Experiments: Performance of β-TCVAE
Figure: Left: fully connected; Right: convolution

Experiments: Performance of β-TCVAE

Experiments: Trade-oﬀ
ELBO-Disentanglement Trade-oﬀ
Figure: DSprites

Experiments: Trade-oﬀ
ELBO-Disentanglement Trade-oﬀ
Figure: 3D Faces

Experiments: TC versus MIG
How Independence relates to Disentanglement?
Figure: DSprites

Experiments: TC versus MIG
How Independence relates to Disentanglement?
Figure: 3D Faces

Extra Experiments
Removing Index-Code MI
Batch-size Effect
make no significant difference.

Results

Conclusion
In this paper
Regularization term in ELBO contains various factors which naturally
encourage disentangling
Total correlation (independence of latent variable) is the major factor
force machine to learn statistically independent reprensentation
New information-theoretic metric quantity

References
Chen, T.Q., and etc,
Isolating Sources of Disentanglement in Variational Autoencoders. NIPS. 2018
Tishby, Naftali, and etc.
The information bottleneck method. physics/0004057 (2000).
Hoﬀman, Matthew D., and Matthew J. Johnson.
Elbo surgery: yet another way to carve up the variational evidence lower bound.
NIPS. 2016.
Higgins and etc.
beta-vae: Learning basic visual concepts with a constrained variational framework.
ICLR 2017
Burgess and ect.
Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599.
Achille, Alessandro, and etc.
Emergence of invariance and disentanglement in deep representations.
Alemi, Alexander, et al.
Fixing a broken ELBO. International Conference on Machine Learning. 2018.

Toward Disentanglement through Understand ELBO

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Toward Disentanglement through Understand ELBO

Similar to Toward Disentanglement through Understand ELBO (20)

More from Kai-Wen Zhao

More from Kai-Wen Zhao (8)

Recently uploaded

Recently uploaded (20)

Toward Disentanglement through Understand ELBO