Toward Disentanglement
through Understanding ELBO (Part I)
kv
Viscovery: Algorithm Team
kelispinor@gmail.com
February 17, 2019
kv (Viscovery) ELBO February 17, 2019 1 / 53
Overview
1 Background Knowledge
Information Quantities
Rate Distortion Theory and Information Bottleneck
Variational Inference
2 Build up Frameworks for Disentangled Representations
3 Isolating Sources of Disentanglement
ELBO Surgery
Evaluate Disentanglement
Experiments
4 Conclusion
kv (Viscovery) ELBO February 17, 2019 2 / 53
Before we start ...
What is disentanglement?
Disentangled Representation = Factorized + Interpretable
Reuse and generalize knowledge
Extrapolate beyond training data distribution
Questions to be answered in this series of discussions:
(Part I) Why is the VAE the main framework for realizing disentanglement? [Chen, 2018]
(Part II) Why is there a trade-off between reconstruction and disentanglement? [Alemi, 2018]
(Part II) Is disentanglement a task or a principle? [Achille, 2018]
kv (Viscovery) ELBO February 17, 2019 3 / 53
Background Knowledge
kv (Viscovery) ELBO February 17, 2019 4 / 53
Quick Review
Consider a beta decay process in which we observe N electron spins
↑↑↓↓↑ ... ↑
The number of possible states of the N spins is
$$\frac{N!}{(pN)!\,\big((1-p)N\big)!} \sim \frac{N^N}{(pN)^{pN}\big((1-p)N\big)^{(1-p)N}} = \frac{1}{p^{pN}(1-p)^{(1-p)N}} = 2^{NS}$$
where S is the Shannon entropy per spin,
$$S = -p \log p - (1-p)\log(1-p)$$
The number of bits of information one gains by actually observing such a state is NS.
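For example, with p = 1/2 each spin carries S = 1 bit, so the $2^N$ equally likely configurations take exactly N bits to specify; with p = 0.9 the entropy drops to $S \approx 0.47$ bits per spin.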
kv (Viscovery) ELBO February 17, 2019 5 / 53
Quick Review
In the general case,
$$\frac{N!}{(p_1 N)!\,(p_2 N)!\cdots(p_k N)!} \sim \frac{N^N}{\prod_{i=1}^{k}(p_i N)^{p_i N}} = 2^{NS}$$
such that
$$S = -\sum_i p_i \log p_i$$
kv (Viscovery) ELBO February 17, 2019 6 / 53
Quick Review
Suppose a theory predicts a probability distribution Q for the final state, but the correct distribution is P. After observing N decays, we see outcome i approximately $p_i N$ times, and the probability of that observation under Q is
$$P = \prod_i q_i^{\,p_i N}\,\frac{N!}{\prod_j (p_j N)!}$$
We already calculated $\frac{N!}{\prod_j (p_j N)!} \sim 2^{-N\sum_i p_i \log p_i}$, so
$$P \sim 2^{-N\sum_i p_i (\log p_i - \log q_i)}$$
The quantity in the exponent is called the relative entropy, or Kullback-Leibler divergence,
$$D_{KL}(p\,\|\,q) = \sum_i p_i (\log p_i - \log q_i)$$
kv (Viscovery) ELBO February 17, 2019 7 / 53
Quick Review
Quantities:
Entropy: $S_x = -\sum_x p(x)\log p(x)$, the information we do not know
Relative entropy (KL divergence): $D_{KL}(p(x)\,\|\,q(x)) = \sum_x p(x)\big(\log p(x) - \log q(x)\big)$
Mutual information:
$I(x; y) = S_x - S_{x|y} = S_y - S_{y|x} = S_{x,y} - S_{x|y} - S_{y|x}$
$I(x; y) = D_{KL}\big(p(x, y)\,\|\,p(x)\,p(y)\big)$
Symmetric between x and y
Extreme cases: independence ($I(x;y) = 0$); deterministic relation ($I(x;y) = S_y$ when y is a function of x)
Relations:
Chain rule: $p(x, y) = p(x|y)\,p(y)$
Bayes' rule: $p(y|x) = \dfrac{p(x|y)\,p(y)}{p(x)}$
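A minimal NumPy sketch (not from the slides; the toy joint distribution is an arbitrary example) of these quantities for discrete distributions:

import numpy as np

def entropy(p):                        # S = -sum_x p(x) log2 p(x)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):                          # D_KL(p || q); assumes q > 0 wherever p > 0
    mask = p > 0
    return np.sum(p[mask] * (np.log2(p[mask]) - np.log2(q[mask])))

p_xy = np.array([[0.3, 0.1],           # toy joint distribution p(x, y)
                 [0.1, 0.5]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
mi = kl(p_xy.ravel(), np.outer(p_x, p_y).ravel())   # I(x;y) = D_KL(p(x,y) || p(x)p(y))
print(entropy(p_x), entropy(p_y), mi)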
kv (Viscovery) ELBO February 17, 2019 8 / 53
Information Theory
kv (Viscovery) ELBO February 17, 2019 9 / 53
Information System
Sending a signal from Alice to Bob: $X \rightarrow \tilde{X}$
kv (Viscovery) ELBO February 17, 2019 10 / 53
Rate-Distortion Theory
What makes a good encoding? Low rate, low distortion.
$$\min_{p(\tilde{x}|x)} I(X; \tilde{X}) \quad \text{s.t.} \quad d(X, \tilde{X}) < D$$
Theorem (Rate Distortion, Shannon and Kolmogorov)
Define R(D) as the minimum achievable rate under distortion constraint D:
$$R(D) = \min_{p(\tilde{x}|x)\,:\,d(x,\tilde{x}) < D} I(X; \tilde{X})$$
Then an encoding that achieves this rate is
$$p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)}\, e^{-\beta\, d(x, \tilde{x})}$$
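The self-consistent form above can be solved numerically. A minimal sketch (assumptions not on the slide: discrete alphabets, Hamming distortion, illustrative names) that iterates the optimal encoder $p(\tilde{x}|x) \propto p(\tilde{x})\,e^{-\beta d(x,\tilde{x})}$ together with its marginal, in the spirit of the Blahut-Arimoto algorithm; larger β buys lower distortion at a higher rate:

import numpy as np

def rate_distortion_encoder(p_x, d, beta, n_iter=200):
    n, m = d.shape                                  # |X|, |X_tilde|
    p_xt = np.full(m, 1.0 / m)                      # marginal p(x_tilde), initialized uniform
    for _ in range(n_iter):
        w = p_xt * np.exp(-beta * d)                # unnormalized p(x_tilde | x), shape (n, m)
        p_xt_x = w / w.sum(axis=1, keepdims=True)   # normalize by Z(x, beta)
        p_xt = p_x @ p_xt_x                         # update the marginal p(x_tilde)
    return p_xt_x, p_xt

p_x = np.array([0.5, 0.3, 0.2])
d = 1.0 - np.eye(3)                                 # Hamming distortion
enc, marg = rate_distortion_encoder(p_x, d, beta=2.0)
rate = np.sum(p_x[:, None] * enc * np.log2(enc / marg))   # I(X; X_tilde) in bits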
kv (Viscovery) ELBO February 17, 2019 11 / 53
Rate-Distortion: RD-Curve
Figure: Trade-off between transmission rate and distortion
kv (Viscovery) ELBO February 17, 2019 12 / 53
Information System
Sending a signal from Alice to Bob: $X \rightarrow \tilde{X}$, where Y carries the relevant information about X.
kv (Viscovery) ELBO February 17, 2019 13 / 53
Information Bottleneck Theory
What makes a good encoding? Low rate, high relevance.
$$\min_{p(\tilde{x}|x)} I(\tilde{X}; X) \quad \text{s.t.} \quad I(\tilde{X}; Y) > L$$
Theorem (Information Bottleneck, Tishby, Pereira, and Bialek)
Define R(L) as the minimum achievable rate while preserving L bits of mutual information:
$$R(L) = \min_{p(\tilde{x}|x)\,:\,I(\tilde{X};Y) \geq L} I(X; \tilde{X})$$
Then an encoding that achieves this rate is
$$p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)}\, e^{-\beta\, D_{KL}[p(y|x)\,\|\,p(y|\tilde{x})]}$$
kv (Viscovery) ELBO February 17, 2019 14 / 53
Comparison
What makes a good encoding? Low rate, low distortion.
$$p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)}\, e^{-\beta\, d(x, \tilde{x})}$$
What makes a good code? Low rate, high relevance.
$$p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)}\, e^{-\beta\, D_{KL}[p(y|x)\,\|\,p(y|\tilde{x})]}$$
kv (Viscovery) ELBO February 17, 2019 15 / 53
Structure of the Solution
On the structure of the solution:
$$\mathcal{L}[p(\tilde{x}|x)] = I(X; \tilde{X}) - \beta\, I(\tilde{X}; Y)$$
The Lagrange multiplier β acts as the trade-off parameter between the complexity of the representation and the amount of preserved relevant information.
$I(\tilde{X}; Y)$ is the measure of performance
$I(X; \tilde{X})$ acts as the regularization term
kv (Viscovery) ELBO February 17, 2019 16 / 53
Inference
kv (Viscovery) ELBO February 17, 2019 17 / 53
Inference under Posterior
X: observation, input data
Z: latent variable, representation, embedding
$$\underbrace{p(z|x)}_{\text{posterior}} = \frac{\overbrace{p(x|z)}^{\text{likelihood}}\;\overbrace{p(z)}^{\text{prior}}}{\underbrace{p(x)}_{\text{evidence}}}$$
Because the evidence p(x) is intractable, there are two parallel ways to solve this:
MCMC
Variational Inference
kv (Viscovery) ELBO February 17, 2019 18 / 53
Variational Inference
Propose a simpler, tractable distribution q(z|x) to approximate the posterior p(z|x):
$$D_{KL}\big(q(z|x)\,\|\,p(z|x)\big) = \mathbb{E}_{q(z|x)}\log\frac{q(z|x)\,p(x)}{p(x, z)} = \mathbb{E}_{q(z|x)}\log\frac{q(z|x)}{p(x, z)} + \log p(x)$$
Rearranging, we get
$$\log p(x) = D_{KL}\big(q(z|x)\,\|\,p(z|x)\big) + \mathbb{E}_{q(z|x)}\log\frac{p(x, z)}{q(z|x)}$$
kv (Viscovery) ELBO February 17, 2019 19 / 53
Reduce KL Divergence to ELBO
Because the KL divergence is non-negative,
$$\log p(x) = D_{KL}\big(q(z|x)\,\|\,p(z|x)\big) + \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\big[q(z|x)\,\|\,p(z)\big]$$
Evidence Lower Bound, ELBO (per sample):
$$\log p(x_n) \geq \underbrace{\mathbb{E}_{q(z|x_n)}\log\frac{p(x_n, z)}{q(z|x_n)}}_{\text{ELBO}}$$
$$\mathcal{L}_{\text{ELBO}} = \underbrace{\mathbb{E}_{q(z|x_n)}\log p(x_n|z)}_{\text{reconstruction}} - \underbrace{D_{KL}\big(q(z|x_n)\,\|\,p(z)\big)}_{\text{regularization}}$$
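For the common choice (assumed here; the slide does not fix the distributional form) of a diagonal-Gaussian encoder $q(z|x_n) = \mathcal{N}(\mu_n, \operatorname{diag}(\sigma_n^2))$ and standard normal prior $p(z) = \mathcal{N}(0, I)$, the regularization term has the closed form
$$D_{KL}\big(q(z|x_n)\,\|\,p(z)\big) = \frac{1}{2}\sum_j \big(\mu_{n,j}^2 + \sigma_{n,j}^2 - \log \sigma_{n,j}^2 - 1\big)$$
so only the reconstruction term needs a Monte Carlo estimate.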
kv (Viscovery) ELBO February 17, 2019 20 / 53
Implement ELBO using VAE
Figure: ELBO Structure in VAE
kv (Viscovery) ELBO February 17, 2019 21 / 53
Implement ELBO using VAE
Figure: Reparameterization Trick for Back-propagation
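A minimal sketch of the trick (assuming a hypothetical encoder that outputs the mean and log-variance of a diagonal-Gaussian q(z|x), and a hypothetical decoder): the sample is a deterministic function of the encoder outputs plus independent noise, so gradients flow back to μ and σ.

import tensorflow as tf

mu, log_var = encoder(x)                     # parameters of q(z|x)
eps = tf.random.normal(tf.shape(mu))         # eps ~ N(0, I), independent of the parameters
z = mu + tf.exp(0.5 * log_var) * eps         # z ~ N(mu, sigma^2); differentiable w.r.t. mu, log_var
x_recon = decoder(z)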
kv (Viscovery) ELBO February 17, 2019 22 / 53
Build up Framework for Disentanglement
kv (Viscovery) ELBO February 17, 2019 23 / 53
Build up β-VAE framework
Suppose the data generation process is affected by two types of factors,
$$p(x|z) \approx p(x|v, w)$$
where v are conditionally independent factors and w are conditionally dependent factors.
We maximize the likelihood of the observed data over the whole latent distribution. The aim of disentanglement is to ensure that the inferred latents capture the generative factors v in an independent manner:
$$\max_\theta\; \mathbb{E}_{p_\theta(z)}[p_\theta(x|z)] \quad \text{s.t.} \quad D_{KL}\big(q(z|x)\,\|\,p(z)\big) < \epsilon$$
kv (Viscovery) ELBO February 17, 2019 24 / 53
β-VAE
The objective function of β-VAE [Higgins, 2017] is
$$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta\, D_{KL}\big[q(z|x)\,\|\,p(z)\big]$$
Understanding the effect of β:
Reconstruction quality is a poor indicator of learnt disentanglement
Good disentanglement often leads to blurry reconstruction
Disentangling limits the capacity of the latent channels
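A minimal sketch of the objective (names are illustrative; recon_loglik and kl_div are the per-sample reconstruction log-likelihood and KL term, computed as in the VAE pseudo-code later in the deck):

beta = 4.0                                       # beta > 1 trades reconstruction for disentanglement
beta_vae_objective = tf.reduce_mean(recon_loglik - beta * kl_div)
loss = -beta_vae_objective                       # maximize the objective by minimizing its negative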
kv (Viscovery) ELBO February 17, 2019 25 / 53
Disentanglement of β-VAE
kv (Viscovery) ELBO February 17, 2019 26 / 53
ELBO Surgery
kv (Viscovery) ELBO February 17, 2019 27 / 53
ELBO Surgery
Conjecture: two criteria may be important
Mutual information between the data variable and the latent variable
Independence of the latent variables
kv (Viscovery) ELBO February 17, 2019 28 / 53
ELBO Surgery
To further understand the ELBO, we rewrite it in terms of the average (aggregate) encoding distribution [Hoffman, 2017]. Identify each training example with a unique index $n \in \{1, 2, \dots, N\}$ and define
$$q(z|n) = q(z|x_n), \qquad q(z, n) = p(n)\,q(z|n) = \tfrac{1}{N}\,q(z|n), \qquad p(n) = \tfrac{1}{N}$$
The marginal (aggregate posterior) is $q(z) = \mathbb{E}_{p(n)}[q(z|n)] = \sum_n q(z|n)\,p(n)$.
kv (Viscovery) ELBO February 17, 2019 29 / 53
TC Decomposition: Sources of Disentanglement in ELBO
$$\mathbb{E}_{p(n)}\, D_{KL}\big[q(z|n)\,\|\,p(z)\big] \quad \text{(regularization term)}$$
$$= \mathbb{E}_{p(n)}\, \mathbb{E}_{q(z|n)}\Big[\log q(z|n) - \log p(z) + \log q(z) - \log q(z) + \log \prod_j q(z_j) - \log \prod_j q(z_j)\Big]$$
kv (Viscovery) ELBO February 17, 2019 30 / 53
TC Decomposition: Sources of Disentanglement in ELBO
$$\mathbb{E}_{p(n)}\, D_{KL}\big[q(z|n)\,\|\,p(z)\big] \quad \text{(regularization term)}$$
$$= \mathbb{E}_{p(n)}\, \mathbb{E}_{q(z|n)}\Big[\log q(z|n) - \log p(z) + \log q(z) - \log q(z) + \log \prod_j q(z_j) - \log \prod_j q(z_j)\Big]$$
$$= \underbrace{D_{KL}\big[q(z, n)\,\|\,q(z)\,p(n)\big]}_{\text{(1) Index-Code MI, } I_q(z;n)} \;+\; \underbrace{D_{KL}\Big[q(z)\,\Big\|\,\prod_j q(z_j)\Big]}_{\text{(2) Total Correlation}} \;+\; \underbrace{\sum_j D_{KL}\big[q(z_j)\,\|\,p(z_j)\big]}_{\text{(3) Dimension-wise KL}}$$
Index-Code MI: mutual information between the data index and the latent variable
Total Correlation: a generalization of mutual information among the latent variables
Dimension-wise KL: marginal KL of each latent dimension to its prior
Note that β-VAE penalizes all three terms evenly.
kv (Viscovery) ELBO February 17, 2019 31 / 53
ELBO TC-Decomposition
Modified ELBO:
$$\underbrace{\mathbb{E}_{q(z,n)}[\log p(n|z)]}_{\text{Reconstruction}} \;-\; \alpha\,\underbrace{I_q(z; n)}_{\text{Index-Code MI}} \;-\; \beta\,\underbrace{D_{KL}\Big[q(z)\,\Big\|\,\prod_j q(z_j)\Big]}_{\text{Total Correlation}} \;-\; \gamma\,\underbrace{\sum_j D_{KL}\big[q(z_j)\,\|\,p(z_j)\big]}_{\text{Dimension-wise KL}}$$
kv (Viscovery) ELBO February 17, 2019 32 / 53
ELBO TC-Decomposition
Modified ELBO:
$$\underbrace{\mathbb{E}_{q(z,n)}[\log p(n|z)]}_{\text{Reconstruction}} \;-\; \alpha\,\underbrace{I_q(z; n)}_{\text{Index-Code MI}} \;-\; \beta\,\underbrace{D_{KL}\Big[q(z)\,\Big\|\,\prod_j q(z_j)\Big]}_{\text{Total Correlation}} \;-\; \gamma\,\underbrace{\sum_j D_{KL}\big[q(z_j)\,\|\,p(z_j)\big]}_{\text{Dimension-wise KL}}$$
Index-Code MI: $D_{KL}\big[q(z, n)\,\|\,q(z)\,p(n)\big] = I_q(z; n)$
  Dropping this penalty can improve disentanglement
  Keeping this term can also improve disentanglement, according to the IB view
  The effect is dataset dependent
Total Correlation: $D_{KL}\big[q(z)\,\|\,\prod_j q(z_j)\big]$
  A heavier penalty on this term induces disentanglement
  TC forces the model to find statistically independent factors
Dimension-wise KL: $\sum_j D_{KL}\big[q(z_j)\,\|\,p(z_j)\big]$
  Prevents the latent dimensions from deviating from their priors
kv (Viscovery) ELBO February 17, 2019 33 / 53
Minibatch Sampling: Stochastic Estimation of log q(z)
Evaluating the density q(z) exactly requires a pass over the whole dataset, and a randomly chosen n typically gives q(z|n) close to zero. Inspired by importance sampling, for a given minibatch of samples $\{n_1, n_2, \dots, n_M\}$ we can use an estimator that re-uses the batch:
$$\mathbb{E}_{q(z)}[\log q(z)] = \mathbb{E}_{q(z)}\Big[\log \mathbb{E}_{n' \sim p(n')}\big[q(z|n')\big]\Big] \approx \frac{1}{M}\sum_{i=1}^{M} \log\Big[\frac{1}{MN}\sum_{j=1}^{M} q\big(z(n_i)\,\big|\,n_j\big)\Big]$$
where $z(n_i)$ is a sample from $q(z|n_i)$. This treats q(z) as a mixture distribution in which the data index n indicates the mixture component.
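A minimal sketch (shapes and names are illustrative) of the estimator: given the $M \times M$ matrix of log densities $\log q(z(n_i)\,|\,n_j)$, the inner sum becomes a logsumexp over j.

import tensorflow as tf

# log_qz_matrix[i, j] = log q(z(n_i) | n_j), shape [M, M]; M = batch size, N = dataset size
log_qz = tf.reduce_logsumexp(log_qz_matrix, axis=1) - tf.math.log(float(M * N))
estimate = tf.reduce_mean(log_qz)                # ~ E_q(z)[log q(z)]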
kv (Viscovery) ELBO February 17, 2019 34 / 53
Special case: β-TCVAE
With minibatch sampling, different weights (α, β, γ) can be assigned to the terms:
$$\mathcal{L}_{\beta\text{-TC}} = \mathbb{E}_{q(z|n)p(n)}[\log p(n|z)] - \alpha\, I_q(z; n) - \beta\, D_{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big) - \gamma\, \sum_j D_{KL}\big(q(z_j)\,\|\,p(z_j)\big)$$
The proposed β-TCVAE uses α = γ = 1 and treats β as the hyper-parameter.
kv (Viscovery) ELBO February 17, 2019 35 / 53
Pseudo Code: VAE
Using TensorFlow Probability (sketch; encoder and decoder are assumed to return tfp distributions)
latent_prior = make_mixture_prior()                     # p(z)
approx_posterior = encoder(features)                    # q(z|x)
approx_posterior_sample = approx_posterior.sample()     # z ~ q(z|x)
decoder_likelihood = decoder(approx_posterior_sample)   # p(x|z)
recon_loglik = decoder_likelihood.log_prob(features)    # log p(x|z)
log_qz_x = approx_posterior.log_prob(approx_posterior_sample)
log_pz = latent_prior.log_prob(approx_posterior_sample)
kl_div = log_qz_x - log_pz            # single-sample estimate of D_KL(q(z|x) || p(z))
elbo = tf.reduce_sum(recon_loglik - kl_div)
kv (Viscovery) ELBO February 17, 2019 36 / 53
Pseudo Code: β-TCVAE
$$\mathcal{L}_{\beta\text{-TC}} = \mathbb{E}_{q(z|n)p(n)}[\log p(n|z)] - \alpha\, I_q(z; n) - \beta\, D_{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big) - \gamma\, \sum_j D_{KL}\big(q(z_j)\,\|\,p(z_j)\big)$$
# Sketch. log_qz_matrix[i, j, d] = log q(z_d(n_i) | n_j), shape [M, M, D],
# built by evaluating every posterior in the batch at every sample's z.
log_qz = tf.reduce_logsumexp(tf.reduce_sum(log_qz_matrix, axis=2), axis=1) - tf.math.log(float(M * N))
log_qz_factorized = tf.reduce_sum(
    tf.reduce_logsumexp(log_qz_matrix, axis=1) - tf.math.log(float(M * N)), axis=1)
Iq = log_qz_x - log_qz                    # Index-Code MI term: log q(z|n) - log q(z)
TC = log_qz - log_qz_factorized           # Total Correlation term: log q(z) - log prod_j q(z_j)
Dim_kl = log_qz_factorized - log_pz       # Dimension-wise KL term: log prod_j q(z_j) - log p(z)
modified_elbo = tf.reduce_mean(recon_loglik - alpha * Iq - beta * TC - gamma * Dim_kl)   # alpha = gamma = 1 in beta-TCVAE
kv (Viscovery) ELBO February 17, 2019 37 / 53
Evaluate Disentanglement: Mutual Information Gap
Suppose we have ground-truth factors $\{v_k\}_{k=1}^{K}$. Define the joint distribution
$$q(z_j, v_k) = \sum_{n=1}^{N} p(v_k)\,p(n|v_k)\,q(z_j|n)$$
$$I_n(z_j; v_k) = \mathbb{E}_{q(z_j, v_k)}\Big[\log \sum_n q(z_j|n)\,p(n|v_k)\Big] + S(z_j)$$
kv (Viscovery) ELBO February 17, 2019 38 / 53
Feature Factor and Latent Variable
Figure: Correlation between factors and latent space
kv (Viscovery) ELBO February 17, 2019 39 / 53
Evaluate Disentanglement: Mutual Information Gap
Mutual Information Gap:
$$\mathrm{MIG} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{S(v_k)}\Big[I\big(z_{j^{(k)}}; v_k\big) - \max_{j \neq j^{(k)}} I(z_j; v_k)\Big], \qquad j^{(k)} = \arg\max_j I(z_j; v_k)$$
where $0 \leq I(z_j; v_k) = S(v_k) - S(v_k|z_j) \leq S(v_k)$ naturally serves as the normalization. Benefits of this metric:
Axis-alignment
Compactness of representation
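A minimal sketch (names are illustrative) of computing MIG from a precomputed matrix mi[j, k] = I(z_j; v_k) and factor entropies H_v[k] = S(v_k):

import numpy as np

def mutual_information_gap(mi, H_v):
    sorted_mi = np.sort(mi, axis=0)[::-1]       # per factor, sort I(z_j; v_k) in descending order
    gap = sorted_mi[0] - sorted_mi[1]           # top latent minus runner-up, for each factor
    return np.mean(gap / H_v)                   # normalize by S(v_k) and average over factors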
kv (Viscovery) ELBO February 17, 2019 40 / 53
Experiments and Conclusion
kv (Viscovery) ELBO February 17, 2019 41 / 53
Dataset
Figure: Specs of Datasets
kv (Viscovery) ELBO February 17, 2019 42 / 53
Experiments: Performance of β-TCVAE
Figure: Left: fully connected; Right: convolution
kv (Viscovery) ELBO February 17, 2019 43 / 53
Experiments: Performance of β-TCVAE
kv (Viscovery) ELBO February 17, 2019 44 / 53
Experiments: Trade-off
ELBO-Disentanglement Trade-off
Figure: DSprites
kv (Viscovery) ELBO February 17, 2019 45 / 53
Experiments: Trade-off
ELBO-Disentanglement Trade-off
Figure: 3D Faces
kv (Viscovery) ELBO February 17, 2019 46 / 53
Experiments: TC versus MIG
How does independence relate to disentanglement?
Figure: DSprites
kv (Viscovery) ELBO February 17, 2019 47 / 53
Experiments: TC versus MIG
How does independence relate to disentanglement?
Figure: 3D Faces
kv (Viscovery) ELBO February 17, 2019 48 / 53
Extra Experiments
Removing the Index-Code MI term
Varying the batch size
Both make no significant difference.
kv (Viscovery) ELBO February 17, 2019 49 / 53
Results
kv (Viscovery) ELBO February 17, 2019 50 / 53
Results
kv (Viscovery) ELBO February 17, 2019 51 / 53
Conclusion
In this paper:
The regularization term in the ELBO contains several factors that naturally encourage disentangling
Total correlation (independence of the latent variables) is the major factor forcing the model to learn statistically independent representations
A new information-theoretic metric (MIG) quantifies disentanglement
kv (Viscovery) ELBO February 17, 2019 52 / 53
References
Chen, T. Q., et al. Isolating Sources of Disentanglement in Variational Autoencoders. NIPS 2018.
Tishby, Naftali, et al. The Information Bottleneck Method. arXiv:physics/0004057, 2000.
Hoffman, Matthew D., and Matthew J. Johnson. ELBO Surgery: Yet Another Way to Carve Up the Variational Evidence Lower Bound. NIPS 2016.
Higgins, Irina, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017.
Burgess, Christopher P., et al. Understanding Disentangling in beta-VAE. arXiv:1804.03599.
Achille, Alessandro, et al. Emergence of Invariance and Disentanglement in Deep Representations.
Alemi, Alexander, et al. Fixing a Broken ELBO. ICML 2018.
kv (Viscovery) ELBO February 17, 2019 53 / 53
