Slides presented in the All Japan Computer Vision Study Group on May 15, 2022. Methods for disentangling the relationship between multimodal data are discussed.
VAEs for multimodal disentanglement
1. AGREEMENT
• If you plan to share these slides or to use the content in these slides for your own work,
please include the following reference:
• If you publish these slides or use them for your own work, please include the reference below:
Tejero-de-Pablos A. (2022) “VAEs for multimodal disentanglement”. All Japan Computer Vision Study Group.
7. What is a VAE?
• Auto-encoder
• Variational auto-encoder
With the proper regularization (a KL term that keeps the approximate posterior q(z|x) close to the prior p(z)), the latent space becomes smooth and can be sampled from to generate new data.
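As a minimal illustration (a sketch, not the implementation used in any of the papers discussed later), the VAE objective is a reconstruction term plus this KL regularizer; the encoder/decoder sizes below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: Gaussian encoder q(z|x), standard-normal prior p(z)."""
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy_with_logits(x_hat, x, reduction="sum")  # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())           # pull q(z|x) toward N(0, I)
    return recon + kl

model = VAE()
x = torch.rand(32, 784)                      # toy batch in [0, 1]
x_hat, mu, logvar = model(x)
loss = vae_loss(x_hat, x, mu, logvar)
```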
8. There is more!
• Vector Quantized-VAE
Quantize the bottleneck using a discrete codebook
There are a number of algorithms (like transformers) that are designed to work on discrete data, so we would like a discrete representation of the data for these algorithms to use.
Advantages of VQ-VAE:
- Simplified latent space (easier to train)
- Likelihood-based model: does not suffer from the problems of mode collapse and lack of diversity
- Real-world data favors a discrete representation (the number of images that make sense is, in a way, finite)
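A minimal sketch of the quantization step described above (a simplified, assumed form; the codebook size and commitment weight beta are illustrative): the encoder output is snapped to its nearest codebook entry, and a straight-through estimator lets gradients bypass the discrete lookup.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor lookup into a discrete codebook (VQ-VAE bottleneck)."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z_e):                               # z_e: (batch, code_dim) encoder output
        d = torch.cdist(z_e, self.codebook.weight)        # distances to every codebook entry
        idx = d.argmin(dim=1)                             # discrete code indices
        z_q = self.codebook(idx)                          # quantized vectors
        # Codebook loss (move codes toward encodings) + commitment loss (and vice versa)
        loss = ((z_q - z_e.detach()) ** 2).mean() + self.beta * ((z_e - z_q.detach()) ** 2).mean()
        # Straight-through estimator: gradients flow to z_e as if quantization were the identity
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, loss

vq = VectorQuantizer()
z_q, codes, vq_loss = vq(torch.randn(8, 64))
```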
9. Why are VAEs cool?
• Usage of VAEs (state-of-the-art)
Multimodal generation (DALL-E)
Representation learning, latent space disentanglement
11. Today I’m introducing:
1) Shi, Y., Paige, B., & Torr, P. (2019). Variational mixture-of-experts autoencoders for multi-modal
deep generative models. Advances in Neural Information Processing Systems, 32.
2) Lee, M., & Pavlovic, V. (2021). Private-shared disentangled multimodal VAE for learning of
latent representations. Conference on Computer Vision and Pattern Recognition (pp. 1692-
1700).
3) Joy, T., Shi, Y., Torr, P. H., Rainforth, T., Schmon, S. M., & Siddharth, N. (2022). Learning
Multimodal VAEs through Mutual Supervision. International Conference on Learning
Representations.
12. Motivation and goal
• Importance of multimodal data
Learning in the real world involves multiple perspectives: visual, auditory, linguistic
Understanding them individually allows only a partial learning of concepts
• Understanding how different modalities work together is not trivial
A similar joint-embedding process happens in the brain for reasoning and understanding
• Multimodal VAEs facilitate representation learning on data with multiple views/modalities
Capture common underlying factors between the modalities
13. Motivation and goal
• Normally, only the shared aspects of modalities are modeled
The private information of each modality is totally LOST
E.g., image captioning
• Leverage VAE’s latent space for disentanglement
Private spaces are leveraged for modeling the disjoint properties of each
modality, and cross-modal generation
• Basically, such disentanglement can be used as:
An analytical tool to understand how modalities intertwine
A way of cross-generating modalities
14. Motivation and goal
• [1] and [2] propose a similar methodology
According to [1], a true multimodal generative model should meet four criteria:
Today I will introduce [2] (the most recent), and briefly explain the differences with respect to [3]
15. Dataset
• Digit images: MNIST & SVHN
- Shared features: Digit class
- Private features: Number style, background, etc.
Image domains as different modalities?
• Flower images and text description: Oxford-102 Flowers
- Shared features: Words and image features present in both modalities
- Private features: Words and image features exclusive to their own modality
16. Related work
• Multimodal generation and joint multimodal VAEs (e.g., JMVAE, MVAE)
The learning of a common disentangled embedding (i.e., private-shared) is often ignored
Only some works in image-to-image translation separate "content" (~shared) and "style" (~private) in the latent space (e.g., via an adversarial loss)
These apply exclusively between image modalities, and are not suitable for heterogeneous modalities such as image and text
• Domain adaptation
Learning joint embeddings of multimodal observations
17. Proposed method: DMVAE
• Generative variational model: Introducing separate shared and private spaces
Usage: Cross-generation (analytical tool)
• Representations are induced using a separate encoder-decoder pair per modality
• Consistency of representations via Product of Experts (PoE). For a number of modalities N:
q(z_s \mid x_1, x_2, \dots, x_N) \propto p(z_s) \prod_{n=1}^{N} q(z_s \mid x_n)
In VAE, inference networks and priors assume conditional Gaussian forms:
p(z) = \mathcal{N}(z \mid 0, I), \qquad q(z \mid x_n) = \mathcal{N}(z \mid \mu_n, C_n)
Each modality's posterior sample splits into a private and a shared part:
z_1 \sim q_{\phi_1}(z \mid x_1), \quad z_2 \sim q_{\phi_2}(z \mid x_2), \qquad z_1 = (z_{p_1}, z_{s_1}), \quad z_2 = (z_{p_2}, z_{s_2})
We want z_s = z_{s_1} = z_{s_2} → Product of Experts (PoE)
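For Gaussian experts this product has a closed form: precisions add and means are precision-weighted. A small sketch under that assumption (diagonal covariances; shapes are illustrative):

```python
import torch

def gaussian_poe(mus, logvars):
    """q(z_s | x_1..x_N) ∝ p(z_s) ∏_n q(z_s | x_n) for diagonal Gaussians,
    with a standard-normal prior expert N(0, I) included."""
    mu = torch.stack([torch.zeros_like(mus[0])] + list(mus))                       # prior + experts
    var = torch.stack([torch.ones_like(mus[0])] + [lv.exp() for lv in logvars])
    precision = 1.0 / var
    var_s = 1.0 / precision.sum(dim=0)                 # fused variance
    mu_s = var_s * (precision * mu).sum(dim=0)         # precision-weighted mean
    return mu_s, var_s.log()

# e.g., fusing the shared posteriors of two modalities (batch of 4, z_dim 10)
mu_s, logvar_s = gaussian_poe([torch.randn(4, 10), torch.randn(4, 10)],
                              [torch.randn(4, 10), torch.randn(4, 10)])
```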
18. Proposed method: DMVAE
• Reconstruction inference
PoE-induced shared inference allows for inference when one or more modalities are missing
Thus, we consider three reconstruction tasks:
- Reconstruct both modalities at the same time: x_1, x_2 → x̂_1, x̂_2 using (z_{p_1}, z_{p_2}, z_s)
- Reconstruct a single modality from its own input: x_1 → x̂_1 using (z_{p_1}, z_s), or x_2 → x̂_2 using (z_{p_2}, z_s)
- Reconstruct a single modality from the opposite modality's input: x_2 → x̂_1 using (z_{p_1}, z_s), or x_1 → x̂_2 using (z_{p_2}, z_s)
• Loss function
Accuracy of reconstruction for jointly learned shared latent + KL-divergence of each normal distribution
Accuracy of cross-modal and self reconstruction + KL-divergence
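Putting the pieces together, here is a toy two-modality sketch of these terms (linear encoders/decoders as stand-ins for DMVAE's networks, with an inline Gaussian PoE; an assumed simplification, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kl_std_normal(mu, logvar):
    # KL( N(mu, exp(logvar)) || N(0, I) )
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def sample(mu, logvar):
    # Reparameterized sample
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

# Toy per-modality encoders/decoders (illustrative stand-ins only)
x_dim, p_dim, s_dim = 20, 4, 4
enc1, dec1 = nn.Linear(x_dim, 2 * (p_dim + s_dim)), nn.Linear(p_dim + s_dim, x_dim)
enc2, dec2 = nn.Linear(x_dim, 2 * (p_dim + s_dim)), nn.Linear(p_dim + s_dim, x_dim)

def split(h):
    # Encoder output -> (private mu, private logvar, shared mu, shared logvar)
    mu, logvar = h.chunk(2, dim=-1)
    return mu[:, :p_dim], logvar[:, :p_dim], mu[:, p_dim:], logvar[:, p_dim:]

x1, x2 = torch.randn(8, x_dim), torch.randn(8, x_dim)
mu_p1, lv_p1, mu_s1, lv_s1 = split(enc1(x1))
mu_p2, lv_p2, mu_s2, lv_s2 = split(enc2(x2))

# Gaussian PoE over the shared posteriors, with a N(0, I) prior expert
prec = 1.0 / lv_s1.exp() + 1.0 / lv_s2.exp() + 1.0
lv_s = (1.0 / prec).log()
mu_s = (1.0 / prec) * (mu_s1 / lv_s1.exp() + mu_s2 / lv_s2.exp())

z_p1, z_p2 = sample(mu_p1, lv_p1), sample(mu_p2, lv_p2)
z_s, z_s1, z_s2 = sample(mu_s, lv_s), sample(mu_s1, lv_s1), sample(mu_s2, lv_s2)

def rec(dec, z_p, z_sh, x):
    return F.mse_loss(dec(torch.cat([z_p, z_sh], dim=-1)), x, reduction="sum")

loss = (
    rec(dec1, z_p1, z_s, x1) + rec(dec2, z_p2, z_s, x2)      # joint reconstruction (PoE shared code)
    + rec(dec1, z_p1, z_s1, x1) + rec(dec2, z_p2, z_s2, x2)  # self reconstruction
    + rec(dec1, z_p1, z_s2, x1) + rec(dec2, z_p2, z_s1, x2)  # cross reconstruction
    + kl_std_normal(mu_p1, lv_p1) + kl_std_normal(mu_p2, lv_p2) + kl_std_normal(mu_s, lv_s)
)
loss.backward()
```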
19. Experiments: Digits (image-image)
• Evaluation
Qualitative: Cross-generation between modalities
Quantitative: Accuracy of the cross-generated images using a pre-trained classifier for each modality
- Joint: A sample from zs generates two image modalities that must be assigned the same class
[Figure: cross-generation examples. For an input from one modality, outputs are generated with different samples of z_{p_2}; for an input from the other modality, outputs are generated with different samples of z_{p_1}.]
21. Experiments: Flowers (image-text)
• This task is more complex
Instead of the raw image and text, intermediate feature representations are reconstructed
• Quantitative evaluation
Class recognition (image-to-text) and cosine-similarity retrieval (text-to-image) on the shared latent space
• Qualitative evaluation
Retrieval
22. Conclusions
• Multimodal VAE for disentangling private and shared spaces
Improve the representational performance of multimodal VAEs
Successful application to image-image and image-text modalities
• Shaping a latent space into subspaces that capture the private-shared aspects of the
modalities
“is important from the perspective of downstream tasks, where better decomposed representations are more
amenable for using on a wider variety of tasks”
23. [3] Multimodal VAEs via mutual supervision
• Main differences with [1] and [2]
A type of multimodal VAE, without private-shared disentanglement
Does not rely on factorizations such as MoE or PoE for modeling modality-shared information
Instead, it repurposes semi-supervised VAEs for combining inter-modality information
- Allows learning from partially-observed modalities (the regularizer is a KL divergence)
• Proposed method: Mutually supErvised Multimodal vaE (MEME)
24. [3] Multimodal VAEs via mutual supervision
• Qualitative evaluation
Cross-modal generation
• Quantitative evaluation
Coherence: Percentage of matching predictions of the cross-generated modality using a pretrained classifier
Relatedness: Wasserstein distance between the representations of two modalities (closer if they belong to the same class)
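The coherence metric reduces to a classification accuracy over cross-generated samples; a tiny sketch with placeholder names (classifier, cross_generated, labels are assumptions, not from the paper):

```python
import torch

@torch.no_grad()
def coherence(classifier, cross_generated, labels):
    """Percentage of cross-generated samples whose predicted class (from a
    pretrained classifier) matches the class of the source input."""
    preds = classifier(cross_generated).argmax(dim=1)
    return 100.0 * (preds == labels).float().mean().item()
```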
26. Final remarks
• VAEs are useful not only for generation but also for reconstruction and disentanglement tasks
Recommended textbook: “An Introduction to Variational Autoencoders”, Kingma & Welling
• Private-shared latent spaces as an effective tool for analyzing multimodal data
• There is still a lot of potential for this research
So far it has been applied only to a limited number of multimodal problems
• PhD students interested in this topic → we are recruiting interns
https://www.cyberagent.co.jp/news/detail/id=27453
Other topics are also fine!
Collaborative research is also very welcome!