
VAEs for multimodal disentanglement


Slides presented at the All Japan Computer Vision Study Group on May 15, 2022. Methods for disentangling the relationship between multimodal data are discussed.



1. AGREEMENT • If you plan to share these slides or to use the content in these slides for your own work, please include the following reference: • When publishing or personally using these slides, please cite them as follows: Tejero-de-Pablos A. (2022) “VAEs for multimodal disentanglement”. All Japan Computer Vision Study Group.
2. VAEs for multimodal disentanglement 2022/05/15 Antonio TEJERO DE PABLOS antonio_tejero@cyberagent.co.jp
3. 1. Self-introduction 2. Background 3. Paper introduction 4. Final remarks
4. Self-introduction
5. Antonio TEJERO DE PABLOS Background • Present: Research scientist @ CyberAgent (AI Lab) • ~2021: Researcher @ U-Tokyo (Harada Lab) & RIKEN (AIP) • ~2017: PhD @ NAIST (Yokoya Lab) Research interests • Learning of multimodal data (RGB, depth, audio, text) • and its applications (action recognition, advertisement classification, etc.) Field: Computer Vision
6. Background
7. What is a VAE? • Auto-encoder • Variational auto-encoder With the proper regularization (a KL term pulling the encoder's posterior towards the prior), the latent space becomes smooth and can be sampled for generation.
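The following is a minimal, self-contained PyTorch sketch of this idea, not the implementation behind the slides: a Gaussian encoder, the reparameterization trick, and an ELBO-style loss whose KL term is the regularization mentioned above. All names (`VAE`, `x_dim`, `vae_loss`, ...) are illustrative.

```python
# Minimal VAE sketch (illustrative): Gaussian encoder q(z|x) = N(mu, sigma^2),
# reparameterization trick, and loss = reconstruction + KL regularization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return recon + kl
```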
8. There is more! • Vector Quantized-VAE: quantize the bottleneck using a discrete codebook A number of algorithms (like transformers) are designed to work on discrete data, so we would like a discrete representation of the data for these algorithms to use. Advantages of VQ-VAE: - Simplified latent space (easier to train) - Likelihood-based model: does not suffer from mode collapse or lack of diversity - Real-world data favors a discrete representation (the number of images that make sense is, in a way, finite)
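As a concrete illustration of the quantized bottleneck, here is a hedged PyTorch sketch of a VQ-VAE-style vector quantizer (nearest-codebook lookup, codebook/commitment losses, straight-through gradients). Class and parameter names are placeholders, not the original VQ-VAE code.

```python
# Sketch of a VQ-VAE bottleneck: each encoder output is snapped to its nearest
# codebook vector; the straight-through estimator copies gradients from the
# quantized code back to the encoder.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e):                              # z_e: (batch, code_dim)
        dist = torch.cdist(z_e, self.codebook.weight)    # distances to all codes
        idx = dist.argmin(dim=1)                         # nearest code per sample
        z_q = self.codebook(idx)                         # quantized (discrete) latent
        # codebook + commitment losses pull codes and encoder outputs together
        vq_loss = ((z_q - z_e.detach()) ** 2).mean() \
                  + self.beta * ((z_e - z_q.detach()) ** 2).mean()
        z_q = z_e + (z_q - z_e).detach()                 # straight-through gradient
        return z_q, idx, vq_loss
```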
9. Why are VAEs cool? • Usage of VAEs (state of the art) Multimodal generation (DALL-E) Representation learning, latent space disentanglement
10. Paper introduction An interesting usage of VAEs for disentangling multimodal data that grabbed my attention
11. Today I’m introducing: 1) Shi, Y., Paige, B., & Torr, P. (2019). Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in Neural Information Processing Systems, 32. 2) Lee, M., & Pavlovic, V. (2021). Private-shared disentangled multimodal VAE for learning of latent representations. Conference on Computer Vision and Pattern Recognition (pp. 1692-1700). 3) Joy, T., Shi, Y., Torr, P. H., Rainforth, T., Schmon, S. M., & Siddharth, N. (2022). Learning Multimodal VAEs through Mutual Supervision. International Conference on Learning Representations.
12. Motivation and goal • Importance of multimodal data Learning in the real world involves multiple perspectives: visual, auditory, linguistic Understanding them individually allows only a partial learning of concepts • Understanding how different modalities work together is not trivial A similar joint-embedding process happens in the brain for reasoning and understanding • Multimodal VAEs facilitate representation learning on data with multiple views/modalities Capture common underlying factors between the modalities
13. Motivation and goal • Normally, only the shared aspects of the modalities are modeled The private information of each modality is totally LOST E.g., image captioning • Leverage the VAE’s latent space for disentanglement Private spaces are leveraged for modeling the disjoint properties of each modality, and for cross-modal generation • Basically, such disentanglement can be used as: An analytical tool to understand how modalities intertwine A way of cross-generating modalities
14. Motivation and goal • [1] and [2] propose a similar methodology According to [1], a true multimodal generative model should meet four criteria Today I will introduce [2] (the more recent of the two), and briefly explain the differences with [3]
15. Dataset • Digit images: MNIST & SVHN - Shared features: digit class - Private features: number style, background, etc. Image domains as different modalities? • Flower images and text descriptions: Oxford-102 Flowers - Shared features: words and image features present in both modalities - Private features: words and image features exclusive to their modality
16. Related work • Multimodal generation and joint multimodal VAEs (e.g., JMVAE, MVAE) The learning of a common disentangled embedding (i.e., private-shared) is often ignored Only some works in image-to-image translation separate “content” (~shared) and “style” (~private) in the latent space (e.g., via an adversarial loss) These apply only between image modalities: not suitable for different modalities such as image and text • Domain adaptation Learning joint embeddings of multimodal observations
17. Proposed method: DMVAE • Generative variational model: introducing separate shared and private spaces Usage: cross-generation (analytical tool) • Representations induced using pairs of individual modalities (encoder, decoder) • Consistency of representations via Product of Experts (PoE). For N modalities:
q(z_s | x_1, x_2, ..., x_N) ∝ p(z_s) ∏_{n=1..N} q(z_s | x_n)
In the VAE, inference networks and priors assume conditional Gaussian forms:
p(z) = N(z | 0, I), q(z | x_n) = N(z | μ_n, C_n)
z_1 ~ q_{φ1}(z | x_1), z_2 ~ q_{φ2}(z | x_2), with z_1 = {z_p1, z_s1}, z_2 = {z_p2, z_s2}
We want: z_s = z_s1 = z_s2 → PoE
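Because all experts are Gaussian, the product q(z_s | x_1, ..., x_N) above has a closed form: precisions add, and the mean is precision-weighted. Below is a small NumPy sketch of that computation under a diagonal-Gaussian assumption; the function and variable names are mine, not the paper's.

```python
# Product-of-Experts for the shared latent: for diagonal Gaussians, the product
# is again Gaussian with summed precisions and a precision-weighted mean.
import numpy as np

def product_of_experts(mus, logvars):
    """mus, logvars: lists of (dim,) arrays, one per expert (prior and each q(z_s|x_n))."""
    precisions = [np.exp(-lv) for lv in logvars]       # 1 / sigma^2 per expert
    joint_prec = np.sum(precisions, axis=0)
    joint_var = 1.0 / joint_prec
    joint_mu = joint_var * np.sum([p * m for p, m in zip(precisions, mus)], axis=0)
    return joint_mu, np.log(joint_var)

# usage: include the prior N(0, I) as an expert with mu=0, logvar=0
dim = 8
mu_s, logvar_s = product_of_experts(
    [np.zeros(dim), np.random.randn(dim), np.random.randn(dim)],   # prior, modality 1, modality 2
    [np.zeros(dim), np.full(dim, -1.0), np.full(dim, -0.5)])
```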
18. Proposed method: DMVAE • Reconstruction inference PoE-induced shared inference allows for inference when one or more modalities are missing Thus, we consider three reconstruction tasks:
- Reconstruct both modalities at the same time: x_1, x_2 → x̂_1, x̂_2 using (z_p1, z_p2, z_s)
- Reconstruct a single modality from its own input: x_1 → x̂_1 using (z_p1, z_s), or x_2 → x̂_2 using (z_p2, z_s)
- Reconstruct a single modality from the opposite modality’s input: x_2 → x̂_1 using (z_p1, z_s), or x_1 → x̂_2 using (z_p2, z_s)
• Loss function Accuracy of reconstruction with the jointly learned shared latent + KL-divergence of each normal distribution Accuracy of cross-modal and self reconstruction + KL-divergence
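As a hedged sketch, the three reconstruction paths and KL terms above could be assembled into one training loss roughly as follows. This is one possible reading of the slide, not the authors' code; every callable (encode1, poe, sample, recon, kl, ...) is a placeholder supplied by the caller.

```python
# Hypothetical assembly of the DMVAE-style loss described above: joint, self and
# cross reconstruction, each pairing a modality's private latent with a shared
# latent, plus KL regularization of every posterior against the prior N(0, I).
def dmvae_loss(x1, x2, encode1, encode2, decode1, decode2, poe, sample, recon, kl):
    zp1, zs1 = encode1(x1)              # private / shared posteriors for modality 1
    zp2, zs2 = encode2(x2)              # private / shared posteriors for modality 2
    zs_both = poe([zs1, zs2])           # consistent shared posterior (PoE over both)

    loss = 0.0
    # 1) reconstruct both modalities at the same time from (z_p1, z_p2, z_s)
    loss += recon(decode1(sample(zp1), sample(zs_both)), x1)
    loss += recon(decode2(sample(zp2), sample(zs_both)), x2)
    # 2) reconstruct each modality from its own input (shared latent from that modality)
    loss += recon(decode1(sample(zp1), sample(zs1)), x1)
    loss += recon(decode2(sample(zp2), sample(zs2)), x2)
    # 3) cross reconstruction: shared latent inferred from the opposite modality
    loss += recon(decode1(sample(zp1), sample(zs2)), x1)
    loss += recon(decode2(sample(zp2), sample(zs1)), x2)
    # KL terms for private and shared posteriors
    loss += kl(zp1) + kl(zp2) + kl(zs1) + kl(zs2) + kl(zs_both)
    return loss
```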
19. Experiments: Digits (image-image) • Evaluation Qualitative: cross-generation between modalities Quantitative: accuracy of the cross-generated images using a pre-trained classifier for each modality - Joint: a sample from z_s generates two image modalities that must be assigned the same class (Figure: input digits and cross-generated outputs for different samples of z_p1 and z_p2)
20. Experiments: Digits (image-image) • Ablation study: DMVAE [2] vs. MMVAE [1]’s shared latent space
21. Experiments: Flowers (image-text) • This task is more complex Instead of the raw image and text, intermediate features are reconstructed • Quantitative evaluation Class recognition (image-to-text) and cosine-similarity retrieval (text-to-image) on the shared latent space • Qualitative evaluation Retrieval
22. Conclusions • Multimodal VAE for disentangling private and shared spaces Improves the representational performance of multimodal VAEs Successful application to image-image and image-text modalities • Shaping a latent space into subspaces that capture the private-shared aspects of the modalities “is important from the perspective of downstream tasks, where better decomposed representations are more amenable for using on a wider variety of tasks”
23. [3] Multimodal VAEs via mutual supervision • Main differences with [1] and [2] A type of multimodal VAE, without private-shared disentanglement Does not rely on factorizations such as MoE or PoE for modeling modality-shared information Instead, it repurposes semi-supervised VAEs for combining inter-modality information - Allows learning from partially-observed modalities (Reg. = KL divergence) • Proposed method: Mutually supErvised Multimodal vaE (MEME)
24. [3] Multimodal VAEs via mutual supervision • Qualitative evaluation Cross-modal generation • Quantitative evaluation Coherence: percentage of matching predictions of the cross-generated modality using a pretrained classifier Relatedness: Wasserstein distance between the representations of two modalities (closer if they share the same class)
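For reference, if the two representations are taken to be diagonal Gaussians (an assumption on my part, not a detail stated in the slides), the 2-Wasserstein distance used for relatedness has a simple closed form:

```python
# Squared 2-Wasserstein distance between two diagonal Gaussians N(mu, diag(sigma^2));
# a generic sketch, not necessarily the exact variant used in the MEME paper.
import numpy as np

def w2_squared(mu1, sigma1, mu2, sigma2):
    return np.sum((mu1 - mu2) ** 2) + np.sum((sigma1 - sigma2) ** 2)
```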
25. Final remarks
26. Final remarks • VAEs are useful not only for generation but also for reconstruction and disentanglement tasks Recommended textbook: “An Introduction to Variational Autoencoders”, Kingma & Welling • Private-shared latent spaces as an effective tool for analyzing multimodal data • There is still a lot of potential in this research It has been applied to only a limited number of multimodal problems • PhD students interested in this topic → we are recruiting interns: https://www.cyberagent.co.jp/news/detail/id=27453 Other topics are also fine! Joint research is also very welcome!
27. Thank you! Antonio TEJERO DE PABLOS antonio_tejero@cyberagent.co.jp
