AGREEMENT
• If you plan to share these slides or to use the content in these slides for your own work,
please include the following reference:
Tejero-de-Pablos A. (2022) “VAEs for multimodal disentanglement”. All Japan Computer Vision Study Group.
VAEs for multimodal
disentanglement
2022/05/15
Antonio TEJERO DE PABLOS
antonio_tejero@cyberagent.co.jp
1. Self-introduction
2. Background
3. Paper introduction
4. Final remarks
Self-introduction
Antonio TEJERO DE PABLOS
Background
• Present: Research scientist @ CyberAgent (AI Lab)
• ~2021: Researcher @ U-Tokyo (Harada Lab) & RIKEN (AIP)
• ~2017: PhD @ NAIST (Yokoya Lab)
Research interests
• Learning of multimodal data (RGB, depth, audio, text)
• and its applications (action recognition, advertisement
classification, etc.)
Field: Computer Vision
Background
What is a VAE?
• Auto-encoder • Variational auto-encoder
With the proper regularization (a KL term that pulls the encoder's posterior toward the prior), the latent space becomes smooth enough to sample from and interpolate in.
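For reference, here is a minimal VAE sketch in PyTorch (my own illustrative code, not from the original slides): the encoder outputs μ and log σ², the reparameterization trick samples z, and the loss adds the KL regularizer to the reconstruction term.

```python
# Minimal VAE sketch (illustrative; hyperparameters are arbitrary).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # outputs logits

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo_loss(x_logits, x, mu, logvar):
    # Reconstruction + KL(q(z|x) || N(0, I)); the KL term is the
    # "proper regularization" that distinguishes a VAE from a plain AE.
    rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```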
There is more!
• Vector Quantized-VAE
Quantize the bottleneck using a discrete codebook
There are a number of algorithms (like transformers) that are designed to work on discrete data, so we
would like to have a discrete representation of the data for these algorithms to use.
Advantages of VQ-VAE:
- Simplified latent space (easier to train)
- Likelihood-based model: does not suffer from mode collapse or lack of diversity
- Real-world data favors a discrete representation (the number of images that "make sense" is, in a way, finite)
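A minimal sketch of the quantization bottleneck, assuming a flat (batch, code_dim) encoder output; the nearest-neighbor lookup, commitment loss, and straight-through gradient copy follow the original VQ-VAE formulation.

```python
# VQ-VAE bottleneck sketch: snap each encoder output to its nearest
# codebook vector and pass gradients straight through the lookup.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta  # commitment cost

    def forward(self, z_e):  # z_e: (batch, code_dim)
        dists = torch.cdist(z_e, self.codebook.weight)  # (batch, num_codes)
        idx = dists.argmin(dim=1)                       # nearest code index
        z_q = self.codebook(idx)
        # Codebook loss (move codes toward encodings) + commitment loss
        vq_loss = ((z_q - z_e.detach()) ** 2).mean() \
                + self.beta * ((z_e - z_q.detach()) ** 2).mean()
        # Straight-through estimator: gradients flow from z_q back to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, vq_loss
```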
Why are VAEs cool?
• Usage of VAEs (state-of-the-art)
Multimodal generation (DALL-E)
Representation learning, latent space disentanglement
Paper introduction
An interesting usage of VAEs for disentangling
multimodal data that grabbed my attention
Today I’m introducing:
1) Shi, Y., Paige, B., & Torr, P. (2019). Variational mixture-of-experts autoencoders for multi-modal
deep generative models. Advances in Neural Information Processing Systems, 32.
2) Lee, M., & Pavlovic, V. (2021). Private-shared disentangled multimodal VAE for learning of
latent representations. Conference on Computer Vision and Pattern Recognition (pp. 1692-
1700).
3) Joy, T., Shi, Y., Torr, P. H., Rainforth, T., Schmon, S. M., & Siddharth, N. (2022). Learning
Multimodal VAEs through Mutual Supervision. International Conference on Learning
Representations.
Motivation and goal
• Importance of multimodal data
Learning in the real world involves multiple perspectives: visual, auditory, linguistic
Understanding them individually allows only a partial learning of concepts
• Understanding how different modalities work together is not trivial
A similar joint-embedding process happens in the brain for reasoning and understanding
• Multimodal VAEs facilitate representation learning on data with multiple views/modalities
Capture common underlying factors between the modalities
Motivation and goal
• Normally, only the shared aspects of modalities are modeled
The private information of each modality is totally LOST
E.g., image captioning
• Leverage VAE’s latent space for disentanglement
Private spaces are leveraged for modeling the disjoint properties of each
modality, and cross-modal generation
• Basically, such disentanglement can be used as:
An analytical tool to understand how modalities intertwine
A way of cross-generating modalities
Motivation and goal
• [1] and [2] propose a similar methodology
According to [1], a true multimodal generative model should meet four criteria: latent factorization into shared and private subspaces, coherent joint generation over all modalities, coherent cross-generation across individual modalities, and improved learning for individual modalities through multimodal integration
Today I will introduce [2] (most recent), and explain briefly the differences with [3]
Dataset
• Digit images: MNIST & SVHN
- Shared features: Digit class
- Private features: Number style, background, etc.
Image domains as different modalities?
• Flower images and text description: Oxford-102 Flowers
- Shared features: Words and image features present in both
modalities
- Private features: Words and image features exclusive to their own modality
Related work
• Multimodal generation and joint multimodal VAEs (e.g., JMVAE, MVAE)
The learning of a common disentangled embedding (i.e., private-shared) is often ignored
Only some works in image-to-image translation separate "content" (~shared) and "style" (~private) in the latent space (e.g., via adversarial losses)
These methods work exclusively between image modalities and are not suitable for heterogeneous modalities such as image and text
• Domain adaptation
Learning joint embeddings of multimodal observations
Proposed method: DMVAE
• Generative variational model: Introducing separate shared and private spaces
Usage: Cross-generation (analytical tool)
• Representations induced using pairs of individual modalities (encoder, decoder)
• Consistency of representations via Product of Experts (PoE). For a number of modalities N:
𝑞 𝑧! 𝑥", 𝑥#, ⋯ , 𝑥$ ∝ 𝑝(𝑧!) *
%&"
$
𝑞(𝑧!|𝑥%)
In VAE, inference networks and priors assume conditional Gaussian forms
𝑝 𝑧 = 𝑁 𝑧 0, 𝐼 , 𝑞 𝑧 𝑥% = 𝑁 𝑧 𝜇%, 𝐶%
𝑧"~𝑞'!
𝑧 𝑥" , 𝑧#~𝑞'"
𝑧 𝑥#
𝑧" = 𝑧(!
, 𝑧!!
, 𝑧# = 𝑧("
, 𝑧!"
We want: 𝑧) = 𝑧!!
= 𝑧!"
→ PoE
Proposed method: DMVAE
• Reconstruction inference
PoE-induced shared inference allows for inference when one or more modalities are missing
Thus, we consider three reconstruction tasks:
- Reconstruct both modalities at the same time: $x_1, x_2 \to \hat{x}_1, \hat{x}_2$ using $(z_{p_1}, z_{p_2}, z_s)$
- Reconstruct a single modality from its own input: $x_1 \to \hat{x}_1$ using $(z_{p_1}, z_s)$, or $x_2 \to \hat{x}_2$ using $(z_{p_2}, z_s)$
- Reconstruct a single modality from the opposite modality's input: $x_2 \to \hat{x}_1$ using $(z_{p_1}, z_s)$, or $x_1 \to \hat{x}_2$ using $(z_{p_2}, z_s)$
• Loss function
Reconstruction accuracy for the jointly learned shared latent + KL divergence of each Gaussian posterior
Cross-modal and self-reconstruction accuracy + KL divergence
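The following is a hedged sketch of how these terms could be assembled into one objective; the model.encode*/decode* helpers are hypothetical placeholders (not the authors' API), and it reuses the poe() sketch from above.

```python
# Hypothetical assembly of the DMVAE objective: joint, self-, and
# cross-reconstruction terms, with each Gaussian posterior KL-regularized.
import torch
import torch.nn.functional as F

def sample(mu, logvar):  # reparameterized sample
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl(mu, logvar):      # KL(N(mu, sigma^2) || N(0, I))
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def dmvae_loss(model, x1, x2):
    # Each encoder yields a private and a shared Gaussian posterior.
    (mu_p1, lv_p1), (mu_s1, lv_s1) = model.encode1(x1)
    (mu_p2, lv_p2), (mu_s2, lv_s2) = model.encode2(x2)
    mu_s, lv_s = poe([mu_s1, mu_s2], [lv_s1, lv_s2])  # fused shared posterior
    z_p1, z_p2 = sample(mu_p1, lv_p1), sample(mu_p2, lv_p2)
    z_s, z_s1, z_s2 = sample(mu_s, lv_s), sample(mu_s1, lv_s1), sample(mu_s2, lv_s2)
    # 1) Joint reconstruction from the PoE shared code
    loss = F.mse_loss(model.decode1(z_p1, z_s), x1, reduction='sum') \
         + F.mse_loss(model.decode2(z_p2, z_s), x2, reduction='sum')
    # 2) Self-reconstruction from each modality's own shared code
    loss += F.mse_loss(model.decode1(z_p1, z_s1), x1, reduction='sum') \
          + F.mse_loss(model.decode2(z_p2, z_s2), x2, reduction='sum')
    # 3) Cross-reconstruction: shared code taken from the other modality
    loss += F.mse_loss(model.decode1(z_p1, z_s2), x1, reduction='sum') \
          + F.mse_loss(model.decode2(z_p2, z_s1), x2, reduction='sum')
    # KL regularization of every Gaussian posterior against N(0, I)
    for mu, lv in [(mu_p1, lv_p1), (mu_p2, lv_p2), (mu_s, lv_s),
                   (mu_s1, lv_s1), (mu_s2, lv_s2)]:
        loss += kl(mu, lv)
    return loss
```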
Experiments: Digits (image-image)
• Evaluation
Qualitative: Cross-generation between modalities
Quantitative: Accuracy of the cross-generated images using a pre-trained classifier for each modality
- Joint: A sample from z_s generates two image modalities that must be assigned the same class (this protocol is sketched below)
(Figure: input digits in one modality, and outputs cross-generated for different samples of z_p2 and z_p1.)
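The accuracy protocol can be sketched as follows (model.cross_generate_1to2 and clf2 are hypothetical stand-ins for a cross-generation pass and the pre-trained classifier of the target modality):

```python
# Sketch of the cross-generation accuracy metric for the digit setup:
# cross-generate x2 from x1, then check the predicted class against y.
import torch

@torch.no_grad()
def cross_generation_accuracy(model, clf2, loader):
    correct = total = 0
    for x1, y in loader:                        # e.g., MNIST images + labels
        x2_hat = model.cross_generate_1to2(x1)  # hypothetical helper
        pred = clf2(x2_hat).argmax(dim=1)       # SVHN-side classifier
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```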
Experiments: Digits (image-image)
• Ablation study: DMVAE [2] vs MMVAE [1]’s shared latent space
Experiments: Flowers (image-text)
• This task is more complex
Instead of the raw image and text, intermediate feature representations are reconstructed
• Quantitative evaluation
Class recognition (image-to-text) and cosine-similarity retrieval (text-to-image) on the shared latent space
• Qualitative evaluation
Retrieval
Conclusions
• Multimodal VAE for disentangling private and shared spaces
Improve the representational performance of multimodal VAEs
Successful application to image-image and image-text modalities
• Shaping a latent space into subspaces that capture the private-shared aspects of the
modalities
“is important from the perspective of downstream tasks, where better decomposed representations are more
amenable for using on a wider variety of tasks”
[3] Multimodal VAEs via mutual supervision
• Main differences with [1] and [2]
A type of multimodal VAE, without private-shared disentanglement
Does not rely on factorizations such as MoE or PoE for modeling modality-shared information
Instead, it repurposes semi-supervised VAEs for combining inter-modality information
- Allows learning from partially-observed modalities (Reg. = KL divergence)
• Proposed method: Mutually supErvised Multimodal vaE (MEME)
[3] Multimodal VAEs via mutual supervision
• Qualitative evaluation
Cross-modal generation
• Quantitative evaluation
Coherence: Percentage of matching predictions of the cross-generated modality using a pretrained classifier
Relatedness: Wasserstein distance between the representations of two modalities (closer if same class)
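As an illustration, for diagonal Gaussian posteriors the squared 2-Wasserstein distance has a simple closed form, W2² = ||μ1 − μ2||² + ||σ1 − σ2||²; a minimal sketch under that assumption:

```python
# Squared 2-Wasserstein distance between two diagonal Gaussians
# (a sketch of the relatedness measure; assumes diagonal covariances).
import torch

def w2_squared_diag(mu1, logvar1, mu2, logvar2):
    s1, s2 = torch.exp(0.5 * logvar1), torch.exp(0.5 * logvar2)
    return ((mu1 - mu2) ** 2).sum(dim=-1) + ((s1 - s2) ** 2).sum(dim=-1)

# Paired representations of the same class should yield a smaller W2
# than pairs from different classes.
```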
Final remarks
• VAEs serve not only generation but also reconstruction and disentanglement tasks
Recommended textbook: “An Introduction to Variational Autoencoders”, Kingma & Welling
• Private-shared latent spaces as an effective tool for analyzing multimodal data
• There is still a lot of potential for this research
It has only been applied to a limited number of multimodal problems so far
• PhD students interested in this topic → we are recruiting interns
https://www.cyberagent.co.jp/news/detail/id=27453
Other research topics are also welcome!
Joint research collaborations are also very welcome!
Thank you very much!
Antonio TEJERO DE PABLOS
antonio_tejero@cyberagent.co.jp