2022-04-21
Sangmin Woo
Computational Intelligence Lab.
School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
Multimodal Learning with Severely Missing Modality
AAAI 2021
Background: Multimodal Learning
Multimodal learning utilizes complementary information contained in
multimodal data to improve the performance of various computer vision
tasks
Modality Fusion
 Early fusion is a common method that fuses different modalities by feature concatenation
 Product operations allow more interaction among different modalities during the fusion process (see the sketch below)
Missing Modalities for Multimodal Learning
 Testing-time modality missing [1]
 Learning with data from unpaired modalities [2]
[1] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[2] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
Background: Meta-regularization
Meta Learning
 Meta-learning algorithms focus on designing models that are able to learn
new knowledge and adapt to novel environments quickly with only a few
training samples
 E.g., metric learning, probabilistic modeling, optimization-based
approaches (e.g., MAML)
 MAML is compatible with models that learn through gradient descent
 This work extends MAML by learning two auxiliary networks for missing modality reconstruction and feature regularization (a minimal MAML sketch follows this list)
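For context, here is a minimal, generic MAML inner/outer loop in PyTorch on a toy regression task. This is standard MAML, not the paper's exact training code; the model, tasks, and step sizes are toy choices.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.01

def task_loss(params, x, y):
    # Functional forward pass using explicit parameters (weight, bias).
    pred = x @ params[0].t() + params[1]
    return ((pred - y) ** 2).mean()

for step in range(100):
    meta_opt.zero_grad()
    for _ in range(4):  # a few tasks per meta-batch
        x_s, y_s = torch.randn(16, 8), torch.randn(16, 1)  # support set
        x_q, y_q = torch.randn(16, 8), torch.randn(16, 1)  # query set
        params = list(model.parameters())
        # Inner loop: one gradient step on the support set.
        grads = torch.autograd.grad(task_loss(params, x_s, y_s), params,
                                    create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer loop: the query loss of the adapted parameters backprops
        # through the inner update to the original parameters.
        task_loss(adapted, x_q, y_q).backward()
    meta_opt.step()
```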
Regularization
 Conventional regularization techniques regularize model parameters to
avoid overfitting and increase interpretability
 Beyond perturbing features, this work regularizes features by learning to reduce the discrepancy between the reconstructed and true modality features (sketched below)
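A minimal sketch of this reconstruction-discrepancy regularizer, assuming a simple MLP reconstruction network and an L2 discrepancy term; recon_net, lambda_reg, and all shapes are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Hypothetical reconstruction network: maps the available modality's
# feature to an estimate of the other modality's feature.
recon_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))

txt_feat = torch.randn(4, 8)        # available-modality feature
img_feat_true = torch.randn(4, 8)   # true feature of the other modality

img_feat_hat = recon_net(txt_feat)
recon_reg = ((img_feat_hat - img_feat_true) ** 2).mean()  # discrepancy term
# total_loss = task_loss + lambda_reg * recon_reg  (lambda_reg: hyperparameter)
```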
Background: Multimodal Generative Models
Generative Models for Multimodal Learning
 Cross-modal generation approaches learn a conditional generative model
over all modalities
 E.g., conditional VAE (CVAE), conditional multimodal auto-encoder
 Joint-modal generation approaches learn the joint distribution of
multimodal data
 E.g., multimodal variational autoencoder (MVAE), joint multimodal VAE (JMVAE)
Multimodal Learning
Multimodal Learning
 A common assumption in multimodal learning is the completeness of
training data, i.e., full modalities are available in all training examples [1]
 However, such an assumption may not always hold in the real world due to privacy concerns or budget limitations
 Incompleteness of test modalities has been studied [2, 3]
 Incompleteness of training modalities remains unaddressed
Question: can we learn a multimodal model from an incomplete dataset whose performance is as close as possible to that of a model learned from a full-modality dataset?
[1] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A. Y. Multimodal deep learning. ICML 2011.
[2] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[3] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
Multimodal Learning
Multimodal Learning Configurations
 In [1], modalities are partially missing in testing examples
 In [2], modalities are unpaired in training examples
 This work considers an even more challenging setting where both training and testing data contain samples with missing modalities
[1] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[2] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
Overview
Problem
 Consider a multimodal dataset containing two modalities, where modalities are severely missing (e.g., in 90% of the samples)
 Objective: build a unified model that can handle missing modalities in training, testing, or both, and that achieves performance comparable to a model trained on a full-modality dataset
Two Perspectives to Address the Problem
 Flexibility: how to uniformly handle missing modality in training, testing, or both?
 Efficiency: how to improve training efficiency when the majority of the data suffers from missing modalities?
Approach
 Bayesian meta-learning framework
 The key idea is to perturb the latent feature space so that embeddings of a single modality can approximate those of full modality
 This is preferable to typical generative methods (e.g., AE, VAE, GAN), which often require a significant amount of full-modality data to learn from
Flexibility & Efficiency
Flexibility
 Employ a feature reconstruction network that leverages the available modality to
generate an approximation of the missing modality feature
 This will generate complete data in the feature space
 During training, the model can exploit the full potential of both modality-complete and modality-incomplete data
 During testing, by turning the feature reconstruction network on or off, the model can handle modality-complete or modality-incomplete inputs in a unified manner (see the sketch below)
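A hedged sketch of this on/off behavior, assuming simple linear encoders and concatenation fusion; all module names (e.g., recon_txt_from_img) and dimensions are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

img_enc = nn.Linear(32, 8)
txt_enc = nn.Linear(50, 8)
recon_txt_from_img = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
classifier = nn.Linear(16, 23)  # concatenation fusion; 23 classes as in MM-IMDb

def forward(img, txt=None):
    f_img = img_enc(img)
    # Turn the reconstruction network on only when the text modality is missing,
    # so downstream fusion always sees a "complete" pair of features.
    f_txt = txt_enc(txt) if txt is not None else recon_txt_from_img(f_img)
    return classifier(torch.cat([f_img, f_txt], dim=-1))

logits_full = forward(torch.randn(4, 32), torch.randn(4, 50))  # modality-complete
logits_miss = forward(torch.randn(4, 32))                      # text missing
```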
Efficiency
 In the severely missing modality setting, the feature reconstruction network is highly bias-prone, yielding degraded, low-quality generated features
 Directly training a model on such degraded features hinders the efficiency of the training process
 A feature regularization approach is adopted to address this issue
 The idea is to leverage a Bayesian neural network to assess the data uncertainty by performing feature perturbations (sketched below)
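The following sketch illustrates the general idea of uncertainty-aware feature perturbation with a reparameterized Gaussian; it stands in for the paper's Bayesian network, and uncertainty_net and all shapes are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical network predicting a Gaussian over features:
# outputs mean and log-variance (8 + 8 dims).
uncertainty_net = nn.Linear(8, 16)

def perturb(feat):
    mu, log_var = uncertainty_net(feat).chunk(2, dim=-1)
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps  # reparameterized sample

recon_feat = torch.randn(4, 8)  # e.g., output of the reconstruction network
samples = [perturb(recon_feat) for _ in range(5)]
variance = torch.stack(samples).var(dim=0)  # spread reflects data uncertainty
```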
Dataset
Multimodal IMDb (MM-IMDb)
 Image, text
 Predict movie genre using image or text modality
 Multi-label classification (multiple genres could be assigned to a single movie)
 25,956 movies
 23 classes
 Evaluation metrics: F1 Samples and F1 Micro
CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
 Image, text, audio
 Predict the sentiment class of the clips
 Binary classification (negative / positive)
 2,199 opinion video clips (from YouTube movie reviews)
 Evaluation metrics: F1 Score
Audiovision-MNIST (av-MNIST)
 Image, audio
 Digit classification (0-9)
 1,500 image-audio pairs
 Evaluation metric: accuracy (see the metric sketch below)
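For reference, the listed metrics correspond to standard scikit-learn calls; the papers' exact evaluation scripts may differ, and the toy labels below are made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Multi-label case (MM-IMDb): y is a binary indicator matrix over genres.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
f1_samples = f1_score(y_true, y_pred, average='samples')  # per-example F1, averaged
f1_micro = f1_score(y_true, y_pred, average='micro')      # global TP/FP/FN counts

# Binary (CMU-MOSI) and multi-class (av-MNIST) cases use label vectors.
f1_binary = f1_score([1, 0, 1, 1], [1, 0, 0, 1])
acc = accuracy_score([3, 7, 1], [3, 7, 2])
```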
Baseline
Lower-bound
 Model trained using a single modality of the data
 i.e., 100% image only or 100% text only
Upper-bound
 Model trained using all modalities of the data
 i.e., 100% image and 100% text
Autoencoder (AE) / GAN
 First, sample a dataset containing only modality-complete samples from the
original dataset
 Then, assume one modality is missing and train AE to reconstruct the missing
modality
 Finally, impute the missing modality of modality-incomplete data using the trained
AE
 After imputation, the completed dataset is used for standard multimodal learning (pipeline sketched below)
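A minimal sketch of this imputation pipeline, assuming a simple text-to-image feature translator trained with MSE; the architecture, dimensions, and training schedule are illustrative.

```python
import torch
import torch.nn as nn

ae = nn.Sequential(nn.Linear(50, 32), nn.ReLU(), nn.Linear(32, 32))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

# Steps 1-2: train on modality-complete pairs to reconstruct the "missing" one.
txt_complete, img_complete = torch.randn(256, 50), torch.randn(256, 32)
for _ in range(100):
    opt.zero_grad()
    loss = ((ae(txt_complete) - img_complete) ** 2).mean()
    loss.backward()
    opt.step()

# Step 3: impute the missing image features of modality-incomplete data.
txt_incomplete = torch.randn(64, 50)
with torch.no_grad():
    img_imputed = ae(txt_incomplete)
# The imputed dataset can now be used for standard multimodal training.
```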
Multimodal Variational Autoencoder (MVAE)
 Linear evaluation protocol: first train the MVAE using all modalities → freeze the MVAE and train a randomly initialized linear classifier (see the probe sketch below)
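A generic linear-probing sketch: a frozen linear layer stands in for the pretrained MVAE encoder, and the shapes, labels, and optimizer settings are toy choices.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(32, 16)          # stand-in for the trained MVAE encoder
for p in encoder.parameters():
    p.requires_grad = False          # freeze the pretrained model

probe = nn.Linear(16, 23)            # randomly initialized linear classifier
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

x, y = torch.randn(64, 32), torch.randint(0, 23, (64,))
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(probe(encoder(x)), y)
    loss.backward()
    opt.step()
```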
Results
Multimodal IMDb (MM-IMDb) and CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
[Result tables from the slide are not recoverable in this text export]
Results
Audiovision-MNIST (av-MNIST) and CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
[Result tables from the slide are not recoverable in this text export]
Thank You!
Sangmin Woo
sangminwoo.github.io
smwoo95@kaist.ac.kr
Q&A
