2022-04-21
Sangmin Woo
Computational Intelligence Lab.
School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
Multimodal Learning with Severely Missing Modality
AAAI 2021
Background: Multimodal Learning
Multimodal learning utilizes complementary information contained in
multimodal data to improve the performance of various computer vision
tasks
Modality Fusion
 Early fusion is a common method that fuses different modalities by feature concatenation
 Product operations allow more interaction among different modalities during the fusion process (see the sketch below)
Missing Modalities for Multimodal Learning
 Testing-time modality missing [1]
 Learning with data from unpaired modalities [2]
[1] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[2] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
Background: Meta-regularization
Meta Learning
 Meta-learning algorithms focus on designing models that are able to learn
new knowledge and adapt to novel environments quickly with only a few
training samples
 E.g., metric learning, probabilistic modeling, optimization-based
approaches (e.g., MAML)
 MAML is compatible with models that learn through gradient descent
 This work extends MAML by learning two auxiliary networks for missing modality reconstruction and feature regularization (a minimal MAML sketch follows this list)
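For context, here is a minimal, generic MAML inner/outer loop in PyTorch on a toy regression task. This is standard MAML, not the paper's exact training code; the model, tasks, and step sizes are toy choices.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.01

def task_loss(params, x, y):
    # Functional forward pass using explicit parameters (weight, bias).
    pred = x @ params[0].t() + params[1]
    return ((pred - y) ** 2).mean()

for step in range(100):
    meta_opt.zero_grad()
    for _ in range(4):  # a few tasks per meta-batch
        x_s, y_s = torch.randn(16, 8), torch.randn(16, 1)  # support set
        x_q, y_q = torch.randn(16, 8), torch.randn(16, 1)  # query set
        params = list(model.parameters())
        # Inner loop: one gradient step on the support set.
        grads = torch.autograd.grad(task_loss(params, x_s, y_s), params,
                                    create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer loop: the query loss of the adapted parameters backprops
        # through the inner update to the original parameters.
        task_loss(adapted, x_q, y_q).backward()
    meta_opt.step()
```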
Regularization
 Conventional regularization techniques regularize model parameters to
avoid overfitting and increase interpretability
 Beyond perturbing features, this work regularizes features by learning to reduce the discrepancy between the reconstructed and true modality features (sketched below)
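A minimal sketch of this reconstruction-discrepancy regularizer, assuming a simple MLP reconstruction network and an L2 discrepancy term; recon_net, lambda_reg, and all shapes are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Hypothetical reconstruction network: maps the available modality's
# feature to an estimate of the other modality's feature.
recon_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))

txt_feat = torch.randn(4, 8)        # available-modality feature
img_feat_true = torch.randn(4, 8)   # true feature of the other modality

img_feat_hat = recon_net(txt_feat)
recon_reg = ((img_feat_hat - img_feat_true) ** 2).mean()  # discrepancy term
# total_loss = task_loss + lambda_reg * recon_reg  (lambda_reg: hyperparameter)
```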
Background: Multimodal Generative Models
Generative Models for Multimodal Learning
 Cross-modal generation approaches learn a conditional generative model
over all modalities
 E.g., conditional VAE (CVAE), conditional multimodal auto-encoder
 Joint-modal generation approaches learn the joint distribution of
multimodal data
 E.g., multimodal variational autoencoder (MVAE), joint multimodal VAE (JMVAE)
Multimodal Learning
Multimodal Learning
 A common assumption in multimodal learning is the completeness of
training data, i.e., full modalities are available in all training examples [1]
 However, such an assumption may not always hold in the real world due to privacy concerns or budget limitations
 Incompleteness of test modalities has been studied [2, 3]
 Incompleteness of training modalities remains unaddressed
Question: can we learn a multimodal model from an incomplete dataset whose performance is as close as possible to that of a model learned from a full-modality dataset?
[1] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A. Y. Multimodal deep learning. ICML 2011.
[2] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[3] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
Multimodal Learning
Multimodal Learning Configurations
 In [1], modalities are partially missing in testing examples
 In [2], modalities are unpaired in training examples
 This work considers an even more challenging setting where both training and testing data contain samples with missing modalities
[1] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[2] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
Overview
Problem
 Consider a multimodal dataset containing two modalities, where modalities are severely missing (e.g., in 90% of the samples)
 Objective: build a unified model that can handle missing modalities in training, testing, or both, and that achieves performance comparable to a model trained on a full-modality dataset
Two Perspectives to Address the Problem
 Flexibility: how to uniformly handle missing modality in training, testing, or both?
 Efficiency: how to improve training efficiency when the majority of the data suffers from missing modalities?
Approach
 Bayesian meta-learning framework
 The key idea is to perturb the latent feature space so that embeddings of a single modality can approximate those of full modality
 This is preferable to typical generative methods (e.g., AE, VAE, GAN), which often require a significant amount of full-modality data to learn from
Flexibility & Efficiency
Flexibility
 Employ a feature reconstruction network that leverages the available modality to
generate an approximation of the missing modality feature
 This will generate complete data in the feature space
 During training, the model can exploit the full potential of both modality-complete and modality-incomplete data
 During testing, by turning the feature reconstruction network on or off, the model can handle modality-complete or modality-incomplete inputs in a unified manner (see the sketch below)
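A hedged sketch of this on/off behavior, assuming simple linear encoders and concatenation fusion; all module names (e.g., recon_txt_from_img) and dimensions are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

img_enc = nn.Linear(32, 8)
txt_enc = nn.Linear(50, 8)
recon_txt_from_img = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
classifier = nn.Linear(16, 23)  # concatenation fusion; 23 classes as in MM-IMDb

def forward(img, txt=None):
    f_img = img_enc(img)
    # Turn the reconstruction network on only when the text modality is missing,
    # so downstream fusion always sees a "complete" pair of features.
    f_txt = txt_enc(txt) if txt is not None else recon_txt_from_img(f_img)
    return classifier(torch.cat([f_img, f_txt], dim=-1))

logits_full = forward(torch.randn(4, 32), torch.randn(4, 50))  # modality-complete
logits_miss = forward(torch.randn(4, 32))                      # text missing
```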
Efficiency
 In the severely missing modality setting, the feature reconstruction network is highly bias-prone, yielding degraded, low-quality generated features
 Directly training a model on such degraded features hinders the efficiency of the training process
 A feature regularization approach is adopted to address this issue
 The idea is to leverage a Bayesian neural network to assess the data uncertainty by performing feature perturbations (sketched below)
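The following sketch illustrates the general idea of uncertainty-aware feature perturbation with a reparameterized Gaussian; it stands in for the paper's Bayesian network, and uncertainty_net and all shapes are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical network predicting a Gaussian over features:
# outputs mean and log-variance (8 + 8 dims).
uncertainty_net = nn.Linear(8, 16)

def perturb(feat):
    mu, log_var = uncertainty_net(feat).chunk(2, dim=-1)
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps  # reparameterized sample

recon_feat = torch.randn(4, 8)  # e.g., output of the reconstruction network
samples = [perturb(recon_feat) for _ in range(5)]
variance = torch.stack(samples).var(dim=0)  # spread reflects data uncertainty
```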
Dataset
Multimodal IMDb (MM-IMDb)
 Image, text
 Predict movie genre using image or text modality
 Multi-label classification (multiple genres could be assigned to a single movie)
 25,956 movies
 23 classes
 Evaluation metrics: F1 Samples and F1 Micro
CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
 Image, text, audio
 Predict the sentiment class of the clips
 Binary classification (negative / positive)
 2,199 opinion video clips (from YouTube movie reviews)
 Evaluation metrics: F1 Score
Audiovision-MNIST (av-MNIST)
 Image, audio
 Digit classification (0-9)
 1,500 image-audio pairs
 Evaluation metric: accuracy (see the metric sketch below)
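For reference, the listed metrics correspond to standard scikit-learn calls; the papers' exact evaluation scripts may differ, and the toy labels below are made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Multi-label case (MM-IMDb): y is a binary indicator matrix over genres.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
f1_samples = f1_score(y_true, y_pred, average='samples')  # per-example F1, averaged
f1_micro = f1_score(y_true, y_pred, average='micro')      # global TP/FP/FN counts

# Binary (CMU-MOSI) and multi-class (av-MNIST) cases use label vectors.
f1_binary = f1_score([1, 0, 1, 1], [1, 0, 0, 1])
acc = accuracy_score([3, 7, 1], [3, 7, 2])
```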
Baseline
Lower-bound
 Model trained using a single modality of the data
 i.e., 100% image only or 100% text only
Upper-bound
 Model trained using all modalities of the data
 i.e., 100% image and 100% text
Autoencoder (AE) / GAN
 First, sample a dataset containing only modality-complete samples from the
original dataset
 Then, assume one modality is missing and train AE to reconstruct the missing
modality
 Finally, impute the missing modality of modality-incomplete data using the trained
AE
 After imputation, the completed dataset is used for standard multimodal learning (pipeline sketched below)
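A minimal sketch of this imputation pipeline, assuming a simple text-to-image feature translator trained with MSE; the architecture, dimensions, and training schedule are illustrative.

```python
import torch
import torch.nn as nn

ae = nn.Sequential(nn.Linear(50, 32), nn.ReLU(), nn.Linear(32, 32))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

# Steps 1-2: train on modality-complete pairs to reconstruct the "missing" one.
txt_complete, img_complete = torch.randn(256, 50), torch.randn(256, 32)
for _ in range(100):
    opt.zero_grad()
    loss = ((ae(txt_complete) - img_complete) ** 2).mean()
    loss.backward()
    opt.step()

# Step 3: impute the missing image features of modality-incomplete data.
txt_incomplete = torch.randn(64, 50)
with torch.no_grad():
    img_imputed = ae(txt_incomplete)
# The imputed dataset can now be used for standard multimodal training.
```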
Multimodal Variational Autoencoder (MVAE)
 Linear evaluation protocol: first train the MVAE using all modalities → freeze the MVAE and train a randomly initialized linear classifier (see the probe sketch below)
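A generic linear-probing sketch: a frozen linear layer stands in for the pretrained MVAE encoder, and the shapes, labels, and optimizer settings are toy choices.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(32, 16)          # stand-in for the trained MVAE encoder
for p in encoder.parameters():
    p.requires_grad = False          # freeze the pretrained model

probe = nn.Linear(16, 23)            # randomly initialized linear classifier
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

x, y = torch.randn(64, 32), torch.randint(0, 23, (64,))
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(probe(encoder(x)), y)
    loss.backward()
    opt.step()
```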
Results
Multimodal IMDb (MM-IMDb) and CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
[Result tables from the slide are not recoverable in this text export]
Results
Audiovision-MNIST (av-MNIST) and CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
[Result tables from the slide are not recoverable in this text export]
Thank You!
Sangmin Woo
sangminwoo.github.io
smwoo95@kaist.ac.kr
Q&A
