2022-04-21
Sangmin Woo
Computational Intelligence Lab.
School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
Multimodal Learning with
Severely Missing Modality
AAAI 2021
Background: Multimodal Learning
Multimodal learning utilizes complementary information contained in
multimodal data to improve the performance of various computer vision
tasks
Modality Fusion
 Early fusion is a common method which fuses different modalities by
feature concatenation
 Product operation allows more interactions among different modalities
during the fusion process
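The two fusion styles above can be sketched in a few lines. This is only an illustration: the function names and toy feature vectors are made up here, and the outer product is one possible reading of "product operation" (an element-wise product is another).

```python
def concat_fusion(image_feat, text_feat):
    """Early fusion: concatenate the two feature vectors."""
    return list(image_feat) + list(text_feat)


def product_fusion(image_feat, text_feat):
    """Product fusion (outer product): every image feature meets every text feature."""
    return [[i * t for t in text_feat] for i in image_feat]


# Toy features, chosen only for illustration.
image_feat = [1.0, 2.0]
text_feat = [3.0, 4.0, 5.0]

fused_concat = concat_fusion(image_feat, text_feat)
fused_product = product_fusion(image_feat, text_feat)
```

Concatenation keeps the modalities side by side and leaves any interaction to later layers, while the product form makes pairwise interactions explicit at the cost of a larger fused representation.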
Missing Modalities for Multimodal Learning
 Testing-time modality missing [1]
 Learning with data from unpaired modalities [2]
[1] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[2] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019
Background: Meta-regularization
Meta Learning
 Meta-learning algorithms focus on designing models that are able to learn
new knowledge and adapt to novel environments quickly with only a few
training samples
 E.g., metric learning, probabilistic modeling, optimization-based
approaches (e.g., MAML)
 MAML is compatible with models that learn through gradient descent
 This work extends MAML by learning two auxiliary networks for missing
modality reconstruction and feature regularization
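To make the inner/outer structure of MAML concrete, here is a minimal sketch on a toy problem. It is not the method of this work: it uses the first-order approximation of MAML, a single scalar weight, and hypothetical tasks of the form y = slope * x, all of which are assumptions made for illustration.

```python
def loss_grad(w, slope, xs):
    """Gradient of mean((w*x - slope*x)^2) with respect to w."""
    return sum(2.0 * (w * x - slope * x) * x for x in xs) / len(xs)


def fomaml(task_slopes, xs, inner_lr=0.05, outer_lr=0.05, steps=100):
    """First-order MAML on toy regression tasks that differ only in slope."""
    w = 0.0  # meta-parameter shared across tasks
    for _ in range(steps):
        meta_grad = 0.0
        for slope in task_slopes:
            # Inner loop: one gradient step adapts w to the current task.
            w_task = w - inner_lr * loss_grad(w, slope, xs)
            # Outer loop (first-order): reuse the gradient at the adapted weight.
            meta_grad += loss_grad(w_task, slope, xs)
        w -= outer_lr * meta_grad
    return w


meta_w = fomaml(task_slopes=[1.0, 3.0], xs=[1.0, 2.0])
```

With two symmetric tasks the meta-parameter settles midway between the task slopes, which is the initialization from which one inner step helps both tasks equally.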
Regularization
 Conventional regularization techniques regularize model parameters to
avoid overfitting and increase interpretability
 Beyond perturbing features, this work regularizes the features by
learning to reduce the discrepancy between the reconstructed and true
modality
Background: Multimodal Generative Models
Generative Models for Multimodal Learning
 Cross-modal generation approaches learn a conditional generative model
over all modalities
 E.g., conditional VAE (CVAE), conditional multimodal auto-encoder
 Joint-modal generation approaches learn the joint distribution of
multimodal data
 E.g., multimodal variational autoencoder (MVAE), joint multimodal VAE (JMVAE)
[1] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A. Y. Multimodal deep learning. ICML 2011.
[2] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[3] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019
Multimodal Learning
Multimodal Learning
 A common assumption in multimodal learning is the completeness of
training data, i.e., full modalities are available in all training examples [1]
 However, this assumption may not always hold in the real world due to
privacy concerns or budget limitations
 Incompleteness of test modalities [2, 3]
 Incompleteness of training modalities X
Question: can we learn a multimodal model from an incomplete dataset
whose performance is as close as possible to that of a model trained on
a full-modality dataset?
[1] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A. Y. Multimodal deep learning. ICML 2011.
[2] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[3] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019
Multimodal Learning
Multimodal Learning Configurations
 In [1], modalities are partially missing in testing examples
 In [2], modalities are unpaired in training examples
 This work considers an even more challenging setting where both training and
testing data contain samples that have missing modalities.
[1] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
[2] Pham, H., et al., Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019
Overview
Problem
 Consider a multimodal dataset containing two modalities, where a large share of
samples (e.g., 90%) have missing modalities
 Objective: Build a unified model that can handle missing modalities in training,
testing, or both, and that achieves performance comparable to a model trained on
a full-modality dataset
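The problem setting can be simulated with a small helper. This is a sketch under assumptions: the key names `modality_a` and `modality_b`, the function name, and the choice that only `modality_b` goes missing are hypothetical and not taken from the slides.

```python
import random


def make_incomplete_dataset(num_samples, missing_rate, seed=0):
    """Build a two-modality dataset where modality_b is absent in a fraction of samples."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    num_missing = round(num_samples * missing_rate)
    missing = set(indices[:num_missing])
    return [
        {
            "modality_a": f"a_{i}",  # always available
            "modality_b": None if i in missing else f"b_{i}",  # None marks a missing modality
        }
        for i in range(num_samples)
    ]


dataset = make_incomplete_dataset(num_samples=1000, missing_rate=0.9)
num_complete = sum(1 for s in dataset if s["modality_b"] is not None)
```

At a 90% missing rate only 100 of the 1,000 samples are modality-complete, which is the small pool that imputation-style baselines would have to learn from.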
Two Perspectives to Address the Problem
 Flexibility: how to uniformly handle missing modality in training, testing, or both?
 Efficiency: how to improve training efficiency when most of the data suffers from
missing modalities?
Approach
 Bayesian meta-learning framework
 The key idea is to perturb the latent feature space so that embeddings of single
modality can approximate ones of full modality
 Better suited than typical generative methods (e.g., AE, VAE, GAN), which often
require a significant amount of full-modality data to learn from
Flexibility & Efficiency
Flexibility
 Employ a feature reconstruction network that leverages the available modality to
generate an approximation of the missing modality feature
 This will generate complete data in the feature space
 When training, the model can excavate the full potential of both modality-complete
and modality-incomplete data
 When testing, by turning on or off the feature reconstruction network, the model
can tackle modality-complete or modality-incomplete inputs in a unified manner
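The on/off behaviour of the reconstruction network can be sketched as a single forward function. Everything here is hypothetical: `reconstruct_missing` is a fixed affine map standing in for a learned network, and fusion by concatenation is an assumption made for illustration.

```python
def reconstruct_missing(available_feat, weight=0.5, bias=0.1):
    """Hypothetical reconstruction network: a fixed affine map, not a learned model."""
    return [weight * v + bias for v in available_feat]


def fuse(feat_a, feat_b):
    """Concatenate the two modality features (assumed fusion rule)."""
    return list(feat_a) + list(feat_b)


def forward(feat_a, feat_b=None):
    """Use the true feature when present; otherwise switch on reconstruction."""
    if feat_b is None:
        feat_b = reconstruct_missing(feat_a)
    return fuse(feat_a, feat_b)


complete_out = forward([1.0, 2.0], [0.3, 0.7])  # modality-complete input
incomplete_out = forward([1.0, 2.0])            # modality-incomplete input
```

Both calls return a fused feature of the same size, so the downstream classifier never needs to know whether the second modality was observed or reconstructed.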
Efficiency
 In the severely missing modality setting, the feature reconstruction network would be
highly bias-prone, which yields degraded and low-quality feature generation
 Directly training a model on degraded, low-quality features hinders the
efficiency of the training process
 A feature regularization approach is adopted to address this issue
 The idea is to leverage a Bayesian neural network to assess the data uncertainty
by performing feature perturbations
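One way to picture uncertainty assessment through feature perturbation is a Monte Carlo estimate. This sketch is an assumption-heavy illustration: the slides do not give the parameterization, so additive Gaussian noise is used, and the fixed `mean` and `std` stand in for values a Bayesian network would predict.

```python
import random
import statistics


def perturb(feature, mean, std, rng):
    """Add Gaussian noise drawn from N(mean, std^2) to each feature dimension."""
    return [v + rng.gauss(mean, std) for v in feature]


def monte_carlo_uncertainty(feature, mean, std, num_samples=2000, seed=0):
    """Estimate per-dimension mean and spread of the perturbed feature."""
    rng = random.Random(seed)
    draws = [perturb(feature, mean, std, rng) for _ in range(num_samples)]
    dims = list(zip(*draws))  # regroup the draws by feature dimension
    means = [statistics.fmean(d) for d in dims]
    spreads = [statistics.stdev(d) for d in dims]
    return means, spreads


mc_mean, mc_std = monte_carlo_uncertainty([0.6, 1.1], mean=0.0, std=0.2)
```

A larger spread signals a less trustworthy reconstructed feature, which is the kind of signal a regularizer could use to down-weight low-quality features during training.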
Dataset
Multimodal IMDb (MM-IMDb)
 Image, text
 Predict movie genre using image or text modality
 Multi-label classification (multiple genres could be assigned to a single movie)
 25,956 movies
 23 classes
 Evaluation metrics: F1 Samples and F1 Micro
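The two multi-label metrics differ in how they average. The sketch below computes both from binary label matrices; the function names and the toy labels are illustrative, and a zero denominator is scored as 0.0 by assumption.

```python
def counts(true_row, pred_row):
    """True positives, false positives, false negatives for one sample."""
    tp = sum(1 for t, p in zip(true_row, pred_row) if t and p)
    fp = sum(1 for t, p in zip(true_row, pred_row) if not t and p)
    fn = sum(1 for t, p in zip(true_row, pred_row) if t and not p)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_micro(y_true, y_pred):
    """Pool counts over all samples and labels, then compute one F1."""
    totals = [sum(c) for c in zip(*(counts(t, p) for t, p in zip(y_true, y_pred)))]
    return f1_from_counts(*totals)


def f1_samples(y_true, y_pred):
    """Compute F1 per sample, then average over samples."""
    scores = [f1_from_counts(*counts(t, p)) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


y_true = [[1, 1, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [1, 0, 0]]
micro = f1_micro(y_true, y_pred)      # 0.75
samples = f1_samples(y_true, y_pred)  # 0.5
```

In this toy case the first movie is predicted perfectly and the second entirely wrong, so the per-sample average is 0.5, while pooling the counts rewards the three correct genre assignments and gives 0.75.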
CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
 Image, text, audio
 Predict the sentiment class of the clips
 Binary classification (negative / positive)
 2,199 opinion video clips (from YouTube movie reviews)
 Evaluation metrics: F1 Score
Audiovision-MNIST (av-MNIST)
 Image, audio
 0~9 classification
 1,500 image & audio modality
 Evaluation metrics: Accuracy
Baseline
Lower-bound
 Model trained using a single modality of the data
 i.e., 100% image, 100% text
Upper-bound
 Model trained using all modalities of the data
 i.e., 100% image and 100% text
Autoencoder (AE) / GAN
 First, sample a dataset containing only modality-complete samples from the
original dataset
 Then, assume one modality is missing and train an AE to reconstruct the missing
modality
 Finally, impute the missing modality of modality-incomplete data using the trained
AE
 After finishing the imputation, the dataset is now available for multimodal learning
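The imputation pipeline above can be sketched end to end on scalar features. Note the swap: a least-squares line replaces the autoencoder purely to keep the example short, and the keys `a` and `b` and the toy values are hypothetical.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = slope*x + intercept, standing in for the AE."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def impute(samples):
    # Step 1: keep only modality-complete samples.
    complete = [s for s in samples if s["b"] is not None]
    # Step 2: learn to predict modality b from modality a.
    slope, intercept = fit_linear([s["a"] for s in complete],
                                  [s["b"] for s in complete])
    # Step 3: fill in modality b wherever it is missing.
    return [
        dict(s, b=slope * s["a"] + intercept) if s["b"] is None else dict(s)
        for s in samples
    ]


samples = [
    {"a": 1.0, "b": 3.0},
    {"a": 2.0, "b": 5.0},
    {"a": 3.0, "b": None},  # modality b is missing here
]
imputed = impute(samples)
```

The imputation model only ever sees the modality-complete subset, which is why this family of baselines struggles when that subset is a small fraction of the data.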
Multimodal Variational Autoencoder (MVAE)
 Linear evaluation protocol: First train MVAE using all the modalities → Freeze
MVAE and train a randomly initialized linear classifier
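The linear evaluation protocol can be illustrated with a frozen feature extractor and a logistic-regression probe. This is a sketch only: `frozen_encoder` is a fixed hand-written map standing in for a pretrained MVAE encoder, and the toy inputs, labels, and hyperparameters are assumptions.

```python
import math


def frozen_encoder(x):
    """Stand-in for a pretrained, frozen encoder (hypothetical fixed map)."""
    return [x[0] + x[1], x[0] - x[1]]


def train_linear_probe(inputs, labels, lr=0.5, epochs=500):
    """Train only a linear classifier on top of the frozen features."""
    feats = [frozen_encoder(x) for x in inputs]  # the encoder is never updated
    w, b = [0.0, 0.0], 0.0  # zero-initialized here for reproducibility
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = w[0] * f[0] + w[1] * f[1] + b
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient step on the logistic loss for the probe parameters only.
            w = [wi - lr * (p - y) * fi for wi, fi in zip(w, f)]
            b -= lr * (p - y)
    return w, b


def predict(x, w, b):
    f = frozen_encoder(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0


inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
w, b = train_linear_probe(inputs, labels)
predictions = [predict(x, w, b) for x in inputs]
```

Because only the probe is trained, the resulting accuracy reflects how linearly separable the frozen representation already is.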
Results
[Results for Multimodal IMDb (MM-IMDb) and CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)]
Results
[Results for Audiovision-MNIST (av-MNIST) and CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)]
Thank You!
Sangmin Woo
sangminwoo.github.io
smwoo95@kaist.ac.k
Q&A

Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 

Multimodal Learning with Severely Missing Modality.pptx

  • 1. Multimodal Learning with Severely Missing Modality (AAAI 2021)
    2022-04-21, Sangmin Woo
    Computational Intelligence Lab., School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)
  • 2. Background: Multimodal Learning
    Multimodal learning utilizes complementary information contained in multimodal data to improve the performance of various computer vision tasks.
    Modality Fusion
      - Early fusion is a common method that fuses different modalities by feature concatenation
      - Product operation allows more interactions among different modalities during the fusion process
    Missing Modalities for Multimodal Learning
      - Testing-time modality missing [1]
      - Learning with data from unpaired modalities [2]
    [1] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
    [2] Pham, H., et al. Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
  • 3. Background: Meta-regularization
    Meta Learning
      - Meta-learning algorithms focus on designing models that are able to learn new knowledge and adapt to novel environments quickly with only a few training samples
      - E.g., metric learning, probabilistic modeling, optimization-based approaches (e.g., MAML)
      - MAML is compatible with models that learn through gradient descent
      - This work extends MAML by learning two auxiliary networks for missing modality reconstruction and feature regularization
    Regularization
      - Conventional regularization techniques regularize model parameters to avoid overfitting and increase interpretability
      - Other than perturbing features, this work regularizes the feature by learning to reduce the discrepancy between the reconstructed and true modality
  • 4. Background: Multimodal Generative Models
    Generative Models for Multimodal Learning
      - Cross-modal generation approaches learn a conditional generative model over all modalities
      - E.g., conditional VAE (CVAE), conditional multimodal auto-encoder
      - Joint-modal generation approaches learn the joint distribution of multimodal data
      - E.g., multimodal variational autoencoder (MVAE), multimodal VAE (JM-VAE)
    [1] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A. Y. Multimodal deep learning. ICML 2011.
    [2] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
    [3] Pham, H., et al. Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
  • 5. Multimodal Learning
      - A common assumption in multimodal learning is the completeness of training data, i.e., full modalities are available in all training examples [1]
      - However, such an assumption may not always hold in the real world due to privacy concerns or budget limitations
      - Incompleteness of test modalities [2, 3]
      - Incompleteness of train modalities X
    Question: can we learn a multimodal model from an incomplete dataset while keeping its performance as close as possible to that of a model that learns from a full-modality dataset?
    [1] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A. Y. Multimodal deep learning. ICML 2011.
    [2] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
    [3] Pham, H., et al. Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
  • 6. Multimodal Learning
    Multimodal Learning Configurations
      - In [1], modalities are partially missing in testing examples
      - In [2], modalities are unpaired in training examples
      - This work considers an even more challenging setting where both training and testing data contain samples that have missing modalities
    [1] Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., Salakhutdinov, R. Learning Factorized Multimodal Representations. ICLR 2019.
    [2] Pham, H., et al. Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019.
  • 7. Overview
    Problem
      - Consider a multimodal dataset containing two modalities, where one modality is severely missing (e.g., 90%)
      - Objective: build a unified model that can handle missing modalities in training, testing, or both, and that achieves performance comparable to a model trained on a full-modality dataset
    Two Perspectives to Address the Problem
      - Flexibility: how to uniformly handle missing modality in training, testing, or both?
      - Efficiency: how to improve training efficiency when most of the data suffers from missing modality?
    Approach
      - Bayesian meta-learning framework
      - The key idea is to perturb the latent feature space so that embeddings of a single modality can approximate those of full modality
      - Better than typical generative methods (e.g., AE, VAE, GAN), since they often require a significant amount of full-modality data to learn from
  • 8. Flexibility & Efficiency
    Flexibility
      - Employ a feature reconstruction network that leverages the available modality to generate an approximation of the missing modality feature
      - This generates complete data in the feature space
      - When training, the model can exploit the full potential of both modality-complete and modality-incomplete data
      - When testing, by turning the feature reconstruction network on or off, the model can handle modality-complete or modality-incomplete inputs in a unified manner
    Efficiency
      - In the severely missing modality setting, the feature reconstruction network would be highly bias-prone, which yields degraded, low-quality feature generation
      - Directly training a model on degraded, low-quality features hinders the efficiency of the training process
      - A feature regularization approach is adopted to address this issue
      - The idea is to leverage a Bayesian neural network to assess the data uncertainty by performing feature perturbations
  • 9. Overview
    Approach
      - Bayesian meta-learning framework
      - The key idea is to perturb the latent feature space so that embeddings of a single modality can approximate those of full modality
      - Better than typical generative methods (e.g., AE, VAE, GAN), since they often require a significant amount of full-modality data to learn from
  • 10. Dataset
    Multimodal IMDb (MM-IMDb)
      - Modalities: image, text
      - Predict movie genre using image or text modality
      - Multi-label classification (multiple genres could be assigned to a single movie)
      - 25,956 movies
      - 23 classes
      - Evaluation metrics: F1 Samples and F1 Micro
    CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
      - Modalities: image, text, audio
      - Predict the sentiment class of the clips
      - Binary classification (negative / positive)
      - 2,199 opinion video clips (from YouTube movie reviews)
      - Evaluation metrics: F1 Score
    Audiovision-MNIST (av-MNIST)
      - Modalities: image, audio
      - Digit classification (0 to 9)
      - 1,500 samples with image and audio modalities
      - Evaluation metrics: Accuracy
  • 11. Baseline
    Lower-bound
      - Model trained using a single modality of the data
      - i.e., 100% image or 100% text
    Upper-bound
      - Model trained using all modalities of the data
      - i.e., 100% image and 100% text
    Autoencoder (AE) / GAN
      - First, sample a dataset containing only modality-complete samples from the original dataset
      - Then, assume one modality is missing and train the AE to reconstruct the missing modality
      - Finally, impute the missing modality of modality-incomplete data using the trained AE
      - After the imputation is finished, the dataset is available for multimodal learning
    Multimodal Variational Autoencoder (MVAE)
      - Linear evaluation protocol: first train the MVAE using all the modalities, then freeze the MVAE and train a randomly initialized linear classifier
  • 12. Results
    [Result figures: Multimodal IMDb (MM-IMDb); CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)]
  • 13. Results
    [Result figures: Audiovision-MNIST (avMNIST); CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)]
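
The two fusion strategies named on the background slide (early fusion by feature concatenation, and a product operation) can be illustrated with a minimal sketch. The slides do not say which product is meant, so the flattened outer product below is an assumption, and the function names `early_fusion` and `product_fusion` are hypothetical.

```python
from typing import List, Sequence


def early_fusion(image_feat: Sequence[float], text_feat: Sequence[float]) -> List[float]:
    """Early fusion: concatenate the two modality feature vectors."""
    return list(image_feat) + list(text_feat)


def product_fusion(image_feat: Sequence[float], text_feat: Sequence[float]) -> List[float]:
    """Product fusion (assumed here to be a flattened outer product).

    Every image dimension is multiplied with every text dimension, so the
    fused vector contains all pairwise interactions between the modalities.
    """
    return [a * b for a in image_feat for b in text_feat]


image = [1.0, 2.0]
text = [3.0, 4.0]
concatenated = early_fusion(image, text)   # length 2 + 2
interactions = product_fusion(image, text)  # length 2 * 2
```

Concatenation grows linearly with the feature sizes, while the outer product grows multiplicatively, which is the usual cost of modelling more cross-modal interactions.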
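
The imputation baseline on the Baseline slide follows three steps: keep the modality-complete samples, train a reconstructor on them, then fill in the missing modality elsewhere. The sketch below mirrors that pipeline on toy scalar features. It swaps the autoencoder for a one-variable least-squares fit purely to stay self-contained, so it is not the baseline's actual model, and the names `fit_linear` and `impute_missing` are hypothetical.

```python
from typing import List, Optional, Tuple

# (image feature, text feature or None when the text modality is missing)
Sample = Tuple[float, Optional[float]]


def fit_linear(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of text ~ slope * image + intercept (stand-in for the AE)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = cov_xy / var_x if var_x > 0 else 0.0
    return slope, mean_y - slope * mean_x


def impute_missing(dataset: List[Sample]) -> List[Tuple[float, float]]:
    # Step 1: keep only modality-complete samples.
    complete = [(x, y) for x, y in dataset if y is not None]
    # Step 2: train the reconstructor on the complete subset.
    slope, intercept = fit_linear(complete)
    # Step 3: impute the missing modality; observed values are left untouched.
    return [(x, y if y is not None else slope * x + intercept) for x, y in dataset]


toy = [(1.0, 2.0), (2.0, 4.0), (3.0, None), (4.0, None)]
imputed = impute_missing(toy)
```

The reconstructor only ever sees the complete subset, so when that subset is small (the severely missing setting) its estimates become unreliable. This is the weakness the slides attribute to generative imputation methods.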
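
The Dataset slide lists F1 Samples and F1 Micro as the MM-IMDb metrics without defining them. The sketch below uses the standard multi-label definitions: F1 Samples averages a per-example F1, while F1 Micro pools true positives, false positives, and false negatives over all examples first. The genre labels are made up for illustration, and scoring an example with no true and no predicted labels as 0.0 is an assumption.

```python
from typing import List, Set


def f1_samples(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Mean of the per-example F1 scores."""
    scores = []
    for truth, pred in zip(y_true, y_pred):
        denom = len(truth) + len(pred)
        # Assumption: an example with no true and no predicted labels scores 0.0.
        scores.append(2 * len(truth & pred) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def f1_micro(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """F1 computed from counts pooled over all examples and labels."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


truth = [{"drama", "romance", "war"}, {"comedy"}]
preds = [{"drama", "romance", "war"}, {"horror"}]
samples_score = f1_samples(truth, preds)  # (1.0 + 0.0) / 2 = 0.5
micro_score = f1_micro(truth, preds)      # 2*3 / (2*3 + 1 + 1) = 0.75
```

The two metrics differ here because F1 Micro weights the movie with three genres more heavily, whereas F1 Samples gives each movie equal weight.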

Editor's Notes

  1. Thank you.