CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vision Study Group (2022/08/07)

AGREEMENT
• If you plan to share these slides or use their content in your own work,
please include the following reference:
Tejero-de-Pablos A. (2022) “Paper reading: Balanced multimodal learning via on-the-fly
gradient modulation”. The 11th All Japan Computer Vision Study Group.

CVPR 2022 paper reading session:
Balanced multimodal learning via
on-the-fly gradient modulation
2022/08/07
Antonio TEJERO DE PABLOS
antonio_tejero@cyberagent.co.jp

1. Self-introduction
2. Background
3. Paper introduction
4. Final remarks

Background
• Present: Research scientist @ CyberAgent (AI Lab)
• ~2021: Researcher @ U-Tokyo (Harada Lab) & RIKEN (AIP)
• ~2017: PhD @ NAIST (Yokoya Lab)
Research interests
• Learning of multimodal data (RGB, depth, audio, text)
• and its applications (action recognition, advertisement
classification, etc.)
Parents
Field: Computer vision

The power of multimodal data
• The real world is multimodal
• Understanding the world in a comprehensive way requires more than one sense
Driving: Image of the road + Voices of children
Diagnosis: Image of the heart + ECG signal

What is multimodal learning?
• Neural networks can learn from different types of data
But deciding when and how such data should be mixed is not trivial
Also, not all modalities are learned at the same rate → CHALLENGE
[Figure: alternative ways of mixing the modalities (Option 1, Option 2, etc.), each with the output label "Car"]

Paper introduction
Balanced Multimodal Learning via On-the-fly
Gradient Modulation
Peng, X., Wei, Y., Deng, A., Wang, D., & Hu, D. (2022)
In Proc. Computer Vision and Pattern
Recognition (pp. 8238-8247).

Problem setting
• Differences between modalities hinder simultaneous learning
Multimodal information is not fully utilized

Problem setting
• Intuitively, leveraging multiple modalities should increase the performance, however…
Unimodally trained representations can be stronger, because joint training leaves each modality under-optimized
Reason: different modalities converge at different rates → Balance speeds!
[Figure: common learning schema, with the labels "Modality A underfits the training data", "Modality B overfits the training data", and "Optimal learning point"]

Related work
• Gradient blending
Obtain an optimal blending of modalities based on their overfitting behaviors
Optimize a metric to understand the problem quantitatively: the overfitting-to-generalization ratio (OGR)
Wang, W., Tran, D., & Feiszli, M. (2020). What makes training multi-modal classification networks hard?
In Proc. Computer Vision and Pattern Recognition (pp. 12695-12705).
[Figure: three panels, "Uni-modal", "Multimodal", and "Multimodal w/ weighted blending"]
OGR between two training checkpoints measures the change in overfitting and generalization (a small ∆O/∆V is better)
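
The ratio can be illustrated with a small sketch. The slide only gives ∆O/∆V, so the definitions below are assumptions: overfitting O is taken as the gap between validation and training loss, and ∆V as the decrease in validation loss between the two checkpoints. The function names and the loss values in the usage lines are hypothetical.

```python
def overfitting(train_loss, val_loss):
    """Assumed definition: overfitting O is the validation/training loss gap."""
    return val_loss - train_loss


def ogr(ckpt_start, ckpt_end):
    """Overfitting-to-generalization ratio |dO / dV| between two checkpoints.

    Each checkpoint is a (train_loss, val_loss) pair.
    """
    delta_o = overfitting(*ckpt_end) - overfitting(*ckpt_start)
    # Assumed definition: generalization gain is the drop in validation loss.
    delta_v = ckpt_start[1] - ckpt_end[1]
    if delta_v == 0:
        return float("inf")  # no generalization gain at all
    return abs(delta_o / delta_v)


# Hypothetical loss values, for illustration only.
healthy = ogr((1.0, 1.2), (0.6, 0.9))       # validation loss drops a lot
overfitted = ogr((0.6, 0.9), (0.2, 0.85))   # mostly the training loss drops
```

With these made-up numbers, the second interval has the larger ratio: most of the training progress went into overfitting rather than into lower validation loss.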

Proposed method
• Pipeline of the On-the-fly Gradient Modulation with Generalization Enhancement (OGM-GE) strategy
Adaptively modulate the backward gradient according to the performance discrepancy between modalities
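
One way to make "performance discrepancy" concrete is sketched below. It assumes (this is not stated on the slide) that each modality's performance on a mini-batch is scored by the softmax probability that its own logits assign to the ground-truth class, and that the discrepancy is the ratio of the two scores. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modality_score(batch_logits, labels):
    """Sum over the batch of the probability given to the true class."""
    return sum(softmax(row)[y] for row, y in zip(batch_logits, labels))


def discrepancy_ratio(logits_a, logits_v, labels):
    """Ratio > 1 means modality "a" currently dominates modality "v"."""
    return modality_score(logits_a, labels) / modality_score(logits_v, labels)


# Toy batch with one sample and two classes (illustrative values only).
rho = discrepancy_ratio([[2.0, 0.0]], [[0.0, 0.0]], [0])
```

Here the first modality is confident about the correct class and the second is at chance level, so the ratio is above 1.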

Proposed method
• Step 1: On-the-fly gradient modulation
Stochastic Gradient Descent (for modality “u”)
• Step 2: Generalization enhancement
Step 1 may undermine the gradient noise
↓ ↓ ↓
The generalization ability of SGD is weakened
Solution:
Add random Gaussian noise
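
The two steps can be sketched together as follows. The equations are not reproduced in this text, so the forms used here are assumptions: a coefficient 1 - tanh(alpha * rho) applied only to the dominant modality, and zero-mean Gaussian noise whose scale is the standard deviation of the current gradient. `alpha`, `score_a`, `score_v`, and the function names are illustrative, not taken from the slide.

```python
import math
import random


def modulation_coefficients(score_a, score_v, alpha=0.5):
    """Step 1 (assumed form): damp only the modality that is ahead."""
    rho_a = score_a / score_v
    rho_v = score_v / score_a
    k_a = 1.0 - math.tanh(alpha * rho_a) if rho_a > 1.0 else 1.0
    k_v = 1.0 - math.tanh(alpha * rho_v) if rho_v > 1.0 else 1.0
    return k_a, k_v


def ogm_ge_step(params, grads, k, lr=1e-3, rng=None):
    """Step 2 (assumed form): SGD step on k * grad plus Gaussian noise."""
    if rng is None:
        rng = random.Random(0)
    mean = sum(grads) / len(grads)
    # Noise scale: standard deviation of the gradient entries (assumption).
    std = math.sqrt(sum((g - mean) ** 2 for g in grads) / len(grads))
    return [p - lr * (k * g + rng.gauss(0.0, std))
            for p, g in zip(params, grads)]


# Modality "a" is ahead, so only its gradient is scaled down.
k_a, k_v = modulation_coefficients(score_a=2.0, score_v=1.0, alpha=1.0)
new_params = ogm_ge_step([1.0, 2.0], [0.5, 0.5], k=k_v, lr=0.1)
```

Larger `alpha` damps the dominant modality more strongly, while the weaker modality always keeps its full gradient in this sketch.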

Experiments
• Datasets
CREMA-D: audio-visual (video) dataset for emotion recognition
Kinetics-Sounds: audio-visual (video) dataset for action recognition
VGGSound: audio-visual (video) dataset for event recognition
• Implementation
Encoders are ResNet18-based backbones. Input:
- Visual: Subsampled video frames (~3)
- Audio: Spectrogram transformation of the signal
Optimizer: SGD with 0.9 momentum, weight decay is 1e-4, learning rate is 1e-3
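
For reference, a single parameter update with these hyperparameters can be written out by hand. This is a minimal sketch that assumes the common formulation in which weight decay is added to the gradient before the momentum buffer is updated; the slide does not say which framework or variant was used, and the function name is hypothetical.

```python
def sgd_momentum_step(param, grad, velocity,
                      lr=1e-3, momentum=0.9, weight_decay=1e-4):
    """One scalar SGD update with momentum and L2 weight decay."""
    grad = grad + weight_decay * param        # weight decay as an L2 penalty
    velocity = momentum * velocity + grad     # momentum buffer
    param = param - lr * velocity
    return param, velocity


# Two consecutive updates on one scalar parameter (illustrative values).
p1, v1 = sgd_momentum_step(param=1.0, grad=0.5, velocity=0.0)
p2, v2 = sgd_momentum_step(param=p1, grad=0.5, velocity=v1)
```

With a constant gradient, the momentum buffer grows from the first step to the second, so the second update moves the parameter further than the first.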

Experiments
• Comparison on the multimodal task ( )
[Tables: comparison with conventional fusion methods; comparison with other modulation strategies; OGM-GE applied to recognition methods. Dataset: CREMA-D]

Conclusions
• The proposed method is effective in solving the optimization imbalance problem
Validation accuracy on VGGSound during training:
[Figure: three panels, "Audio modality", "Visual modality", and "Multimodal"]
• Limitations
The unimodal branches trained with OGM-GE could not outperform the base unimodal models
Other modalities and fusion strategies should be investigated
Class-wise performance is not addressed

Final remarks
• There are still many unsolved problems related to multimodal learning
Why can't multimodal models achieve optimal performance?
What kind of features from each modality is the model actually learning?
Is there a way to design an optimal fusion strategy for a given task and set of modalities?
• To researchers and professors interested in this topic:
Collaborations are very welcome!

Thank you very much!
Website: https://antonio-t.github.io/
E-mail: antonio_tejero@cyberagent.co.jp
