CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vision Study Group (2022/08/07)

  1. AGREEMENT • If you plan to share these slides or to use the content in these slides for your own work, please include the following reference: Tejero-de-Pablos A. (2022) “Paper reading: Balanced multimodal learning via on-the-fly gradient modulation”. The 11th All Japan Computer Vision Study Group.
  2. CVPR2022 Paper Reading Session: Balanced multimodal learning via on-the-fly gradient modulation 2022/08/07 Antonio TEJERO DE PABLOS antonio_tejero@cyberagent.co.jp
  3. 1. Self-introduction 2. Background 3. Paper introduction 4. Final remarks
  4. Self-introduction
  5. Antonio TEJERO DE PABLOS Field: Computer Vision. Background • Present: Research scientist @ CyberAgent (AI Lab) • ~2021: Researcher @ U-Tokyo (Harada Lab) & RIKEN (AIP) • ~2017: PhD @ NAIST (Yokoya Lab). Research interests • Learning of multimodal data (RGB, depth, audio, text) • and its applications (action recognition, advertisement classification, etc.)
  6. Background
  7. The power of multimodal data • The real world is multimodal • Understanding the world in a comprehensive way requires more than one sense. Driving: image of the road + voices of children. Diagnosis: image of the heart + ECG signal.
  8. What is multimodal learning? • Neural networks can learn different types of data. But deciding when and how such data should be mixed is not trivial. Also, not all modalities are learned at the same rate → CHALLENGE. [Figure: a multimodal network predicting “Car”, with alternative fusion options (Option 1, Option 2, etc.)]
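To make “how such data should be mixed” concrete, here is a minimal PyTorch sketch (mine, not from the slides) of one common option, late fusion by concatenation; all layer names and sizes are illustrative assumptions:

```python
# Minimal late-fusion sketch: one encoder per modality, features concatenated
# before a shared classification head. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, dim_a=128, dim_v=128, n_classes=10):
        super().__init__()
        self.enc_audio = nn.Sequential(nn.Linear(40, dim_a), nn.ReLU())    # e.g., 40 mel bands
        self.enc_visual = nn.Sequential(nn.Linear(2048, dim_v), nn.ReLU()) # e.g., pooled CNN features
        self.head = nn.Linear(dim_a + dim_v, n_classes)  # fusion happens here

    def forward(self, audio, visual):
        fused = torch.cat([self.enc_audio(audio), self.enc_visual(visual)], dim=-1)
        return self.head(fused)

logits = LateFusion()(torch.randn(4, 40), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 10])
```

Early fusion (concatenating raw inputs before a single encoder) is the other standard option; the choice between them is exactly the non-trivial design decision the slide points to.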
  9. Paper introduction: Balanced Multimodal Learning via On-the-fly Gradient Modulation. Peng, X., Wei, Y., Deng, A., Wang, D., & Hu, D. (2022). In Proc. Computer Vision and Pattern Recognition (pp. 8238-8247).
  10. Problem setting • Differences between modalities hinder simultaneous learning: multimodal information is not fully utilized.
  11. Problem setting • Intuitively, leveraging multiple modalities should increase performance; however, unimodal representations end up stronger due to sub-optimal joint learning. Reason: different modalities converge at different rates → balance the learning speeds! [Figure: common learning schema, where modality B overfits the training data while modality A still underfits; each modality has a different optimal learning point]
  12. Related work • Gradient blending: obtain an optimal blending of modalities based on their overfitting behaviors. Optimize a metric to understand the problem quantitatively: the overfitting-to-generalization ratio (OGR). OGR between two training checkpoints measures the change in overfitting relative to the change in generalization (small ∆O/∆G is better). [Figure: comparison of uni-modal, multimodal, and multimodal w/ weighted blending training] Wang, W., Tran, D., & Feiszli, M. (2020). What makes training multi-modal classification networks hard? In Proc. Computer Vision and Pattern Recognition (pp. 12695-12705).
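For concreteness, a hedged LaTeX sketch of the quantities behind OGR, reconstructed from my reading of Wang et al. (2020) rather than copied from the slide, so the exact notation may differ:

```latex
% Overfitting at checkpoint N: gap between validation and training loss.
% Between checkpoints N and N+n, compare the growth in overfitting to the
% gain in generalization (the drop in validation loss).
\[
O_N = L^{\mathrm{val}}_N - L^{\mathrm{train}}_N, \qquad
\Delta O = O_{N+n} - O_N, \qquad
\Delta G = L^{\mathrm{val}}_N - L^{\mathrm{val}}_{N+n}
\]
\[
\mathrm{OGR} = \left| \frac{\Delta O}{\Delta G} \right|
\qquad \text{(smaller is better)}
\]
```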
  13. Proposed method • Pipeline of the On-the-fly Gradient Modulation with Generalization Enhancement (OGM-GE) strategy: adaptively modulate the backward gradient according to the performance discrepancy between modalities.
  14. Proposed method • Step 1: On-the-fly gradient modulation. Scale the Stochastic Gradient Descent update for each modality “u” according to the performance discrepancy. • Step 2: Generalization enhancement. Step 1 may undermine the gradient noise, which weakens the generalization ability of SGD. Solution: add random Gaussian noise.
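A simplified Python sketch of the two steps as I understand them; the function name, the tanh-based coefficient, and the noise scale tied to the gradient’s standard deviation are my approximations of the paper’s formulation, not its exact code:

```python
# Hypothetical sketch of OGM-GE's two steps for one modality's gradient.
# Real code would apply this per parameter group of each modality encoder.
import torch

def ogm_ge_step(grad_u, score_u, score_other, alpha=0.1):
    # Step 1: on-the-fly gradient modulation.
    # If modality u dominates (higher confidence on the batch), shrink its
    # gradient so the weaker modality can catch up.
    rho = score_u / max(score_other, 1e-8)
    if rho > 1.0:
        k = 1.0 - torch.tanh(torch.tensor(alpha * rho))
        grad_u = grad_u * k

    # Step 2: generalization enhancement.
    # Modulation dampens SGD's implicit noise, so restore it with zero-mean
    # Gaussian noise scaled to the gradient's own spread (a simplification).
    noise = torch.randn_like(grad_u) * grad_u.std()
    return grad_u + noise

# Example: a dominant audio modality (confidence 0.8 vs. 0.3) gets damped.
g = ogm_ge_step(torch.randn(64, 512), score_u=0.8, score_other=0.3)
```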
  15. Experiments • Datasets: CREMA-D, an audio-visual (video) dataset for emotion recognition; Kinetics-Sounds, an audio-visual (video) dataset for action recognition; VGGSound, an audio-visual (video) dataset for event recognition. • Implementation: Encoders are ResNet18-based backbones. Input: visual, subsampled video frames (~3); audio, spectrogram transformation of the signal. Optimizer: SGD with 0.9 momentum, weight decay 1e-4, learning rate 1e-3.
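The optimizer setup translates directly into PyTorch; the hyperparameters below come from the slide, while the backbone and class count are placeholders for the paper’s actual audio/visual encoders:

```python
# Optimizer settings from the slide; the backbone and class count are
# placeholders (the paper uses ResNet18-based audio and visual encoders).
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=309)  # VGGSound is commonly cited as 309 classes (assumption)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,            # learning rate from the slide
    momentum=0.9,       # momentum from the slide
    weight_decay=1e-4,  # weight decay from the slide
)
```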
  16. Experiments • Comparison on the multimodal task (dataset: CREMA-D). [Tables: comparison with conventional fusion methods, with other modulation strategies, and applied to existing recognition methods]
  17. Conclusions • The proposed method is effective in solving the optimization imbalance problem. [Figure: validation accuracy on VGGSound during training, for the audio modality, the visual modality, and the multimodal model] • Limitations: OGM-GE unimodal could not outperform the base unimodal model; other modalities and fusion strategies should be investigated; class-wise performance is not addressed.
  18. Final remarks
  19. Final remarks • There are still many unsolved problems related to multimodal learning: Why can multimodal models not achieve optimal performance? What kind of features from each modality is the model actually learning? Is there a way to design an optimal fusion strategy for a given task and set of modalities? • To researchers and professors interested in this topic: collaborative research is very welcome!
  20. Thank you! Antonio TEJERO DE PABLOS Website: https://antonio-t.github.io/ E-mail: antonio_tejero@cyberagent.co.jp
