mPLUG is a new vision-language pre-trained model proposed by the authors that achieves state-of-the-art performance on various vision-language tasks through an asymmetric architecture using novel cross-modal skip connections. The model introduces skip-connected fusion blocks to address information asymmetry and computation inefficiency problems in multi-modal fusion. mPLUG is pre-trained using contrastive learning on image-text pairs and masked language modeling, and shows strong zero-shot transfer ability on tasks like image captioning and image-text retrieval. Evaluation shows mPLUG outperforms prior work on tasks including visual question answering, image captioning, image-text retrieval, visual grounding and visual reasoning.
Introduction: Contributions
• We propose mPLUG, a unified vision-language pre-trained model for both cross-modal understanding and generation, designed for effectiveness and efficiency in cross-modal learning.
• We introduce a new asymmetric vision-language architecture with novel cross-modal skip-connections to address two fundamental problems in multi-modal fusion: information asymmetry and computation inefficiency.
• mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks, including image captioning, image-text retrieval, visual grounding, and visual question answering, along with strong zero-shot transfer ability.
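The asymmetric skip-connected fusion described in the second bullet can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact implementation: the layer count, the `skip_every` interval, and the single-head, unprojected attention are all assumptions made for brevity. The key idea shown is that text tokens pass through every fusion layer, while visual features are injected only at every few layers through cross-attention with a residual (skip) connection, which reduces computation and mitigates the information asymmetry between modalities.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def skip_connected_fusion(text, vision, n_layers=6, skip_every=3):
    """Hypothetical sketch of an asymmetric skip-connected fusion stack.

    text:   (n_text_tokens, dim) text features
    vision: (n_vis_tokens, dim) visual features
    Text runs through self-attention in every layer; vision is fused in
    only every `skip_every` layers, with residual connections carrying
    the pre-fusion text features forward.
    """
    h = text
    for layer in range(1, n_layers + 1):
        h = h + attention(h, h, h)                    # text self-attention + residual
        if layer % skip_every == 0:                   # sparse cross-modal fusion
            h = h + attention(h, vision, vision)      # cross-attention + skip connection
    return h
```

In this sketch, skipping cross-attention in two out of every three layers cuts the text-to-vision attention cost by roughly that fraction, which is the efficiency motivation the bullets refer to.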
Conclusion
• Presents mPLUG, an effective and efficient VLP framework with novel cross-modal skip-connections for both cross-modal understanding and generation.
• mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks, along with strong zero-shot transfer ability.