mPLUG is a new vision-language pre-trained model proposed by the authors that achieves state-of-the-art performance on various vision-language tasks through an asymmetric architecture using novel cross-modal skip connections. The model introduces skip-connected fusion blocks to address information asymmetry and computation inefficiency problems in multi-modal fusion. mPLUG is pre-trained using contrastive learning on image-text pairs and masked language modeling, and shows strong zero-shot transfer ability on tasks like image captioning and image-text retrieval. Evaluation shows mPLUG outperforms prior work on tasks including visual question answering, image captioning, image-text retrieval, visual grounding and visual reasoning.
Introduction: Contributions
• We propose mPLUG, a unified vision-language pre-trained model for both cross-modal understanding and generation, designed for effectiveness and efficiency in cross-modal learning.
• We introduce a new asymmetric vision-language architecture with novel cross-modal skip-connections to address two fundamental problems in multi-modal fusion: information asymmetry and computation inefficiency.
• mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks, including image captioning, image-text retrieval, visual grounding, and visual question answering, along with strong zero-shot transfer ability.
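The asymmetric skip-connected fusion described in the second bullet can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact implementation: the layer count, the `skip_every` interval, and the single-head, unprojected attention are all assumptions made for brevity. The key idea shown is that text tokens pass through every fusion layer, while visual features are injected only at every few layers through cross-attention with a residual (skip) connection, which reduces computation and mitigates the information asymmetry between modalities.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def skip_connected_fusion(text, vision, n_layers=6, skip_every=3):
    """Hypothetical sketch of an asymmetric skip-connected fusion stack.

    text:   (n_text_tokens, dim) text features
    vision: (n_vis_tokens, dim) visual features
    Text runs through self-attention in every layer; vision is fused in
    only every `skip_every` layers, with residual connections carrying
    the pre-fusion text features forward.
    """
    h = text
    for layer in range(1, n_layers + 1):
        h = h + attention(h, h, h)                    # text self-attention + residual
        if layer % skip_every == 0:                   # sparse cross-modal fusion
            h = h + attention(h, vision, vision)      # cross-attention + skip connection
    return h
```

In this sketch, skipping cross-attention in two out of every three layers cuts the text-to-vision attention cost by roughly that fraction, which is the efficiency motivation the bullets refer to.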
Conclusion
• Presents mPLUG, an effective and efficient VLP framework with novel cross-modal skip-connections for both cross-modal understanding and generation.
• mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks, along with strong zero-shot transfer ability.