VLP: A Survey on Vision-
Language Pre-training
Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, Bo
Xu, MIR 2023
Takumi Fukuzawa (Tamaki Lab, Nagoya Institute of Technology)
2023/6/8
Overview
■ VLP (Vision-Language Pre-training)
• Pre-training jointly on vision and language
• Aligns words with objects and actions in images (and videos)
• Learned representations transfer well to downstream tasks
■ This talk surveys VLP methods from five perspectives:
• Feature extraction
• Model architectures
• Pre-training tasks
• Pre-training datasets
• Downstream tasks
Feature Extraction
■ Images: three common routes (sketched after this list)
• OD-based Region Features (OD-RFs)
• Region features from a pre-trained object detector
• Used in most earlier work [Lu+, NeurIPS2019][Li+, arXiv2019]
• Faster R-CNN [Ren+, NeurIPS2015] is the most common detector
• Slow to run, so the detector is usually frozen during pre-training
• CNN-based Grid Features (CNN-GFs)
• Grid features from a CNN backbone [Li+, AAAI2020][Wang+, arXiv2021]
• Trainable end-to-end
• ViT-based Patch Features (ViT-PFs)
• Patch features based on ViT [Dosovitskiy+, ICLR2021]
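A minimal PyTorch sketch of the three routes; the torchvision models and shapes are illustrative stand-ins, not the exact backbones used in the surveyed papers.

```python
import torch
import torchvision.models as tvm

image = torch.randn(1, 3, 224, 224)  # dummy RGB image batch

# (1) OD-RFs: a pre-trained detector, frozen during VLP pre-training.
detector = tvm.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
for p in detector.parameters():
    p.requires_grad_(False)          # frozen: too slow to fine-tune end-to-end
with torch.no_grad():
    boxes = detector([image[0]])[0]["boxes"]   # region proposals; region
                                               # features are RoI-pooled inside

# (2) CNN-GFs: grid features from a CNN backbone, trainable end-to-end.
resnet = tvm.resnet50(weights="IMAGENET1K_V2")
trunk = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop pool + fc
grid = trunk(image)                                # (1, 2048, 7, 7)
cnn_gfs = grid.flatten(2).transpose(1, 2)          # (1, 49, 2048) grid tokens

# (3) ViT-PFs: non-overlapping patches projected to token embeddings.
patchify = torch.nn.Conv2d(3, 768, kernel_size=16, stride=16)
vit_pfs = patchify(image).flatten(2).transpose(1, 2)  # (1, 196, 768) patches
```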
Feature Extraction
■ Videos (frame-wise sketch after this list)
• Frame features are extracted with the same techniques as for images
• CNN-GFs and ViT-PFs are the most common
• CNN-GFs
• ResNet pre-trained on ImageNet [Deng+, CVPR2009]
• SlowFast [Feichtenhofer+, ICCV2019] and I3D [Carreira&Zisserman, CVPR2017] pre-trained on Kinetics
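A minimal sketch of CNN-GFs for video, applying a 2D ImageNet backbone frame by frame; SlowFast and I3D instead operate on the clip as a 3D volume, so this frame-wise ResNet is a simplified stand-in.

```python
import torch
import torchvision.models as tvm

clip = torch.randn(8, 3, 224, 224)   # 8 frames sampled from one video
resnet = tvm.resnet50(weights="IMAGENET1K_V2")
trunk = torch.nn.Sequential(*list(resnet.children())[:-2])
with torch.no_grad():
    grids = trunk(clip)              # (8, 2048, 7, 7) per-frame grid features
frame_tokens = grids.mean(dim=(2, 3))  # (8, 2048): one feature per frame
```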
■ Text (embedding sketch below)
• Follows BERT [Devlin+, NAACL2019] and similar methods
• Split the text into a subword sequence; insert start and end tokens
• Sum the corresponding word embeddings and position embeddings
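A minimal sketch of BERT-style text features: subword tokenization with start/end tokens, then the sum of word and position embeddings. The embedding tables here are randomly initialized for illustration; actual VLP models load pre-trained BERT weights.

```python
import torch
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tok("a dog chasing a ball", return_tensors="pt")["input_ids"]
# ids now starts with [CLS] and ends with [SEP]

dim, max_len = 768, 512
word_emb = torch.nn.Embedding(tok.vocab_size, dim)
pos_emb = torch.nn.Embedding(max_len, dim)
positions = torch.arange(ids.size(1)).unsqueeze(0)     # (1, seq_len)
text_feats = word_emb(ids) + pos_emb(positions)        # (1, seq_len, 768)
```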
[Fig. 1: Illustration of two types of model architectures for VLP — (a) Single-Stream: one Self-Attn + Feedforward stack over the concatenated visual and textual features; (b) Dual-Stream: separate Self-Attn, Cross-Attn, and Feedforward stacks for the visual and textual features.]
Pre-training Model Architectures
■ Single-stream or dual-stream (see Fig. 1 above)
Pre-training Model Architectures
■ Single-stream (sketched below)
• [Li+, arXiv2019][Chen+, ECCV2020][Chen+, IEEE2022]
• Concatenates the textual and visual features
• Feeds them into a single Transformer block
• Uses the same parameters for both modalities
• More parameter-efficient
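A minimal sketch of single-stream fusion: concatenate the visual and textual tokens and run one shared Transformer encoder over the joint sequence; all shapes are illustrative.

```python
import torch

dim = 768
visual = torch.randn(1, 49, dim)    # e.g. CNN grid tokens
textual = torch.randn(1, 12, dim)   # e.g. BERT-style text tokens

layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=12,
                                         batch_first=True)
encoder = torch.nn.TransformerEncoder(layer, num_layers=6)
fused = encoder(torch.cat([visual, textual], dim=1))   # (1, 61, 768)
# One parameter set serves both modalities, hence the parameter efficiency.
```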
Pre-training Model Architectures
■ Dual-stream (sketched below)
• [Zhang+, ACM MM2022][Dou+, arXiv2021]
• Keeps the textual and visual features separate
• Feeds each modality into its own Transformer blocks
• No parameter sharing between the two streams
• Adds cross-attention between the streams for higher performance
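A minimal sketch of one dual-stream block: per-modality self-attention, then cross-attention where each stream queries the other; residual connections, layer norms, and feed-forward layers are omitted for brevity.

```python
import torch

dim, heads = 768, 12
visual = torch.randn(1, 49, dim)
textual = torch.randn(1, 12, dim)

self_v = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
self_t = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
cross_v = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
cross_t = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

v, _ = self_v(visual, visual, visual)
t, _ = self_t(textual, textual, textual)
v, _ = cross_v(v, t, t)   # visual queries attend to textual keys/values
t, _ = cross_t(t, v, v)   # textual queries attend to visual keys/values
```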
Pre-training Tasks
■ Completion type (see the sketch after this list)
• Reconstruct masked elements to understand each modality
• Masked Language Modeling, Prefix Language Modeling [Wang+, arXiv2021], Masked Vision Modeling [Chen+, ECCV2020]
■ Matching type
• Unify vision and language in a shared space to obtain universal vision-language representations
• Vision-Language Matching [Li+, arXiv2021], Vision-Language Contrastive Learning [Li+, arXiv2021], Word-Region Alignment [Chen+, ECCV2020]
■ Temporal type
• Reorder a shuffled input sequence
• Frame Order Modeling [Li+, EMNLP2020]
■ Particular types
• Others such as visual question answering and visual captioning
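A hedged sketch of two representative objectives: masked language modeling (completion type) and vision-language contrastive learning (matching type). All shapes, the vocabulary size, and the temperature are illustrative, not any specific paper's implementation.

```python
import torch
import torch.nn.functional as F

# Masked Language Modeling: predict vocabulary ids at masked text positions.
hidden = torch.randn(1, 12, 768)              # fused features, text positions
mlm_head = torch.nn.Linear(768, 30522)        # BERT-sized vocabulary
labels = torch.full((1, 12), -100)            # -100 = unmasked, ignored
labels[0, [2, 5, 9]] = torch.tensor([1037, 3899, 3608])  # masked targets
mlm_loss = F.cross_entropy(mlm_head(hidden).flatten(0, 1),
                           labels.flatten(), ignore_index=-100)

# Vision-Language Contrastive Learning: pull paired image/text embeddings
# together and push apart all other pairs in the batch (InfoNCE style).
img = F.normalize(torch.randn(8, 256), dim=-1)
txt = F.normalize(torch.randn(8, 256), dim=-1)
logits = img @ txt.t() / 0.07                 # temperature-scaled similarities
targets = torch.arange(8)                     # i-th image matches i-th text
vlc_loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```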
Pre-training Datasets
Table 1 Details of some popular pre-training datasets for VLP. Names of some datasets
are abbreviated for the convenience of subsequent description. FLKR represents Flickr30k,
and HT100M represents HowTo100M.
Dataset # Images # Image-text Pairs Duration (hrs) # Clips # Videos
SBU [44] 875K 875K - - -
FLKR [45] 29K 145K - - -
COCO [46] 113K 567K - - -
VG [47] 108K 5.4M - - -
VGQA [47] 108K 1.8M - - -
VQA [48] 83K 444K - - -
Matterport3D [49] 104K 104K - - -
FashionGen [50] 260K 260K - - -
CC3M [51] 3M 3M - - -
GQA [52] 82K 1M - - -
LAIT [53] 10M 10M - - -
CC12M [54] 12M 12M - - -
ALIGN [55] 1.8B 1.8B - - -
Kinetics400 [23] - - 817 306K 306K
TVQA [38] - - 461 22K 925
HT100M [56] - - 134K 136M 1.2M
WebVid2M [57] - - 13K 2.5M 2.5M
Image-Language Pre-training Datasets
■ SBU [Ordonez+, NeurIPS2011], Flickr30k [Young+, TACL2014]
• Labeled with human-written annotations
■ COCO [Lin+, ECCV2014]
• Caption labels written by multiple different annotators per image
■ CC3M [Sharma+, ACL2018], CC12M [Changpinyo+, CVPR2021]
• Images and descriptive text crawled from the web
• Noisy
■ Visual Genome [Krishna+, IJCV2017]
• Structured data used for visual question answering
• Question-answer pairs
Video-Language Pre-training Datasets
■ Kinetics-400 [Kay+, arXiv2017]
• 400 human action classes with at least 400 clips per class
■ HowTo100M [Miech+, ICCV2019]
• Narrated videos
• 136M captioned clips
■ WebVid-2M [Bain+, ICCV2021]
• 2.5M captioned clips
• Rich and diverse content
■ TVQA [Lei+, EMNLP2018]
• Built from TV shows
• A dataset of collected videos and dialogue
Downstream Tasks
[Fig. 2: Illustration of downstream tasks in VLP — Classification: Visual Question Answering (VQA), Visual Reasoning and Compositional Question Answering (GQA), Video-Language Inference (VLI), Natural Language for Visual Reasoning (NLVR), Visual Entailment (VE), Visual Commonsense Reasoning (VCR), Grounding Referring Expressions (GRE), Category Recognition (CR); Regression: Multi-modal Sentiment Analysis (MSA); Retrieval: Vision-Language Retrieval (VLR); Generation: Visual Captioning (VC), Novel Object Captioning at Scale (NoCaps), Visual Dialogue (VD); Other tasks: Multi-modal Machine Translation (MMT), Vision-Language Navigation (VLN), Optical Character Recognition (OCR).]
Downstream Tasks
■ Visual Question Answering (VQA)
• Given a visual input, provide an answer to a question about it (fine-tuning sketch after this list)
■ Video-Language Inference (VLI)
• Judge whether a textual hypothesis holds for a subtitled video clip
■ Multi-modal Sentiment Analysis (MSA)
• Detect sentiment in a video using multimodal information
■ Visual Dialogue (VD)
• Given an image and a question, answer the question in natural language
■ Multi-modal Machine Translation (MMT)
• Translate text with additional information from other modalities such as images
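A hedged sketch of VQA fine-tuning as answer classification: a small MLP head over the pooled multimodal representation and a fixed answer vocabulary (3,129 answers is the common VQA v2 setting); the head sizes are illustrative.

```python
import torch

fused_cls = torch.randn(4, 768)     # pooled multimodal features, batch of 4
num_answers = 3129

vqa_head = torch.nn.Sequential(
    torch.nn.Linear(768, 1536),
    torch.nn.GELU(),
    torch.nn.Linear(1536, num_answers),
)
logits = vqa_head(fused_cls)        # (4, 3129) scores over candidate answers
answers = logits.argmax(dim=-1)     # predicted answer ids
```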
SOTA: Image-Text VLP Models
Table 2 Summary of mainstream image-text VLP models. The number of downstream tasks determines whether the model is generic or domain-specific VLP. FE: Feature Extraction. PT: Pre-training. Emb: Embedding. SC in the Datasets column: self-constructed or self-collected. MTL in the Datasets column: all datasets for multi-task learning in the corresponding work. See Table 1 for other abbreviations in the Datasets column.
Model Domain Vision FE Language FE Multimodal Fusion Decoder PT Objectives PT Datasets Downstream Tasks
VisualBERT [9] Image OD-RFs Emb Single-stream No MLM+VLM COCO GRE+NLVR+VCR+VQA
ViLBERT [8] Image OD-RFs Emb Dual-stream No MLM+VLM+MVM COCO+VG VLR+NLVR+VE+VQA
LXMERT [108] Image OD-RFs+Xformer Xformer Dual-stream No MLM+VLM+MVM+VQA COCO+VG+VQA+GQA+VGQA GQA+NLVR+VQA
B2T2 [109] Image CNN-GFs Emb Single-stream No MLM+VLM CC3M VCR
Unicoder-VL [13] Image OD-RFs Emb Single-stream No MLM+VLM+MVM CC3M+SBU VLR+VCR
VL-BERT [110] Image OD-RFs Emb Single-stream No MLM+MVM CC3M GRE+VCR+VQA
VLP [111] Image OD-RFs Emb Dual-stream Yes MLM+LM CC3M VC+VQA
UNITER [30] Image OD-RFs Emb Single-stream No MLM+VLM+MVM+WRA COCO+VG+SBU+CC3M GRE+VLR+NLVR+VCR+VE+VQA
12-IN-1 [112] Image OD-RFs Emb Single-stream No MLM+MVM MTL GQA+GRE+VC+NLVR+VE+VQA
VisDial-BERT [113] Image OD-RFs Emb Dual-stream No MLM+VLM+MVM CC3M+VQA VD
ImageBERT [53] Image OD-RFs Emb Single-stream No MLM+VLM+MVM LAIT+CC3M+SBU VLR
PREVALENT [114] Image CNN-GFs+Xformer Xformer Single-stream No MLM+MVM Matterport3D VLN
XGPT [41] Image OD-RFs Emb Dual-stream Yes MLM+IDA+VC+TIFG CC3M VC+VLR
InterBERT [115] Image OD-RFs Emb Single-stream No MLM+VLM+MVM COCO+CC3M+SBU VLR+VCR
PixelBERT [116] Image CNN-GFs Emb Single-stream No MLM+VLM COCO+VG VLR+NLVR+VQA
OSCAR [10] Image OD-RFs Emb Single-stream No MLM+VLM COCO+SBU+CC3M+FLKR+VQA+GQA+VGQA GQA+VC+VLR+NLVR+NoCaps+VQA
VLN-BERT [117] Image OD-RFs Emb Dual-stream No MLM+VLM+MVM CC3M VLN
FashionBERT [118] Image Xformer Emb Single-stream No MLM+VLM+MVM FashionGen VLR
VILLA [119] Image OD-RFs+Xformer Xformer Single-stream No MLM+VLM+MVM COCO+VG+CC3M+SBU GRE+VLR+NLVR+VCR+VE+VQA
ERNIE-ViL [120] Image OD-RFs Emb Single-stream No MLM+MVM CC3M+SBU GRE+VLR+VCR+VQA
RVL-BERT [121] Image OD-RFs Emb Single-stream No MLM+VLM+MVM CC3M VC+VQA
VinVL [27] Image OD-RFs Emb Single-stream No MLM+VLM COCO+CC3M+SBU+FLKR+VQA+GQA+VGQA GQA+VC+VLR+NLVR+NoCaps+VQA
VL-T5 [122] Image OD-RFs Emb Single-stream Yes MLM+VLM+VQA+GRE+VC COCO+VG+VQA+GQA+VGQA GQA+GRE+VC+MMT+NLVR+VCR+VQA
ViLT [123] Image ViT-PFs Emb Single-stream No MLM+VLM COCO+VG+SBU+CC3M VLR+NLVR+VQA
ALIGN [55] Image CNN-GFs Xformer Dual-stream No VLC ALIGN VLR
Kaleido-BERT [43] Image CNN-GFs Emb Single-stream No MLM+VLM+AKPM FashionGen CR+VC+VLR
MDETR [42] Image Xformer Xformer Single-stream Yes OD+MLM+VLC COCO+VG+FLKR+GQA GQA+VQA
SOHO [124] Image CNN-GFs Emb Single-stream No MLM+VLM+MVM COCO+VG VLR+NLVR+VE+VQA
E2E-VLP [125] Image CNN-GFs Emb Single-stream Yes OD+MLM+VLM COCO+VG VC+VLR+NLVR+VQA
Visual Parsing [126] Image Xformer Emb Single-stream No MLM+VLM+MVM COCO+VG VLR+VCR+VE+VQA
CLIP-ViL [127] Image CNN-GFs Emb Single-stream Yes MLM+VLM+VQA COCO+VG+VQA+GQA+VGQA VE+VLN+VQA
ALBEF [35] Image Xformer Xformer Dual-stream No MLM+VLM+VLC COCO+VG+CC3M+SBU VLR+NLVR+VQA
SimVLM [14] Image CNN-GFs Emb Single-stream Yes PrefixLM ALIGN VC+NLVR+VE+VQA
MURAL [128] Image CNN-GFs Xformer Dual-stream No VLC CC12M+ALIGN VC+VLR
VLMO [129] Image ViT-PFs Emb Single-stream No MLM+VLC+VLM COCO+VG+CC3M+SBU VQA+NLVR+VLR
METER [33] Image Xformer Xformer Dual-stream No MLM+VLM COCO+VG+CC3M+SBU VLR+NLVR+VE+VQA
X-VLM [28] Image Xformer Xformer Single-stream No MLM+VLM+VG COCO+VG+CC3M+SBU VLR+NLVR+VE+VQA
TCL [130] Image Xformer Xformer Single-stream No MLM+VLM+TCL COCO+VG+CC3M+SBU VLR+NLVR+VE+VQA
SOTA: Video-Text VLP Models
Table 3 Summary of mainstream video-text VLP models. The number of downstream tasks determines whether the model is generic or domain-specific VLP. FE: Feature Extraction. PT: Pre-training. Emb: Embedding. SC in the Datasets column: self-constructed or self-collected. MTL in the Datasets column: all datasets for multi-task learning in the corresponding work. See Table 1 for other abbreviations in the Datasets column.
Model Domain Vision FE Language FE Multimodal Fusion Decoder PT Objectives PT Datasets Downstream Tasks
VideoBERT [58] Video CNN-GFs Emb Single-stream No MLM+VLM+MVM SC AC+VC
CBT [105] Video CNN-GFs+Xformer Xformer Single-stream No VLC Kinetics AC+AS+VC
UniVL [106] Video CNN-GFs Xformer Dual-stream Yes MLM+VLM+VC HT100M AS+ASL+MSA+VC+VLR
HERO [36] Video CNN-GFs+Xformer Xformer Single-stream No MLM+VLM+MVM+FOM HT100M+TV VC+VLI+VQA+VLR
MMFT-BERT [131] Video OD-RFs+Xformer Xformer Single-stream No VQA TV VQA
ActBERT [132] Video OD-RFs+CNN Emb Single-stream No MLM+VLM+MVM HT100M AS+ASL+VC+VQA+VLR
CLIP [16] Image / Video CNN/Xformer Xformer Dual-stream No VLC SC OCR +AC etc.
Frozen [57] Video ViT-PFs Emb Dual-Stream No VLC WebVid2M+CC3M VLR
Region-Learner [133] Video ViT-PFs Emb Dual-Stream No VLC WebVid2M+CC3M VLR
CLIP4Clip [17] Video ViT-PFs Emb Dual-Stream No VLC WebVid2M+CC3M VLR
CLIP2Video [18] Video ViT-PFs Emb Dual-Stream No VLC WebVid2M+CC3M VLR
Summary
■ Surveyed VLP methods and SOTA models from five perspectives:
• Feature extraction
• Model architectures
• Pre-training tasks
• Pre-training datasets
• Downstream tasks