This document presents ViT-Adapter, a method that enhances the plain Vision Transformer (ViT) for dense prediction tasks, allowing it to match the performance of vision-specific transformers. It adds a pre-training-free adapter that introduces image-related inductive biases into the model, and it has achieved state-of-the-art results on object detection and segmentation benchmarks without using extra detection data. The study emphasizes the effectiveness of multi-modal pre-training for the plain ViT backbone and positions ViT-Adapter as a strong alternative to vision-specific transformers.
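To make the adapter idea concrete, below is a minimal, hypothetical sketch in PyTorch of the pattern this summary describes: a small convolutional branch supplies image-related inductive biases (local, spatial features), which are injected into a plain ViT's token stream via cross-attention, leaving the backbone's pre-training untouched. The real ViT-Adapter design is more elaborate (it uses a spatial prior module with deformable-attention injector/extractor blocks and produces multi-scale features); every class, name, and dimension here is illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class SpatialPriorModule(nn.Module):
    """Hypothetical convolutional stem producing spatial (image-related)
    features to inject into a plain, frozen-or-finetuned ViT."""
    def __init__(self, dim=768):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim // 4, kernel_size=3, stride=4, padding=1),
            nn.GELU(),
            nn.Conv2d(dim // 4, dim, kernel_size=3, stride=4, padding=1),
        )

    def forward(self, x):
        # (B, 3, H, W) -> (B, N_spatial, dim): flatten the feature map
        # into a sequence of spatial tokens.
        f = self.stem(x)
        return f.flatten(2).transpose(1, 2)

class Injector(nn.Module):
    """Cross-attention that adds spatial-prior information to ViT tokens.
    Zero-initialized gating keeps the adapter an identity map at the start,
    so the pre-trained backbone is not perturbed early in training."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))

    def forward(self, vit_tokens, spatial_tokens):
        out, _ = self.attn(vit_tokens, spatial_tokens, spatial_tokens)
        return vit_tokens + self.gamma * out

# Toy usage: interleave injectors with plain transformer blocks
# (nn.TransformerEncoderLayer stands in for real ViT blocks).
dim = 768
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(2)
)
spm, injector = SpatialPriorModule(dim), Injector(dim)

img = torch.randn(1, 3, 224, 224)
spatial = spm(img)                  # (1, 196, 768) spatial-prior tokens
tokens = torch.randn(1, 196, dim)   # stand-in for ViT patch embeddings
for blk in blocks:
    tokens = blk(injector(tokens, spatial))
```

The key design point this sketch tries to capture is that the adapter is trained from scratch alongside dense-prediction heads, while the ViT itself can keep whatever (e.g., multi-modal) pre-trained weights it started with.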