2. Overview
◼ Proposes an adapter for applying the Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021] to dense prediction tasks
• Backbones in prior work are specialized for images
• The proposed method uses a plain ViT as the backbone
• The proposed method therefore also works with a multi-modally pre-trained backbone
From the abstract: To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. Code and models will be released at https://github.com/czczup/ViT-Adapter.
[Figure: pre-training/fine-tuning paradigms. (a) Previous paradigm: a vision-specific transformer is pre-trained on the image modality (SL/SSL), then fine-tuned on COCO/ADE20K for Det/Seg. (b) Our paradigm: a plain ViT is pre-trained on multi-modal data (image, video, text, ...; SL/SSL), then fine-tuned together with the ViT-Adapter on COCO/ADE20K for Det/Seg.]
4. Proposed Method
◼ Adds an adapter to a plain ViT
◼ The adapter consists of three kinds of modules: the Spatial Prior Module, the Spatial Feature Injector, and the Multi-Scale Feature Extractor (a minimal sketch of how they interact follows the figure below)
[Figure 4 (a)(b): the plain ViT (patch embedding, position embedding, Blocks 1..N) alongside the ViT-Adapter, whose Spatial Prior Module, Injectors 1..N, and Extractors 1..N interact with the ViT blocks and feed the Det/Seg heads.]
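To make the interaction pattern concrete, the following minimal PyTorch sketch wires a plain ViT's blocks to per-group injector and extractor modules. It is an illustration under assumptions, not the released implementation: class and argument names (ViTAdapterSketch, patch_embed, spm, injectors, extractors) are placeholders, position embeddings and the final feature-pyramid reshaping are omitted, and the official code at https://github.com/czczup/ViT-Adapter differs in details.

```python
import torch.nn as nn


class ViTAdapterSketch(nn.Module):
    """Minimal sketch of the ViT-Adapter interaction scheme (illustrative only)."""

    def __init__(self, patch_embed, vit_blocks, spm, injectors, extractors):
        super().__init__()
        n = len(injectors)                   # number of interaction groups (usually 4)
        per_group = len(vit_blocks) // n     # plain-ViT blocks handled by each group
        self.patch_embed = patch_embed       # plain ViT patch embedding
        self.spm = spm                       # spatial prior module (convolutional stem)
        self.groups = nn.ModuleList(
            nn.Sequential(*vit_blocks[i * per_group:(i + 1) * per_group]) for i in range(n)
        )
        self.injectors = nn.ModuleList(injectors)
        self.extractors = nn.ModuleList(extractors)

    def forward(self, img):
        f_vit = self.patch_embed(img)        # single-scale ViT tokens
        f_sp = self.spm(img)                 # flattened multi-scale spatial prior tokens
        for inj, grp, ext in zip(self.injectors, self.groups, self.extractors):
            f_vit = inj(f_vit, f_sp)         # inject spatial priors into the ViT tokens
            f_vit = grp(f_vit)               # run this group's unchanged ViT blocks
            f_sp = ext(f_sp, f_vit)          # update multi-scale tokens from the ViT tokens
        return f_sp                          # later reshaped into a feature pyramid for Det/Seg
```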
5. Spatial Prior Module
◼ Convolutions are well suited to capturing local spatial information
• Composed of three convolutions and a max-pooling layer
• Generates feature maps at different resolutions and flattens them (a minimal sketch follows the figure below)
[Figure 4 (c): Spatial Prior Module — a stem and further convolutions produce feature maps F1, F2, F3 at different resolutions, which are flattened and concatenated into the spatial prior tokens F_sp.]
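A rough PyTorch sketch of such a module is below. Channel widths, the number of scales that are flattened, and the BatchNorm/ReLU choices are assumptions made for this illustration; only the overall shape (a three-convolution stem with max pooling, strided convolutions for coarser maps, then flattening) follows the slide.

```python
import torch
import torch.nn as nn


class SpatialPriorModuleSketch(nn.Module):
    """Convolutional stem producing multi-resolution priors (illustrative sketch)."""

    def __init__(self, in_ch=3, embed_dim=384):
        super().__init__()
        # Stem: three 3x3 convolutions plus max pooling -> 1/4 resolution
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Strided convolutions give progressively coarser maps: 1/8, 1/16, 1/32
        self.down = nn.ModuleList([
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.Conv2d(128, 256, 3, stride=2, padding=1),
            nn.Conv2d(256, 256, 3, stride=2, padding=1),
        ])
        # 1x1 convolutions project each scale to the ViT embedding dimension
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, 1) for c in (128, 256, 256)])

    def forward(self, x):
        feat = self.stem(x)                                       # 1/4 resolution
        tokens = []
        for down, proj in zip(self.down, self.proj):
            feat = down(feat)                                     # 1/8 -> 1/16 -> 1/32
            tokens.append(proj(feat).flatten(2).transpose(1, 2))  # (B, H*W, C) per scale
        return torch.cat(tokens, dim=1)                           # concatenated spatial prior tokens F_sp
```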
6. Spatial Feature Injector
◼ Injects the spatial priors into the ViT
• A learnable scale γ keeps the injection from drastically shifting the ViT's feature distribution (a minimal sketch follows the figure below)
[Figure 4 (d): Spatial Feature Injector — cross-attention with the ViT tokens F_vit^i as the query and the spatial prior tokens F_sp^i as the key/value; the output is added back to F_vit^i.]
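In code form, the injector amounts to a gated cross-attention residual: the ViT tokens query the spatial prior tokens, and the result is scaled by a learnable γ (initialized to zero) before being added back, as the slide notes. The sketch below uses dense multi-head attention purely for illustration; the paper relies on a sparse attention for efficiency, and the layer sizes here are assumed values.

```python
import torch
import torch.nn as nn


class SpatialFeatureInjectorSketch(nn.Module):
    """Inject spatial priors into ViT tokens via cross-attention (illustrative sketch)."""

    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # gamma starts at zero so the ViT's feature distribution is unchanged at first
        self.gamma = nn.Parameter(torch.zeros(dim))

    def forward(self, f_vit, f_sp):
        q = self.norm_q(f_vit)                 # queries: ViT tokens
        kv = self.norm_kv(f_sp)                # keys/values: spatial prior tokens
        injected, _ = self.attn(q, kv, kv)
        return f_vit + self.gamma * injected   # gated residual update
```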
7. Multi-Scale Feature Extractor
◼ Composed of cross-attention and an FFN
• Extracts multi-scale features from the ViT's single-scale features (a minimal sketch follows the figure below)
[Figure 4 (e): Multi-Scale Feature Extractor — cross-attention with the spatial tokens F_sp^i as the query and the ViT tokens F_vit^i as the key/value, followed by a norm and an FFN.]
Figure 4: Overall architecture of ViT-Adapter. (a) The ViT, whose encoder layers are divided into N (usually N = 4) equal blocks for feature interaction. (b) Our ViT-Adapter, which contains three key designs, including (c) a spatial prior module for modeling local spatial contexts from the input image, (d) a spatial feature injector for introducing spatial priors into ViT, and (e) a multi-scale feature extractor for reorganizing multi-scale features from the single-scale features of ViT.
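The extractor mirrors the injector with the roles swapped, then refines the result with an FFN: the spatial tokens query the ViT tokens, and both the cross-attention and the FFN are residual. The sketch below again substitutes dense multi-head attention for the paper's sparse attention, and the FFN width is an assumed value.

```python
import torch.nn as nn


class MultiScaleFeatureExtractorSketch(nn.Module):
    """Update multi-scale tokens from ViT tokens via cross-attention + FFN (illustrative sketch)."""

    def __init__(self, dim=384, num_heads=6, ffn_ratio=0.25):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(dim)
        hidden = int(dim * ffn_ratio)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, f_sp, f_vit):
        q = self.norm_q(f_sp)                          # queries: multi-scale spatial tokens
        kv = self.norm_kv(f_vit)                       # keys/values: ViT tokens
        f_sp = f_sp + self.attn(q, kv, kv)[0]          # cross-attention residual
        f_sp = f_sp + self.ffn(self.norm_ffn(f_sp))    # feed-forward residual
        return f_sp
```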
13. Number of Parameters
[Plot: COCO box AP (%) vs. number of parameters (M) for ViT-Adapter (ours), PVT, Swin, PVTv2, and plain ViT at T/S/B/L sizes; the annotation "51.2" marks ViT-Adapter-B with multi-modal pre-training.]
Method                    #Param    AP^b (COCO box AP)
PVTv2-B1                  33.7M     44.9
ViT-T                     26.1M     40.2
ViT-Adapter-T (ours)      28.1M     46.0
PVTv2-B2                  45.0M     47.8
Swin-T                    47.8M     46.0
ViT-S                     43.8M     44.0
ViT-Adapter-S (ours)      47.8M     48.2
Swin-B                    107.1M    48.6
ViT-B                     113.6M    45.8
ViT-Adapter-B (ours)      120.2M    49.6
ViT-Adapter-B^F (ours)    120.2M    51.2
ViT-L†                    337.3M    48.8
ViT-Adapter-L† (ours)     347.9M    52.1