VISION TRANSFORMER ADAPTER FOR DENSE PREDICTIONS
Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
ICLR 2023
Jun Kimata (Nagoya Institute of Technology)
2023/11/30
Overview
◼Proposes an adapter for adapting the Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021] to dense prediction tasks
• Conventional models use vision-specific backbones
• The proposed method uses a plain ViT
• The proposed method also works with multi-modal backbones
To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. Code and models will be released at https://github.com/czczup/ViT-Adapter.
(Figure: paradigm comparison. (a) Previous paradigm: a vision-specific transformer is pre-trained on image-modality data (SL/SSL), then fine-tuned for Det/Seg on COCO or ADE20K. (b) Our paradigm: a plain ViT is pre-trained on multi-modal data (image, video, text, ...), then fine-tuned with the adapter for Det/Seg on COCO or ADE20K.)
Related Work
◼Dense prediction tasks rely on features at multiple scales
◼ViTDet [Li+, arXiv 2022]
• Builds a feature pyramid from ViT's single-scale output via upsampling and downsampling
◼Proposed method
• The adapter additionally takes the input image itself into account
Proposed Method
◼Attach an adapter to a plain ViT
◼The adapter consists of three kinds of modules (see the sketch after the figure)
(Figure 4(a)(b): the plain ViT — patch embedding, position embedding, and encoder blocks 1..N — alongside the ViT-Adapter, whose spatial prior module, injectors, and extractors interact with each block group and feed the Det/Seg heads.)
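To make the interaction concrete, here is a minimal PyTorch sketch of how the adapter interleaves with the ViT blocks. This is not the released code: the class names, constructor arguments, and module internals are hypothetical stand-ins.

```python
# Minimal sketch (hypothetical, not the released implementation) of how the
# adapter wraps a plain ViT: N block groups, each preceded by an injector
# and followed by an extractor.
import torch.nn as nn

class ViTAdapterBackbone(nn.Module):
    def __init__(self, patch_embed, vit_blocks, injectors, extractors, spm):
        super().__init__()
        self.patch_embed = patch_embed               # ViT patch embedding
        self.spm = spm                               # spatial prior module (conv stem)
        self.vit_blocks = nn.ModuleList(vit_blocks)  # N groups of ViT encoder layers
        self.injectors = nn.ModuleList(injectors)    # one injector per group
        self.extractors = nn.ModuleList(extractors)  # one extractor per group

    def forward(self, image):
        f_vit = self.patch_embed(image)              # single-scale ViT tokens
        f_sp = self.spm(image)                       # flattened multi-scale features
        for blocks, inj, ext in zip(self.vit_blocks, self.injectors, self.extractors):
            f_vit = inj(f_vit, f_sp)                 # inject spatial priors into ViT
            f_vit = blocks(f_vit)                    # run the plain ViT layers
            f_sp = ext(f_sp, f_vit)                  # refresh the multi-scale tokens
        return f_sp                                  # split/reshaped into a pyramid for Det/Seg
```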
Spatial Prior Module
◼Convolutions excel at capturing local spatial information
• Composed of a stem of three convolutions plus max pooling
• Generates feature maps at different resolutions and flattens them (sketch after the figure)
(Figure 4(c): the spatial prior module — a convolutional stem produces feature maps F1, F2, F3 at decreasing resolutions, which are flattened and concatenated into the spatial feature F_sp.)
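A minimal PyTorch sketch of this module, assuming the structure described on the slide (a stem of three 3x3 convolutions plus max pooling, then stride-2 convolutions for the lower resolutions). The channel counts and the shared 1x1 projection are placeholder choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpatialPriorModule(nn.Module):
    def __init__(self, vit_dim=768, ch=64):
        super().__init__()
        # Stem: three 3x3 convolutions + max pooling -> 1/4 resolution.
        self.stem = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # Stride-2 convolutions give the 1/8, 1/16, and 1/32 feature maps.
        self.downs = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(3)
        )
        self.proj = nn.Conv2d(ch, vit_dim, 1)       # project to the ViT width

    def forward(self, x):
        c = self.stem(x)                             # (B, ch, H/4, W/4)
        tokens = []
        for down in self.downs:
            c = down(c)                              # halve the resolution each step
            f = self.proj(c)                         # (B, vit_dim, h, w)
            tokens.append(f.flatten(2).transpose(1, 2))  # flatten to (B, h*w, vit_dim)
        return torch.cat(tokens, dim=1)              # F_sp: concatenated multi-scale tokens
```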
Spatial Feature Injector
◼Injects the spatial priors into ViT
• A learnable scale γ keeps the injection from drastically shifting ViT's feature distribution (sketch after the figure)
(Figure 4(d): the spatial feature injector — cross-attention with the ViT feature F_vit as query and the spatial feature F_sp as key/value, added back onto F_vit.)
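In formula form, the update is F_vit ← F_vit + γ · Attention(norm(F_vit), norm(F_sp)), where γ is a learnable vector initialized to zero so that fine-tuning starts from the unmodified ViT. A minimal sketch, with standard multi-head attention standing in for the sparse (deformable) attention the paper actually uses:

```python
import torch
import torch.nn as nn

class Injector(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Stand-in: the paper uses deformable attention here.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # gamma starts at zero, so initially the ViT features pass through unchanged.
        self.gamma = nn.Parameter(torch.zeros(dim))

    def forward(self, f_vit, f_sp):
        q = self.norm_q(f_vit)                   # query: ViT tokens
        kv = self.norm_kv(f_sp)                  # key/value: spatial prior tokens
        out, _ = self.attn(q, kv, kv)
        return f_vit + self.gamma * out          # gated residual injection
```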
Multi-Scale Feature Extractor
◼Composed of cross-attention and an FFN
• Extracts multi-scale features from ViT's single-scale features (sketch after the figure)
(Figure 4(e): the multi-scale feature extractor — cross-attention with F_sp as query and F_vit as key/value, followed by a norm and an FFN.)
Figure 4: Overall architecture of ViT-Adapter. (a) The ViT, whose encoder layers are divided into N (usually N = 4) equal blocks for feature interaction. (b) Our ViT-Adapter, which contains three key designs, including (c) a spatial prior module for modeling local spatial contexts from the input image, (d) a spatial feature injector for introducing spatial priors into ViT, and (e) a multi-scale feature extractor for reorganizing multi-scale features from the single-scale features of ViT.
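A matching sketch of the extractor under the same assumptions (standard attention instead of the paper's deformable attention, and a plain two-layer MLP instead of its convolutional FFN; the slim FFN ratio is a placeholder):

```python
import torch.nn as nn

class Extractor(nn.Module):
    def __init__(self, dim=768, num_heads=8, ffn_ratio=0.25):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(dim)
        hidden = int(dim * ffn_ratio)            # slim FFN keeps the adapter light
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, f_sp, f_vit):
        q = self.norm_q(f_sp)                    # query: multi-scale spatial tokens
        kv = self.norm_kv(f_vit)                 # key/value: ViT tokens
        out, _ = self.attn(q, kv, kv)
        f_sp = f_sp + out                        # pull ViT semantics into the pyramid
        return f_sp + self.ffn(self.norm_ffn(f_sp))  # FFN refinement
```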
Experiments
◼Tasks and datasets
• Object detection and instance segmentation
• COCO [Lin+, ECCV 2014]
• Semantic segmentation
• ADE20K [Zhou+, CVPR 2017]
◼Models
• Plain ViT
• Pre-trained on ImageNet-1K/22K
Object Detection
◼Results comparable to vision-specific transformers
◼Multi-modal pre-training improves performance further
Segmentation
◼Results comparable to vision-specific transformers
Feature Maps
Summary
◼Proposed ViT-Adapter
• Closes the gap between the plain ViT and vision-specific transformers on dense prediction tasks
◼Injects image-related inductive biases
• Provides the fine-grained multi-scale features that dense tasks require
◼Experiments
• Results match or exceed vision-specific methods
• Confirmed the usefulness of multi-modal pre-training
Parameter Count
(Plot: COCO box AP (%) vs. #parameters (M), comparing ViT-Adapter (ours) against PVT, Swin, PVTv2, and plain ViT at the T/S/B/L scales; multi-modal pre-training lifts ViT-Adapter-B to 51.2 AP.)
Method                  #Param   AP^b
PVTv2-B1                33.7M    44.9
ViT-T                   26.1M    40.2
ViT-Adapter-T (ours)    28.1M    46.0
PVTv2-B2                45.0M    47.8
Swin-T                  47.8M    46.0
ViT-S                   43.8M    44.0
ViT-Adapter-S (ours)    47.8M    48.2
Swin-B                  107.1M   48.6
ViT-B                   113.6M   45.8
ViT-Adapter-B (ours)    120.2M   49.6
ViT-Adapter-B^F (ours)  120.2M   51.2
ViT-L†                  337.3M   48.8
ViT-Adapter-L† (ours)   347.9M   52.1
...the plain ViT (i.e., vanilla transformer) has some nonnegligible advantages. A typical example lies in multi-modal pre-training. Stemming from the natural language processing (NLP) field, the transformer makes no assumption about the structure of its input data. Equipped with different tokenizers, e.g. patch embedding (Dosovitskiy et al., 2020), 3D patch embedding, and token embedding (Vaswani et al., 2017), vanilla transformers such as the plain ViT can use massive amounts of data for pre-training.