Vision Transformer Adapter for Dense Predictions
Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
ICLR 2023
Presented by Jun Kimata (Nagoya Institute of Technology)
2023/11/30
Overview
◼Proposes an adapter for adapting Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021]
to dense prediction tasks
• Conventional models use vision-specific backbones
• The proposed method uses a plain ViT
• The proposed backbone can also be pre-trained multi-modally
To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve
comparable performance to vision-specific transformers. Specifically, the backbone
in our framework is a plain ViT that can learn powerful representations from
large-scale multi-modal data. When transferring to downstream tasks, a pre-
training-free adapter is used to introduce the image-related inductive biases into
the model, making it suitable for these tasks. We verify ViT-Adapter on mul-
tiple dense prediction tasks, including object detection, instance segmentation,
and semantic segmentation. Notably, without using extra detection data, our ViT-
Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-
dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific
transformers and facilitate future research. Code and models will be released at
https://github.com/czczup/ViT-Adapter.
[Figure: (a) Previous paradigm: a vision-specific transformer is pre-trained on
image-modality data (SL/SSL), then fine-tuned on COCO/ADE20K for Det/Seg.
(b) Our paradigm: a plain ViT is pre-trained on multi-modal data (image, video,
text, ...; SL/SSL), then fine-tuned with the adapter on COCO/ADE20K for Det/Seg.]
Related Work
◼Dense prediction tasks require features at multiple scales
◼ViTDet [Li+, arXiv 2022]
• Builds a feature pyramid by up- and down-sampling ViT features
◼Proposed method
• The adapter additionally takes the input image itself into account
Proposed Method
◼Adds an adapter to a plain ViT
◼The adapter consists of three types of modules
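The interaction between the ViT and the adapter can be sketched as a loop over N block groups, with an injector before and an extractor after each group. This is an illustrative stand-in, not the authors' implementation: the class name, use of `nn.MultiheadAttention` in place of the paper's sparse attention, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AdapterLoopSketch(nn.Module):
    """Hypothetical top-level loop of ViT-Adapter: the ViT's blocks are
    split into N groups, and an injector/extractor pair wraps each group."""
    def __init__(self, dim=64, groups=4):
        super().__init__()
        self.vit_groups = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(groups))
        self.injectors = nn.ModuleList(
            nn.MultiheadAttention(dim, 4, batch_first=True)
            for _ in range(groups))
        self.extractors = nn.ModuleList(
            nn.MultiheadAttention(dim, 4, batch_first=True)
            for _ in range(groups))

    def forward(self, vit_tokens, spatial_tokens):
        for blk, inj, ext in zip(self.vit_groups, self.injectors,
                                 self.extractors):
            # Inject spatial priors into the ViT tokens, run the ViT group,
            # then extract updated features back into the spatial tokens.
            upd, _ = inj(vit_tokens, spatial_tokens, spatial_tokens)
            vit_tokens = blk(vit_tokens + upd)
            upd, _ = ext(spatial_tokens, vit_tokens, vit_tokens)
            spatial_tokens = spatial_tokens + upd
        return vit_tokens, spatial_tokens
```

The key design choice is that the two token streams stay separate: the ViT branch keeps its pre-trained single-scale representation, while the adapter branch accumulates the dense-task features.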
[Figure 4: Overall architecture of ViT-Adapter. (a) The ViT, whose encoder layers
are divided into N (usually N = 4) equal blocks for feature interaction. (b) Our
ViT-Adapter, which contains three key designs, including (c) a spatial prior module
for modeling local spatial contexts from the input image, (d) a spatial feature
injector for introducing spatial priors into ViT, and (e) a multi-scale feature
extractor for reorganizing multi-scale features from the single-scale features of ViT.]
Spatial Prior Module
◼Convolutions excel at capturing local spatial information
• Composed of three convolutions and max pooling
• Generates feature maps at different resolutions and flattens them
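A minimal sketch of this module, assuming a convolutional stem followed by strided convolutions that produce maps at 1/8, 1/16, and 1/32 resolution, which are flattened and concatenated into a token sequence. The class name and exact channel/stride choices are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class SpatialPriorModuleSketch(nn.Module):
    """Toy spatial prior module: conv stem + strided convs build a small
    feature pyramid, then each map is flattened into spatial tokens."""
    def __init__(self, dim=64):
        super().__init__()
        # Stem: three 3x3 convs plus max pooling -> 1/4 resolution.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))
        self.down1 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)  # 1/8
        self.down2 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)  # 1/16
        self.down3 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)  # 1/32

    def forward(self, x):
        c1 = self.stem(x)    # 1/4 resolution
        c2 = self.down1(c1)  # 1/8
        c3 = self.down2(c2)  # 1/16
        c4 = self.down3(c3)  # 1/32
        # Flatten each map (B, C, H, W) -> (B, H*W, C) and concatenate.
        tokens = [f.flatten(2).transpose(1, 2) for f in (c2, c3, c4)]
        return torch.cat(tokens, dim=1)
```

For a 224x224 input, this yields 28^2 + 14^2 + 7^2 = 1029 spatial tokens.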
Spatial Feature Injector
◼Injects the spatial priors into the ViT
• A learnable 𝛾 keeps the injection from drastically shifting the ViT's feature distribution
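The injection can be sketched as cross-attention where the ViT tokens are the query and the spatial tokens are the key/value, gated by a 𝛾 initialized to zero so the pre-trained ViT is untouched at the start of fine-tuning. This is an assumed simplification: the class name is hypothetical and standard multi-head attention stands in for the sparse attention used in the paper.

```python
import torch
import torch.nn as nn

class SpatialFeatureInjectorSketch(nn.Module):
    """Toy injector: ViT tokens query the spatial tokens; a zero-initialized
    learnable gamma gates how much spatial prior is added."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))  # zero-init gate

    def forward(self, vit_tokens, spatial_tokens):
        q = self.norm_q(vit_tokens)
        kv = self.norm_kv(spatial_tokens)
        update, _ = self.attn(q, kv, kv)
        # gamma starts at zero, so initially the module is an identity map.
        return vit_tokens + self.gamma * update
```

Because 𝛾 starts at zero, the module is exactly the identity before training, which is what prevents a sudden change to the ViT's distribution.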
Multi-Scale Feature Extractor
◼Composed of cross-attention and an FFN
• Extracts multi-scale features from the ViT
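The extractor mirrors the injector with the roles reversed: the spatial tokens query the ViT tokens, and an FFN refines the result. Again a hedged sketch, with an illustrative class name and dense multi-head attention standing in for the paper's deformable/sparse attention.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureExtractorSketch(nn.Module):
    """Toy extractor: spatial tokens query the ViT tokens via
    cross-attention, then pass through a feed-forward network (FFN),
    reorganizing ViT's single-scale features into multi-scale ones."""
    def __init__(self, dim=64, heads=4, ffn_ratio=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * ffn_ratio), nn.GELU(),
            nn.Linear(dim * ffn_ratio, dim))

    def forward(self, spatial_tokens, vit_tokens):
        q = self.norm_q(spatial_tokens)
        kv = self.norm_kv(vit_tokens)
        update, _ = self.attn(q, kv, kv)
        spatial_tokens = spatial_tokens + update  # residual cross-attention
        # Residual FFN refines the extracted multi-scale tokens.
        return spatial_tokens + self.ffn(self.ffn_norm(spatial_tokens))
```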
Experiments
◼Tasks and datasets
• Object detection and instance segmentation
• COCO [Lin+, ECCV 2014]
• Semantic segmentation
• ADE20K [Zhou+, CVPR 2017]
◼Model
• ViT
• Pre-trained on ImageNet-1K/22K
Object Detection
◼Results comparable to vision-specific transformers
◼Multi-modal pre-training further improves performance
Segmentation
◼Results comparable to vision-specific transformers
Feature Maps
Summary
◼Proposal of ViT-Adapter
• Closes the gap between plain ViT and vision-specific transformers
on dense prediction tasks
◼Injection of image-related inductive biases
• Obtains the fine-grained multi-scale features needed for dense tasks
◼Experiments
• Results comparable to or surpassing vision-specific methods
• Confirms the usefulness of multi-modal pre-training
Number of Parameters
[Plot: COCO box AP (%) vs. number of parameters (M) for ViT-Adapter (ours),
PVT, Swin, PVTv2, and plain ViT at the T/S/B/L scales; an annotation marks
51.2 AP for "+ Multi-Modal Pre-training".]
Method                  #Param   APb
PVTv2-B1                33.7M    44.9
ViT-T                   26.1M    40.2
ViT-Adapter-T (ours)    28.1M    46.0
PVTv2-B2                45.0M    47.8
Swin-T                  47.8M    46.0
ViT-S                   43.8M    44.0
ViT-Adapter-S (ours)    47.8M    48.2
Swin-B                  107.1M   48.6
ViT-B                   113.6M   45.8
ViT-Adapter-B (ours)    120.2M   49.6
ViT-Adapter-B^F (ours)  120.2M   51.2
ViT-L†                  337.3M   48.8
ViT-Adapter-L† (ours)   347.9M   52.1
[Excerpt from the paper: plain ViT (i.e., the vanilla transformer) has some
nonnegligible advantages; one example lies in multi-modal pre-training.
Stemming from the natural language processing (NLP) field, the transformer
makes no assumption about its input data. Equipped with different tokenizers,
e.g. patch embedding, 3D patch embedding, and token embedding, vanilla
transformers such as plain ViT can use massive data for pre-training, including ...]
Key Features Of Token Development (1).pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 

Paper introduction: Vision Transformer Adapter for Dense Predictions

  • 1. VISION TRANSFORMER ADAPTER FOR DENSE PREDICTIONS. Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao. ICLR 2023. Presenter: Jun Kimata (Nagoya Institute of Technology), 2023/11/30
  • 2. Overview
    ◼ Proposes an adapter for applying the Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021] to dense prediction tasks.
      • Conventional models use vision-specific backbones.
      • The proposed method uses a plain ViT as the backbone.
      • The backbone can therefore also be pre-trained on multi-modal data.
    Abstract excerpt: "... we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. Code and models will be released at https://github.com/czczup/ViT-Adapter."
    [Figure: (a) previous paradigm — a vision-specific transformer is pre-trained on image-modality data (SL/SSL), then fine-tuned on COCO/ADE20K for detection/segmentation; (b) our paradigm — a plain ViT is pre-trained on multi-modal data (image, video, text, ...), then fine-tuned with the ViT-Adapter for detection/segmentation.]
  • 3. Related work
    ◼ Dense prediction tasks require features at multiple scales.
    ◼ ViTDet [Li+, arXiv 2022]
      • Builds a feature pyramid from the ViT's single-scale output by up-sampling and down-sampling.
    ◼ Proposed method
      • The adapter additionally incorporates spatial priors from the input image itself.
  • 4. Proposed method
    ◼ Attaches an adapter to a plain ViT.
    ◼ The adapter consists of three types of modules.
    [Figure 4, "Published as a conference paper at ICLR 2023": (a) the ViT, whose encoder layers are divided into N (usually N = 4) equal blocks; (b) the ViT-Adapter, built from a spatial prior module plus an injector and an extractor per block, feeding the Det/Seg heads.]
  • 5. Spatial Prior Module
    ◼ Convolutions excel at capturing local spatial information.
      • Composed of a stem with three convolutions and max pooling.
      • Generates feature maps at different resolutions and flattens them into a token sequence.
    [Figure 4(c): the spatial prior module produces multi-scale feature maps F1, F2, F3 from the input image via the stem.]
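As a rough illustration of the pyramid-building step (not the paper's exact implementation), the sketch below uses NumPy with average pooling standing in for the module's stride-2 convolutions; the function names and shapes are assumptions for this example:

```python
import numpy as np

def pool2x(x):
    # 2x2 average pooling as a stand-in for a stride-2 convolution
    h, w, c = x.shape
    return x[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def spatial_prior(stem_feat):
    # stem_feat: (H/8, W/8, D) output of the convolutional stem
    f1 = stem_feat   # 1/8 resolution
    f2 = pool2x(f1)  # 1/16 resolution
    f3 = pool2x(f2)  # 1/32 resolution
    # flatten each map and concatenate into one token sequence F_sp
    return np.concatenate([f.reshape(-1, f.shape[-1]) for f in (f1, f2, f3)], axis=0)
```

With an 8×8×4 stem output this yields 64 + 16 + 4 = 84 tokens of dimension 4, which is the flattened multi-scale sequence the later modules attend over.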
  • 6. Spatial Feature Injector
    ◼ Injects the spatial priors into the ViT.
      • A learnable scale γ keeps the injection from drastically shifting the ViT's feature distribution.
    [Figure 4(d): cross-attention with the ViT features F_vit as query and the spatial-prior features F_sp as key/value; the output is scaled by γ and added back to F_vit.]
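A minimal single-head sketch of the injection step (hypothetical helper names; the actual module uses multi-head sparse attention with learned projections):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(query, key_value):
    # single-head scaled dot-product attention; projection matrices omitted for brevity
    d = query.shape[-1]
    return softmax(query @ key_value.T / np.sqrt(d)) @ key_value

def inject(F_vit, F_sp, gamma):
    # F_vit: (N_vit, d) ViT tokens as query; F_sp: (N_sp, d) spatial priors as key/value.
    # gamma is learnable; with gamma near 0 the injection is close to an identity,
    # so the ViT's pre-trained feature distribution is not drastically changed.
    return F_vit + gamma * cross_attention(F_vit, F_sp)
```

Setting `gamma = 0` recovers the unmodified ViT features exactly, which is why the scale can be initialized small at the start of fine-tuning.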
  • 7. Multi-Scale Feature Extractor
    ◼ Composed of cross-attention and an FFN.
      • Extracts multi-scale features from the ViT's single-scale features.
    [Figure 4(e): cross-attention with the multi-scale features F_sp as query and the ViT features F_vit as key/value, followed by a feed-forward network; reorganizes multi-scale features from the single-scale features of the ViT.]
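In the same simplified single-head notation (hypothetical helper names; the actual module uses multi-head sparse attention and normalization layers), the extractor swaps the roles of query and key/value relative to the injector and adds a residual FFN:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(query, key_value):
    # single-head scaled dot-product attention; projection matrices omitted for brevity
    d = query.shape[-1]
    return softmax(query @ key_value.T / np.sqrt(d)) @ key_value

def extract(F_sp, F_vit, W1, W2):
    # query = multi-scale tokens F_sp, key/value = ViT tokens F_vit
    F_sp = F_sp + cross_attention(F_sp, F_vit)
    # position-wise feed-forward network (ReLU) with a residual connection
    return F_sp + np.maximum(F_sp @ W1, 0.0) @ W2
```

Because the multi-scale tokens are the query here, the output keeps the pyramid's token count and can be reshaped back into per-scale feature maps for the detection/segmentation head.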
  • 8. Experiments
    ◼ Tasks and datasets
      • Object detection and instance segmentation
        • COCO [Lin+, ECCV 2014]
      • Semantic segmentation
        • ADE20K [Zhou+, CVPR 2017]
    ◼ Model
      • ViT
      • Pre-trained on ImageNet-1K / 22K
  • 13. Parameter count
    [Plot: COCO box AP (%) vs. #parameters (M) for ViT-Adapter (ours), PVT, Swin, PVTv2, and plain ViT at T/S/B/L sizes; multi-modal pre-training lifts ViT-Adapter-B to 51.2 AP.]

    Method                  | #Param  | AP^b
    ------------------------|---------|-----
    PVTv2-B1                | 33.7M   | 44.9
    ViT-T                   | 26.1M   | 40.2
    ViT-Adapter-T (ours)    | 28.1M   | 46.0
    PVTv2-B2                | 45.0M   | 47.8
    Swin-T                  | 47.8M   | 46.0
    ViT-S                   | 43.8M   | 44.0
    ViT-Adapter-S (ours)    | 47.8M   | 48.2
    Swin-B                  | 107.1M  | 48.6
    ViT-B                   | 113.6M  | 45.8
    ViT-Adapter-B (ours)    | 120.2M  | 49.6
    ViT-Adapter-BF (ours)   | 120.2M  | 51.2
    ViT-L†                  | 337.3M  | 48.8
    ViT-Adapter-L† (ours)   | 347.9M  | 52.1