Recent Hierarchical Vision Transformers

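Yusuke Uchida (@yu4u), Mobility Technologies Co., Ltd.
Recent Vision Transformers
~Aren't They All the Same?~
These slides are an expanded version of material presented at a DeNA + MoT paper reading group.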

2
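▪ ViT [1] takes off
▪ Transformers for images too! But they need massive data (JFT-300M)
▪ DeiT [2]
▪ Establishes a training recipe for ViT; matches CNNs using ImageNet alone
▪ MLP-Mixer [3]
▪ An MLP works in place of attention!
▪ A flood of ViT improvements and attention substitutes (MLP, pool, shift, LSTM)
Background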
[1] A. Dosovitskiy, et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Proc. of ICLR'21.
[2] H. Touvron, et al., "Training Data-efficient Image Transformers & Distillation Through Attention," in Proc. of ICML'21.
[3] I. Tolstikhin, et al., "MLP-Mixer: An all-MLP Architecture for Vision," in Proc. of NeurIPS'21.

3
▪ Pooling is done with a depthwise conv with stride 2
▪ Loses to ResNet50 on Deformable DETR-based object detection
Early improvement example: Pooling-based Vision Transformer (PiT)
B. Heo, et al., "Rethinking Spatial Dimensions of Vision Transformers," in Proc. of ICCV'21.

4
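▪ Stages 1 and 2 use MBConv (the main building block of MobileNetV2 onward and EfficientNet);
stages 3 and 4 use attention (+ relative positional embedding)
▪ MBConv downsamples with a strided conv; attention downsamples with pooling
Early improvement example: CoAtNet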
Z. Dai, et al., "CoAtNet: Marrying Convolution and Attention for All Data Sizes," in Proc. of NeurIPS'21.
identity residual

5
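▪ Naturally, people want to apply ViT to object detection and segmentation as well
▪ CNNs produce feature maps at multiple resolutions, from 1/4 to 1/32 of the input image
▪ ViT produces only a 1/16 map, which is ill-suited to small-object detection and fine-grained segmentation
▪ We want to handle high-resolution feature maps too!
▪ Everyone designed a Vision Transformer that clears this hurdle, and the result was...
Background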
B. Heo, et al., "Rethinking Spatial Dimensions of Vision Transformers," in Proc. of ICCV'21.
And next, speed-ups and
the like become the trend

6
Recent Vision Transformers (aren't they all the same!?)
Swin Transformer
PoolFormer
ShiftViT
AS-MLP
Shunted Transformer
CSWin Transformer
ResT
SepViT
Lite Vision Transformer
Pyramid Vision Transformer

7
Recent Vision Transformers (aren't they all the same!?)
After today's slides you will be able to say:
"No, they're different!"

8
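▪ An introduction to hierarchical Vision Transformer backbones that can be applied to
object detection and semantic segmentation
▪ In these slides, "Vision Transformer" means any model that applies a transformer to vision tasks,
not ViT specifically
▪ Methods such as DETR, which use attention in the part that solves the task, are not covered
▪ Assumes a rough understanding of the attention layer
▪ Linearly project the input to get Q, K, V
▪ Compute weights from softmax(QK^T) and output the weighted sum of V
▪ Several of these run in parallel (multi-head); knowing that much is enough!
▪ Go read "図で理解するTransformer" (Understanding the Transformer with Figures)!
Scope of These Slides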

9
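▪ Almost all of the Vision Transformers introduced here can be expressed in this form
▪ The main difference is the token mixer in the transformer block
▪ Attention-free models such as MLP-Mixer, PoolFormer, and ShiftViT can also be seen as
ViTs that differ only in the token mixer
▪ [1] calls this structure MetaFormer and argues that the structure itself is what drives performance
General Form of Hierarchical Vision Transformers (CNN-like hierarchical structure)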
Transformer
Block
[1] W. Yu, et al., "MetaFormer is Actually What You Need for Vision," in Proc. of CVPR’22.

10
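▪ Patch embedding: split the image into patches and turn them into tokens
▪ Positional encoding: add position information to the tokens
▪ Patch merging: halve the spatial resolution and increase the number of channels
▪ Transformer block: feature extraction with a token mixer (attention) and an FFN
Components of a Hierarchical Vision Transformer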
Transformer
Block

11
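▪ Split the image into small patches and convert each into a high-dimensional token
▪ ViT uses 16×16 patches
▪ Hierarchical Vision Transformers typically use 4×4
▪ Implemented as either
▪ rearrange (einops) -> linear
or
▪ Conv2D (kernel size = stride = patch size)
▪ Some models split into overlapping patches
▪ Some models downsample with multiple Conv2D layers,
as a CNN would
▪ Layer norm is sometimes present, sometimes not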
Patch Embedding
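
The "rearrange -> linear" route can be sketched in plain NumPy as below. This is a minimal illustration rather than any particular model's code: the function name `patch_embed`, the embedding width of 96, and the random weights are assumptions made for the example, and a reshape/transpose pair stands in for einops' `rearrange`.

```python
import numpy as np

def patch_embed(image, weight, bias, patch=4):
    """Turn an (H, W, C) image into (H/patch * W/patch, D) tokens."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by patch size"
    # "rearrange": (H, W, C) -> one flattened (patch*patch*C) vector per patch
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    # "linear": project each flattened patch to the embedding dimension
    return x @ weight + bias

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
embed_dim = 96  # hypothetical embedding width
weight = rng.standard_normal((4 * 4 * 3, embed_dim)) * 0.02
bias = np.zeros(embed_dim)

tokens = patch_embed(image, weight, bias, patch=4)
print(tokens.shape)  # (3136, 96), i.e. a 56x56 grid of tokens
```

Because every patch is multiplied by the same weight matrix and patches do not overlap, this computes the same thing as a Conv2D whose kernel size and stride both equal the patch size.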

12
▪ The Transformer (attention) itself is an encoder (decoder) of sets
▪ Positional encoding is needed to give each token position information
▪ There are many approaches
▪ Relative or absolute × fixed (sinusoidal) or learnable [1]
▪ Conditional positional encodings [2] (interesting, so covered in the appendix of these slides)
▪ Implicit embedding via a conv in the FFN [3]
▪ Absolute positional encoding is added to the input tokens
▪ The original ViT does this
▪ Relative positional encoding is added to the dot-product part of attention
Positional Encoding
[1] K. Wu, et al., "Rethinking and Improving Relative Position Encoding for Vision Transformer," in Proc. of ICCV'21.
[2] X. Chu, et al., "Conditional Positional Encodings for Vision Transformers," in arXiv:2102.10882.
[3] E. Xie, et al., "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers," in Proc. of
NeurIPS'21.

13
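▪ Merges each 2×2 neighborhood of tokens, halving the spatial resolution while
increasing the token dimension (often by 2x)
▪ Often implemented the same way as patch embedding
▪ As with patch embedding, some models make the patches overlap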
Patch Merging
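
A minimal NumPy sketch of non-overlapping 2×2 merging follows. The name `patch_merge`, the channel width of 96, and the 4C -> 2C projection are illustrative assumptions (the slide only says the dimension is often doubled); normalization layers, which some models add here, are left out.

```python
import numpy as np

def patch_merge(x, weight):
    """Merge each 2x2 neighborhood of an (H, W, C) token grid into one token."""
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(H // 2, W // 2, 4 * C)   # concatenate the 4 neighbors channel-wise
    return x @ weight                      # linear projection, here 4C -> 2C

rng = np.random.default_rng(0)
C = 96  # hypothetical channel width of the incoming stage
grid = rng.standard_normal((56, 56, C))
weight = rng.standard_normal((4 * C, 2 * C)) * 0.02

merged = patch_merge(grid, weight)
print(merged.shape)  # (28, 28, 192)
```

Structurally this is the patch embedding operation with a patch size of 2 applied to a token grid instead of an image, which is why the two often share an implementation.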

14
▪ Consists of layer norm, a token mixer (= self-attention),
a feed-forward network (FFN) (= MLP),
and skip connections
▪ The self-attention part is the key
▪ $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^\top / \sqrt{d}\right) V$
▪ $Q = W_q X,\ K = W_k X,\ V = W_v X$
Transformer block
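
The block can be sketched as follows, under several simplifying assumptions that are mine rather than the slide's: a single attention head, pre-norm ordering (layer norm before each sub-layer), layer norm without learnable scale and shift, and ReLU standing in for GELU in the FFN. Tokens are stored as rows, so the projections are written `x @ Wq` rather than $W_q X$.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Token mixer: softmax(Q K^T / sqrt(d)) V over all tokens."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(x, p):
    # token mixer with skip connection
    x = x + self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    # FFN (two-layer MLP) with skip connection
    h = np.maximum(layer_norm(x) @ p["W1"], 0.0)
    return x + h @ p["W2"]

rng = np.random.default_rng(0)
N, d = 49, 32  # hypothetical token count and width
params = {name: rng.standard_normal(shape) * 0.1 for name, shape in
          [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)), ("W1", (d, 4 * d)), ("W2", (4 * d, d))]}
x = rng.standard_normal((N, d))
y = transformer_block(x, params)
print(y.shape)  # (49, 32)
```

Swapping `self_attention` for a different function with the same input and output shape is all it takes to change the token mixer, which is the sense in which the models in these slides share one template.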

15
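▪ Self-attention cost grows with the square of the sequence length (the $QK^\top$ dot products)
▪ For images, sequence length = spatial size (H×W of the feature map)
▪ ViT with an input size of 224 gives a 14x14 feature map (1/16 of the input)
▪ Enlarge the image (e.g. 1280) and raise the feature resolution (e.g. 1/4 of the input)
and things get out of hand
▪ How to solve this problem is what distinguishes each method
▪ Window (local) attention, which restricts attention to a local range
▪ Spatial-reduction attention, which shrinks the spatial size of K and V (Q is left as is)
▪ In fact, almost everything is one of these two patterns (spoiler)
▪ Some combine the two, do spatial reduction at multiple scales,
or build the windows differently...
The Challenge When Using High-Resolution Feature Maps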
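
To put numbers on "out of hand", the short calculation below counts the entries of the $QK^\top$ matrix for the two settings mentioned on this slide. The helper name `attention_matrix_entries` is made up for the example, and the count covers a single head in a single layer.

```python
def attention_matrix_entries(image_size, stride):
    """Return (number of tokens, entries in the QK^T matrix) for a square image."""
    side = image_size // stride
    n_tokens = side * side
    return n_tokens, n_tokens ** 2

vit = attention_matrix_entries(224, 16)
hires = attention_matrix_entries(1280, 4)
print(vit)    # (196, 38416)
print(hires)  # (102400, 10485760000)
```

The token count grows by a factor of about 520, so the attention matrix grows by roughly 270,000 times.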

16
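▪ Hierarchical Vision Transformers have been proposed so that Vision Transformers can serve as
backbones for object detection and segmentation
▪ They share a common structure built from the same modules (patch embedding, positional encoding, patch merging, transformer block)
▪ The key is reducing the cost of the attention part of the transformer block
▪ Broadly divided into window (local) attention and spatial-reduction attention
▪ From here on, a rough walkthrough focusing on each model's
token mixer (attention)!
Summary So Far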

17
▪ Token mixer: Shifted Window-based Multi-head Self-attention
Swin Transformer
Two Successive
Swin Transformer Blocks
This is the key point
Z. Liu, et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," in Proc. of
ICCV'21.
Also check out the slides "Swin Transformer (ICCV'21 Best Paper) を
完璧に理解する資料" (A Guide to Fully Understanding Swin Transformer)!

18
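▪ Divide the feature map into windows of size M×M and
compute self-attention only within each window
▪ For a feature map with h×w patches, the cost drops from
$(hw)^2$ (the square of the number of patches) to $M^2 \times M^2 \times (h/M)(w/M) = M^2 hw$
(here $M^2 \times M^2$ is the cost per window and $(h/M)(w/M)$ is the number of windows)
▪ M = 7 (for an input size of 224)
▪ At C2 (stride = 4, 56x56 feature map) this gives 8x8 windows
Window-based Multi-head Self-attention (W-MSA)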
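
The window split and the cost formula can be checked with a small NumPy sketch. The helper names `window_partition` and `attention_pairs` and the channel width of 96 are assumptions for illustration; the counts are query-key pairs and ignore the channel dimension and the number of heads.

```python
import numpy as np

def window_partition(x, M):
    """(H, W, C) feature map -> (num_windows, M*M, C) groups of tokens."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def attention_pairs(h, w, M=None):
    """Query-key pairs for global attention (M=None) or M x M window attention."""
    if M is None:
        return (h * w) ** 2
    return (M * M) ** 2 * (h // M) * (w // M)

fmap = np.zeros((56, 56, 96))          # C2-sized feature map, hypothetical width 96
windows = window_partition(fmap, M=7)
print(windows.shape)                   # (64, 49, 96): 8x8 windows of 49 tokens each
print(attention_pairs(56, 56))         # 9834496
print(attention_pairs(56, 56, M=7))    # 153664
```

Attention is then run independently on each of the 64 groups, so the cost grows linearly with h×w for a fixed M.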

19
▪ W-MSA with the windows shifted by (M/2, M/2)
▪ Alternating it with regular window-based attention
creates connections between adjacent windows
Shifted Window-based Multi-head Self-attention (SW-MSA)
Example with h = w = 8, M = 4
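
For the h = w = 8, M = 4 case, the sketch below tracks token ids to show that a shifted window gathers tokens from several regular windows. It assumes the shift is realized as a cyclic roll of the feature map (`np.roll`), which is one common way to implement it; the variable names are illustrative.

```python
import numpy as np

def window_partition(x, M):
    """(H, W) grid -> (num_windows, M*M) groups."""
    H, W = x.shape
    return x.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M * M)

h = w = 8
M = 4
ids = np.arange(h * w).reshape(h, w)            # token ids on the 8x8 grid
regular = window_partition(ids, M)              # windows used by W-MSA

# shift the map by (M/2, M/2), then partition with the same window grid
shifted_map = np.roll(ids, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
shifted = window_partition(shifted_map, M)      # windows used by SW-MSA

# index of the regular window that each token id belongs to
owner = ((ids // w) // M) * (w // M) + (ids % w) // M
owners_in_first_shifted = np.unique(owner.reshape(-1)[shifted[0]])
print(owners_in_first_shifted)  # [0 1 2 3]
```

The first shifted window covers rows and columns 2 to 5 of the original grid, so it contains tokens from all four regular windows. That overlap is what lets information cross window boundaries when the two layer types alternate.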

20
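▪ The example shown would produce 9 windows, but the feature map is shifted so that
attention is computed over the same 2x2 windows as in the unshifted case
▪ In practice, a window that mixes several original windows uses
an attention mask to zero out attention between them
(normally used in a decoder to avoid looking at future information)
Efficient Implementation of SW-MSA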
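
One way to build such a mask is sketched below: label the rolled feature map with region ids so that tokens that were not neighbors before the roll get different labels, partition the labels into windows, and forbid attention between tokens whose labels differ. The function name `shifted_window_mask` is hypothetical, and `-inf` is used for masked logits here, whereas an implementation may prefer a large negative constant.

```python
import numpy as np

def shifted_window_mask(h, w, M):
    """Return (num_windows, M*M, M*M) additive mask: 0 = allowed, -inf = blocked."""
    s = M // 2
    region = np.zeros((h, w), dtype=int)
    bands = (slice(0, -M), slice(-M, -s), slice(-s, None))
    label = 0
    for rows in bands:
        for cols in bands:
            region[rows, cols] = label   # 3 x 3 = 9 regions in total
            label += 1
    # group region labels by window, exactly like the tokens themselves
    win = region.reshape(h // M, M, w // M, M).transpose(0, 2, 1, 3).reshape(-1, M * M)
    same_region = win[:, :, None] == win[:, None, :]
    return np.where(same_region, 0.0, -np.inf)

mask = shifted_window_mask(8, 8, 4)
print(mask.shape)                 # (4, 16, 16)
print(np.isinf(mask[0]).sum())    # 0: the first window holds a single region
print(np.isinf(mask[3]).sum())    # 192: the last window mixes four regions
```

The mask is added to the attention logits before the softmax. The 9 region labels correspond to the 9 windows a naive shift would create, yet only 2x2 windows are actually processed.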

21
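▪ Splits the channels into two halves and applies self-attention within horizontal stripes for one half and vertical stripes for the other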
CSWin Transformer
X. Dong, et al., "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped
Windows," in Proc. of CVPR’22.

22
▪ Somehow crammed a huge model onto GPUs!
▪ It has switched to post-norm...
Swin Transformer V2
Ze Liu, et al., "Swin Transformer V2: Scaling Up Capacity and Resolution," in Proc. of CVPR’22.

23
▪ Token mixer: Spatial-Reduction Attention (SRA)
Pyramid Vision Transformer (PVT)
W. Wang, et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without
Convolutions," in Proc. of ICCV, 2021.
Spatial-Reduction Attention
(SRA) is the key point

24
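▪ Only K and V (the dictionary side) are reduced in spatial size
▪ Implemented as Conv2D -> LayerNorm
▪ Q is left as is, so
the output size does not change
▪ The reduction ratios per stage are 8, 4, 2, 1,
chosen to match the downsampling ratio of the feature map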
Spatial-Reduction Attention (SRA)
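
A single-head NumPy sketch of the idea is given below. To keep it short, average pooling stands in for the Conv2D -> LayerNorm reduction described on the slide, so this should be read as an illustration of the shapes involved and not as PVT's actual layer; the function name, channel width, and weights are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def spatial_reduction_attention(x, h, w, R, Wq, Wk, Wv):
    """x: (h*w, C) tokens in row-major order. K and V come from an R-times smaller grid."""
    C = x.shape[-1]
    Q = x @ Wq                                            # queries keep full resolution
    grid = x.reshape(h // R, R, w // R, R, C)
    reduced = grid.mean(axis=(1, 3)).reshape(-1, C)       # (h/R * w/R, C), pooling as a stand-in
    K, V = reduced @ Wk, reduced @ Wv
    A = softmax(Q @ K.T / np.sqrt(C))                     # (h*w, h*w / R^2)
    return A @ V, A

rng = np.random.default_rng(0)
h = w = 56
C = 64  # hypothetical channel width
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
x = rng.standard_normal((h * w, C))

out, A = spatial_reduction_attention(x, h, w, 8, Wq, Wk, Wv)
print(out.shape, A.shape)  # (3136, 64) (3136, 49)
```

With R = 8 the attention matrix has 49 columns instead of 3136, a 64x reduction, while the output still has one row per input token.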

25
▪ Changes the downsampling in SRA to average pooling
▪ Uses a conv for patch embedding so that patches overlap
▪ Inserts a depthwise conv into the FFN and
removes the positional embedding
(implicit positional encoding)
PVTv2
W. Wang, et al., "PVTv2: Improved Baselines with Pyramid Vision Transformer," in Journal of
Computational Visual Media, 2022.

26
▪ A model whose main task is video recognition
▪ Attention with pooled K and V, as in PVT
▪ Compares max pool, average pool, and
strided depthwise conv as the pooling function;
depthwise conv gives the best accuracy
▪ PVT→PVTv2 switched from conv to average pool
▪ PVT used a regular conv, not a depthwise one
▪ Interestingly, patch merging (downsampling) is
done by
downsampling Q
Multiscale Vision Transformers (MViT)
H. Fan, et al., "Multiscale Vision Transformers," in Proc. of ICCV'21.

27
▪ Adds a residual pooling connection
▪ Adds a decomposed relative position embedding E(rel)
▪ Instead of one H×W×T table, assumes the axes are independent and keeps a separate table per dimension
MViTv2
Y. Li, et al., "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection," in Proc.
of CVPR'22.

28
▪ Uses a CNN-style stem conv for patch embedding
▪ Uses convolutional token embedding for positional encoding
▪ Uses depthwise separable convs to produce Q, K, V (K and V are reduced)
CvT
H. Wu, et al., "CvT: Introducing Convolutions to Vision Transformers," in Proc. of ICCV'21.

29
▪ Efficient multi-head self-attention
▪ Reduces K and V, same as PVT
▪ The difference is that the reduction uses a DWConv
ResT
Q. Zhang and Y. Yang, "ResT: An Efficient Transformer for Visual Recognition," in Proc. of NeurIPS'21.

30
▪ This is also spatial-reduction attention
▪ Each head uses K and V with a different reduction ratio
▪ The figure on the right is clear and lovely
Shunted Transformer
S. Ren, et al., "Shunted Self-Attention via Multi-Scale Token Aggregation," in Proc. of CVPR'22.
Shunted
Transformer

31
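▪ Convolution exploits high-frequency information, attention low-frequency information
▪ A method that uses both, like the Inception module in GoogLeNet
▪ The share of channels given to attention increases in later stages
▪ In stages 1 and 2 the attention is spatial-reduction attention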
Inception Transformer
C. Si, et al., "Inception Transformer," in arXiv:2205.12956.

32
▪ An architecture that alternates LSA and GSA
▪ Locally-grouped self-attention (LSA): Swin's window attention
▪ Global sub-sampled attention (GSA): PVT's spatial-reduction attention
Twins
X. Chu, et al., "Twins: Revisiting the Design of Spatial Attention in Vision Transformers," in Proc. of
NeurIPS'21.

33
▪ Patches around the query are pooled at multiple resolutions to form K and V
▪ High resolution nearby, low resolution far away
Focal Transformer (the ideal)
J. Yang, et al., "Focal Self-attention for Local-Global Interactions in Vision Transformers," in Proc. of
NeurIPS'21.

34
▪ With two levels, it is essentially local plus global attention
▪ "For the focal self-attention layer, we introduce two levels, one for fine-grain
local attention and one for coarse-grain global attention"
Focal Transformer (the reality)
J. Yang, et al., "Focal Self-attention for Local-Global Interactions in Vision Transformers," in Proc. of
NeurIPS'21.
The number of levels is generalized to L,
and the figure shows L = 3, yet in practice
only 2 levels are used...

35
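▪ Combines SDA (window attention) with LDA, which spatially shuffles the feature map
and then applies window attention
▪ Spatial shuffle is also used in [2]
▪ Once upon a time, in the land of CNNs, there was a thing called ShuffleNet...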
CrossFormer [1]
[1] W. Wang, et al., "CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention," in
Proc. of ICLR'22.
[2] Z. Huang, et al., "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer," in
arXiv:2106.03650.

36
▪ Models where attention is not (the main) ingredient, and so on
Bonus

37
▪ Places a stream of global tokens
in parallel with MobileNet
▪ The main body is a CNN
▪ The two exchange information via cross-attention
Mobile-Former
Y. Chen, et al., "Mobile-Former: Bridging MobileNet and Transformer," in Proc. of CVPR'22.
[Figure labels: MobileNet stream; global token stream; cross-attention in both directions between the two streams]

38
▪ A shift operation instead of attention
▪ Shifts by 1 pixel in the spatial directions (up, down, left, right)
▪ Hence ZERO FLOPs!!!
▪ Prior methods such as S2-MLP [2] and AS-MLP [3]
exist, but
ShiftViT really is just shifts
ShiftViT [1]
[1] G. Wang, et al., "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to
Attention Mechanism," in Proc. of AAAI'22.
[2] T. Yu, et al., "S2-MLP: Spatial-Shift MLP Architecture for Vision," in Proc. of WACV'22.
[3] D. Lian, et al., "AS-MLP: An Axial Shifted MLP Architecture for Vision," in Proc. of ICLR'22.
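
A shift-based token mixer can be sketched as below. The details are assumptions for illustration: four groups of `g` channels are each shifted by one pixel in a different direction, the remaining channels are left untouched, and the vacated border is filled with zeros. The function name `shift_mixer` and the group size are made up for the example.

```python
import numpy as np

def shift_mixer(x, g):
    """x: (H, W, C). Shift four groups of g channels by one pixel; the rest stay put."""
    out = x.copy()
    out[:, 1:, 0 * g:1 * g] = x[:, :-1, 0 * g:1 * g]   # group 0: shift right
    out[:, :1, 0 * g:1 * g] = 0
    out[:, :-1, 1 * g:2 * g] = x[:, 1:, 1 * g:2 * g]   # group 1: shift left
    out[:, -1:, 1 * g:2 * g] = 0
    out[1:, :, 2 * g:3 * g] = x[:-1, :, 2 * g:3 * g]   # group 2: shift down
    out[:1, :, 2 * g:3 * g] = 0
    out[:-1, :, 3 * g:4 * g] = x[1:, :, 3 * g:4 * g]   # group 3: shift up
    out[-1:, :, 3 * g:4 * g] = 0
    return out

x = np.arange(4 * 4 * 8, dtype=float).reshape(4, 4, 8)
y = shift_mixer(x, g=1)   # channels 0-3 are shifted, channels 4-7 are untouched
print(y.shape)            # (4, 4, 8)
```

The function only copies memory: there are no multiplications or additions and no learnable parameters, which is where the "ZERO FLOPs" claim comes from. Mixing across channels is left to the FFN that follows.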

39
▪ A pool operation instead of attention!
▪ (The MetaFormer paper)
PoolFormer
W. Yu, et al., "MetaFormer is Actually What You Need for Vision," in Proc. of CVPR’22.
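
A pooling token mixer can be sketched in NumPy as follows. The 3x3 window, the exclusion of padded positions from the average, and the subtraction of the input (so that the surrounding skip connection does not count the identity twice) reflect my understanding of the PoolFormer design and should be treated as assumptions; the function name is hypothetical.

```python
import numpy as np

def pool_mixer(x, k=3):
    """x: (H, W, C). k x k average pooling (stride 1, border-aware) minus the input."""
    H, W, C = x.shape
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))                      # zero padding
    valid = np.pad(np.ones((H, W, 1)), ((p, p), (p, p), (0, 0)))  # 1 where real tokens exist
    total = np.zeros_like(x)
    count = np.zeros((H, W, 1))
    for dy in range(k):
        for dx in range(k):
            total += xp[dy:dy + H, dx:dx + W]
            count += valid[dy:dy + H, dx:dx + W]
    return total / count - x

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 8))
y = pool_mixer(x)
print(y.shape)  # (14, 14, 8)
```

Like the shift mixer, this has no learnable parameters; each token simply moves toward the mean of its neighborhood.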

40
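▪ Introduced recent hierarchical Vision Transformers
▪ They share a common structure built from the same modules (patch embedding, positional encoding, patch merging, transformer block)
▪ The key is reducing the cost of the attention part of the transformer block
▪ Broadly divided into window (local) attention and spatial-reduction attention
▪ Combinations exist too: both in one block, or separately in consecutive blocks
▪ Dropping position encoding and putting a DWConv in the FFN looks promising (personal opinion)
▪ The trend is to drop the cls token and use global average pooling
Summary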

41
Summary (caught on at ICCV'21 and NeurIPS'21, perfected at CVPR'22?)
Model Name | Paper Title | Published at | Attention Type
HaloNet | Scaling Local Self-Attention for Parameter Efficient Visual Backbones | CVPR'21 | overlapped window
Swin Transformer | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ICCV'21 | window + shifted window
PVT | Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | ICCV'21 | spatial reduction
MViT | Multiscale Vision Transformers | ICCV'21 | spatial reduction
CvT | CvT: Introducing Convolutions to Vision Transformers | ICCV'21 | spatial reduction
Vision Longformer | Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ICCV'21 | window + global token
CrossViT | CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification | ICCV'21 | global
CeiT | Incorporating Convolution Designs into Visual Transformers | ICCV'21 | global
CoaT | Co-Scale Conv-Attentional Image Transformers | ICCV'21 | factorized
ResT | ResT: An Efficient Transformer for Visual Recognition | NeurIPS'21 | spatial reduction
Twins | Twins: Revisiting the Design of Spatial Attention in Vision Transformers | NeurIPS'21 | window + spatial reduction
Focal Transformer | Focal Self-attention for Local-Global Interactions in Vision Transformers | NeurIPS'21 | window + spatial reduction
CoAtNet | CoAtNet: Marrying Convolution and Attention for All Data Sizes | NeurIPS'21 | global
SegFormer | SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | NeurIPS'21 | spatial reduction
TNT | Transformer in Transformer | NeurIPS'21 | window + spatial reduction
CrossFormer | CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention | ICLR'22 | window + shuffle
RegionViT | RegionViT: Regional-to-Local Attention for Vision Transformers | ICLR'22 | window + regional token
PoolFormer / MetaFormer | MetaFormer is Actually What You Need for Vision | CVPR'22 | pool
CSWin Transformer | CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows | CVPR'22 | cross-shaped window
Swin Transformer V2 | Swin Transformer V2: Scaling Up Capacity and Resolution | CVPR'22 | window + shifted window
MViTv2 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | CVPR'22 | spatial reduction
Shunted Transformer | Shunted Self-Attention via Multi-Scale Token Aggregation | CVPR'22 | spatial reduction
Mobile-Former | Mobile-Former: Bridging MobileNet and Transformer | CVPR'22 | global token
Lite Vision Transformer | Lite Vision Transformer with Enhanced Self-Attention | CVPR'22 | conv attention
PVTv2 | PVTv2: Improved Baselines with Pyramid Vision Transformer | CVMJ'22 | spatial reduction

43
▪ Self-attention itself is just an encoder of sets
▪ Positional encoding tells it that the data is a sequence
▪ Swin uses a relative position bias
▪ Making it relative expresses translation invariance
Relative Position Bias
Attention strength is adjusted (learnable) according to
the relative positions of tokens within a window

44
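▪ Relative positions range over [−M + 1, M − 1] on each axis, giving $(2M-1)^2$ patterns
▪ Keep the mapping between these biases and their indices, and look the bias up when it is used
Implementation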
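
The lookup can be sketched as follows: precompute, for every pair of tokens in an M×M window, an index into a table with $(2M-1)^2$ rows, then gather from the learnable table at attention time. The function name, the number of heads, and the random table are assumptions for the example.

```python
import numpy as np

def relative_position_index(M):
    """Return an (M*M, M*M) array of indices into a table with (2M-1)^2 rows."""
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))  # (2, M, M)
    flat = coords.reshape(2, -1)                                               # (2, M*M)
    rel = flat[:, :, None] - flat[:, None, :]     # row/col offsets in [-M+1, M-1]
    rel = rel + (M - 1)                           # shift to [0, 2M-2]
    return rel[0] * (2 * M - 1) + rel[1]          # flatten the 2D offset to one index

M = 7
num_heads = 3  # hypothetical
index = relative_position_index(M)

rng = np.random.default_rng(0)
table = rng.standard_normal(((2 * M - 1) ** 2, num_heads))  # stands in for the learnable biases
bias = table[index]                                         # (M*M, M*M, num_heads)
print(index.shape, bias.shape)  # (49, 49) (49, 49, 3)
```

The gathered `bias` is added to the attention logits of each head. Token pairs with the same relative offset share one table entry no matter where they sit in the window.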

45
▪ On Position Embeddings in BERT, ICLR’21
▪ https://openreview.net/forum?id=onxoVA9FxMw
▪ https://twitter.com/akivajp/status/1442241252204814336
▪ Rethinking and Improving Relative Position Encoding for Vision
Transformer, ICCV’21. thanks to @sasaki_ts
▪ CSWin Transformer: A General Vision Transformer Backbone with
Cross-Shaped Windows, arXiv’21. thanks to @Ocha_Cocoa
Discussion of Positional Encoding

46
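▪ A position encoding (PE) that depends on the input tokens, is independent of the input image size, is translation-invariant, and can still loosely account for absolute coordinates
▪ Implementation: simply reshape the tokens into a 2D feature map and apply a conv with zero padding
▪ Zero-padded convs have been reported to let CNNs retain absolute coordinates in their feature maps [2]
▪ Inspired by this, PVTv2 inserts a DWConv into the FFN and removes the PE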
Conditional Positional Encoding (CPE) [1]
[1] X. Chu, et al., "Conditional Positional Encodings for Vision Transformers," in arXiv:2102.10882.
[2] M. Islam, et al., "How Much Position Information Do Convolutional Neural Networks Encode?," in
Proc. of ICLR'20.
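
The "reshape to 2D and apply a zero-padded conv" recipe can be sketched as below. Two choices are assumptions on my part and not stated on the slide: the conv is depthwise (one k×k filter per channel), and its output is added back to the tokens. The function name `conditional_pe` and the all-ones kernel are purely illustrative.

```python
import numpy as np

def conditional_pe(tokens, h, w, kernel):
    """tokens: (h*w, C) in row-major order; kernel: (k, k, C) depthwise weights."""
    C = tokens.shape[-1]
    k = kernel.shape[0]
    p = k // 2
    fmap = tokens.reshape(h, w, C)                        # back to a 2D feature map
    padded = np.pad(fmap, ((p, p), (p, p), (0, 0)))       # zero padding
    pe = np.zeros_like(fmap)
    for dy in range(k):
        for dx in range(k):
            pe += padded[dy:dy + h, dx:dx + w] * kernel[dy, dx]
    return (fmap + pe).reshape(h * w, C)                  # add the encoding to the tokens

C = 8  # hypothetical channel width
kernel = np.ones((3, 3, C))
out = conditional_pe(np.ones((14 * 14, C)), 14, 14, kernel).reshape(14, 14, C)
print(out[0, 0, 0], out[0, 7, 0], out[7, 7, 0])  # 5.0 7.0 10.0
```

Even with a constant input, corner, edge, and interior tokens receive different values because of the zero padding, which is how the encoding can carry a hint of absolute position. The same kernel also runs unchanged on a grid of any other size.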
