Swin Transformer has recently been drawing attention as one of the best-performing models in object detection and semantic segmentation.
Swin Transformer brings the Transformer, widely used in NLP, to the vision domain; its defining features are hierarchical feature maps and
window-based self-attention. You can think of Swin Transformer as a model that improves on the limitations of the
Vision Transformer, the method Google proposed last year.
The Transformer's limitations... quite a topic!
Seonok Kim from the Image Processing Team prepared this detailed review!!
Thank you in advance for your interest today as well!!
https://youtu.be/L3sH9tjkvKI
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
1. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Submitted on 25 Mar 2021
Authors: Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo (Microsoft)
Image Processing Team
Dawoon Heo / Jeongho Shin / Sanghyun Kim / Seonok Kim (🙋)
2021.05.23
2. 2
Swin Transformer
Why❓
Image classification: 🥇86.4 top-1 accuracy on ImageNet-1K
Object detection: 🥇58.7 box AP and 🥇51.1 mask AP on COCO test-dev
Semantic segmentation: 🥇53.5 mIoU on ADE20K val
MS COCO Dataset Benchmarks (1)
(1) https://paperswithcode.com/dataset/coco
3. 3
Swin Transformer
What❓
Hierarchical feature maps, where at each level of the hierarchy self-attention is applied within local non-overlapping windows.
Window-based self-attention reduces the computational overhead.
Swin Transformer is a new vision Transformer with hierarchical feature maps and shifted windows.
[Figure: feature map construction in Vision Transformer (ViT) vs. Swin Transformer]
6. 6
Swin Transformer
Transformers & CNNs
Attention can simultaneously extract all the information we need from the input and its inter-relations.
[Figure: attention over the sentence "A cat sleeping on couch" in Transformers vs. CNNs]
CNNs are much more localized. They lack the spatial information necessary for many tasks like instance
recognition, because convolutions do not consider relations between distant pixels.
7. 7
Swin Transformer
Self-attention
Self-attention can be seen as a measure of a specific word's effect on all the other words of the same
sentence.
The same process can be applied to images by calculating the attention between patches of the image and their relations to
each other.
[Figure: self-attention weights between the words of "A cat sleeping on couch"]
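As a rough illustration (a minimal sketch, not the paper's code), scaled dot-product self-attention over a sequence of tokens can be written in a few lines of PyTorch; the same routine applies whether the tokens are words of a sentence or image patches:

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (num_tokens, dim) token embeddings; w_q/w_k/w_v: (dim, dim) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)  # every token vs. every other token
    attn = F.softmax(scores, dim=-1)                         # attention weights, rows sum to 1
    return attn @ v                                          # weighted mix of value vectors

# Toy example: 5 tokens ("A cat sleeping on couch") with 16-dim embeddings.
dim = 16
x = torch.randn(5, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)   # (5, 16)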
8. 8
Swin Transformer
Motivation
Challenges in adapting Transformer from language to vision arise from differences between the two
domains.
Regular self-attention requires a number of operations quadratic in the image size, which limits its applications in
computer vision, where high resolution is often necessary.
For example, a 250 px × 250 px image has 62,500 pixels, so pairwise self-attention over all of them means 62,500² = 3,906,250,000 calculations 🤯
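A quick sanity check of that number in Python (treating every pixel as a token, as naive global self-attention would):

h = w = 250            # 250 px x 250 px image
tokens = h * w         # 62,500 pixels, one token each
pairs = tokens ** 2    # every token attends to every other token
print(tokens, pairs)   # 62500  3906250000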
12. 12
Swin Transformer
Patch Partition
[Figure: the "A cat sleeping on couch" image split into an 8×8 grid of patch tokens; each 4×4 patch with 3 channels gives a token of dimension 48]
At first, like most computer vision tasks, an RGB
image is sent to the network.
This image is split into patches, and each patch
is treated as a token.
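A minimal sketch of this patch-partition step using plain reshapes (the function name and toy sizes here are illustrative, not the official implementation):

import torch

def patch_partition(img, patch=4):
    # img: (C, H, W) RGB image -> (num_patches, patch*patch*C) tokens
    c, h, w = img.shape
    x = img.reshape(c, h // patch, patch, w // patch, patch)
    x = x.permute(1, 3, 2, 4, 0).reshape(-1, patch * patch * c)
    return x

img = torch.randn(3, 32, 32)       # toy 32x32 RGB image
tokens = patch_partition(img)      # (64, 48): an 8x8 grid of tokens, 4*4*3 = 48 dims each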
13. 13
Swin Transformer
Vision Transformer (ViT) vs Swin Transformer
[Figure: architecture comparison of Vision Transformer (ViT) (2) and Swin Transformer]
(2) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (arXiv:2010.11929)
ViT directly applies a Transformer architecture to non-overlapping image patches, and its complexity grows
quadratically with image size. The Swin Transformer block replaces the standard multi-head self-attention with a window-based version.
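For reference, the complexity comparison given in the paper for an h × w grid of patch tokens with channel dimension C and window size M:

Ω(MSA)   = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC

The first term covers the QKV and output projections; the second is the attention itself, quadratic in hw for global MSA but linear in hw for W-MSA once M is fixed (M = 7 by default).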
14. 14
Swin Transformer
Swin Transformer Blocks
W-MSA
The input feature map is divided into non-overlapping
windows (four in the figure), and attention is computed
for the patches within each window.
SW-MSA
The approach introduces connections
between neighboring non-overlapping
windows in the previous layer.
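A minimal sketch of the W-MSA windowing, assuming the features already form an (H, W, C) token grid (names and sizes are illustrative); self-attention is then run independently inside each window:

import torch

def window_partition(x, window=7):
    # x: (H, W, C) token grid -> (num_windows, window*window, C) non-overlapping windows
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, window * window, c)
    return x

feat = torch.randn(56, 56, 96)      # stage-1 sized token grid (illustrative)
windows = window_partition(feat)    # (64, 49, 96): attention runs per 7x7 window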
15. 15
Swin Transformer
Shifted Window based Self-Attention
Self-attention is applied within local groups of patches, referred to as windows. In layer l (left), a regular window
partitioning scheme is adopted, and self-attention is computed within each window.
In the next layer, the windows are shifted, resulting in a new window configuration, and self-attention is applied again. This
creates connections between windows while maintaining the computational efficiency of the
windowed architecture.
16. 16
Swin Transformer
Cyclic-Shift
The paper proposes an efficient batch-computation approach in which the feature map is cyclic-shifted toward the top-left.
After this shift, a batched window may be composed of several sub-windows that are not adjacent in the
feature map, so a masking mechanism is employed to limit self-attention computation to within each
sub-window.
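A rough sketch of the cyclic shift using torch.roll; the attention mask that keeps tokens from non-adjacent sub-windows apart is omitted here for brevity:

import torch

window = 7
shift = window // 2                 # shift the window grid by half a window
feat = torch.randn(56, 56, 96)      # (H, W, C) token grid, sizes illustrative

# Cyclic shift toward the top-left, so shifted windows can be batched
# exactly like regular windows.
shifted = torch.roll(feat, shifts=(-shift, -shift), dims=(0, 1))

# ... window_partition + masked self-attention on `shifted` ...

# Reverse the shift afterwards to restore the original layout.
restored = torch.roll(shifted, shifts=(shift, shift), dims=(0, 1))
assert torch.equal(restored, feat)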
18. 18
Swin Transformer
Variants
The paper also introduces Swin-T, Swin-S, and Swin-L, versions of about 0.25×, 0.5×, and 2× the
model size and computational complexity of the base model Swin-B.
The channel dimension C goes from 96 to 128 and 192, and the number of blocks in the third stage changes from 6 to 18,
while the other stages keep 2 blocks.
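For reference, the variant configurations reported in the paper, summarized as a small Python dict (the dict itself is just our summary, not code from the repository):

# C = channel dimension of the first stage; depths = Swin blocks per stage
swin_variants = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},   # ~0.25x Swin-B
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},  # ~0.5x  Swin-B
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},  # base model
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},  # ~2x    Swin-B
}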
20. 20
Swin Transformer
Ablation study
The paper shows the results of an ablation study on the shifted-window approach. The speed of the padding and
cyclic-shift methods is also compared.
25. 25
Swin Transformer
Conclusion
Using a similar architecture for both NLP and computer vision could significantly accelerate the research
process.
This paper presents Swin Transformer, a new vision Transformer which produces a hierarchical feature
representation and has linear computational complexity with respect to input image size.
As a key element of Swin Transformer, the shifted window based self-attention is shown to be effective and
efficient on vision problems.
28. Sources
Original Paper
Paper (https://arxiv.org/pdf/2103.14030.pdf)
GitHub (https://github.com/microsoft/Swin-Transformer)
YouTube
AI Bites (https://www.youtube.com/watch?v=tFYxJZBAbE8)
What's AI (https://www.youtube.com/watch?v=QcCJJOLCeJQ&t=4s)
Blog
Hello Joo World! (https://velog.io/@hangjoo_-)
Deep.log (https://velog.io/@yhyj1001)
Presenter: Seonok Kim (sokim0991@korea.ac.kr)