Swin Transformer has recently been drawing attention as one of the best-performing models in object detection and semantic segmentation.
Swin Transformer brings the Transformer, widely used in NLP, to the vision domain; its defining features are hierarchical feature maps and
window-based self-attention. You can think of Swin Transformer as a model that improves on the limitations of the
Vision Transformer, the method Google proposed last year.
The Transformer's limitations... quite a topic!
Seonok Kim from the Image Processing Team prepared this detailed review!!
Thank you in advance for your interest today as well!!
https://youtu.be/L3sH9tjkvKI
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
1. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Submitted on 25 Mar 2021
Authors: Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo (Microsoft)
Image Processing Team
Dawoon Heo / Jeongho Shin / Sanghyun Kim / Seonok Kim (🙋)
2021.05.23
2. 2
Swin Transformer
Why❓
Image classification: 🥇86.4 top-1 accuracy on ImageNet-1K
Object detection: 🥇58.7 box AP and 🥇51.1 mask AP on COCO test-dev
Semantic segmentation: 🥇53.5 mIoU on ADE20K val
MS COCO Dataset Benchmarks (1)
(1) https://paperswithcode.com/dataset/coco
3. 3
Swin Transformer
What❓
Hierarchical feature maps, where at each level of the hierarchy self-attention is applied within local non-overlapping windows.
Window-based self-attention reduces the computational overhead.
Swin Transformer is a new vision Transformer with hierarchical feature maps and shifted windows.
[Figure: feature map construction in Vision Transformer (ViT) vs. Swin Transformer]
6. 6
Swin Transformer
Transformers & CNNs
Attention can simultaneously extract all the information we need from the input and its inter-relations.
[Figure: attention over the sentence "A cat sleeping on couch" in Transformers vs. CNNs]
CNNs are much more localized. They lack the spatial information necessary for many tasks like instance
recognition, because convolutions do not consider relations between distant pixels.
7. 7
Swin Transformer
Self-attention
Self-attention can be seen as a measure of a specific word's effect on all the other words of the same
sentence.
The same process can be applied to images by calculating the attention between patches of the image and their relations to
each other.
[Figure: self-attention weights between the words of "A cat sleeping on couch"]
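As a rough illustration (a minimal sketch, not the paper's code), scaled dot-product self-attention over a sequence of tokens can be written in a few lines of PyTorch; the same routine applies whether the tokens are words of a sentence or image patches:

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (num_tokens, dim) token embeddings; w_q/w_k/w_v: (dim, dim) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)  # every token vs. every other token
    attn = F.softmax(scores, dim=-1)                         # attention weights, rows sum to 1
    return attn @ v                                          # weighted mix of value vectors

# Toy example: 5 tokens ("A cat sleeping on couch") with 16-dim embeddings.
dim = 16
x = torch.randn(5, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)   # (5, 16)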
8. 8
Swin Transformer
Motivation
Challenges in adapting Transformer from language to vision arise from differences between the two
domains.
Regular self-attention requires a number of operations quadratic in the image size, which limits its applications in
computer vision, where high resolution is often necessary.
For example, a 250 px × 250 px image has 62,500 pixels, so pairwise self-attention over all of them means 62,500² = 3,906,250,000 calculations 🤯
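A quick sanity check of that number in Python (treating every pixel as a token, as naive global self-attention would):

h = w = 250            # 250 px x 250 px image
tokens = h * w         # 62,500 pixels, one token each
pairs = tokens ** 2    # every token attends to every other token
print(tokens, pairs)   # 62500  3906250000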
12. 12
Swin Transformer
Patch Partition
[Figure: the "A cat sleeping on couch" image split into an 8×8 grid of patch tokens; each 4×4 patch with 3 channels gives a token of dimension 48]
At first, like most computer vision tasks, an RGB
image is sent to the network.
This image is split into patches, and each patch
is treated as a token.
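A minimal sketch of this patch-partition step using plain reshapes (the function name and toy sizes here are illustrative, not the official implementation):

import torch

def patch_partition(img, patch=4):
    # img: (C, H, W) RGB image -> (num_patches, patch*patch*C) tokens
    c, h, w = img.shape
    x = img.reshape(c, h // patch, patch, w // patch, patch)
    x = x.permute(1, 3, 2, 4, 0).reshape(-1, patch * patch * c)
    return x

img = torch.randn(3, 32, 32)       # toy 32x32 RGB image
tokens = patch_partition(img)      # (64, 48): an 8x8 grid of tokens, 4*4*3 = 48 dims each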
13. 13
Swin Transformer
Vision Transformer (ViT) vs Swin Transformer
[Figure: architecture comparison of Vision Transformer (ViT) (2) and Swin Transformer]
(2) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (arXiv:2010.11929)
ViT directly applies a Transformer architecture to non-overlapping image patches, and its complexity grows
quadratically with image size. The Swin Transformer block replaces the standard multi-head self-attention with a window-based version.
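For reference, the complexity comparison given in the paper for an h × w grid of patch tokens with channel dimension C and window size M:

Ω(MSA)   = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC

The first term covers the QKV and output projections; the second is the attention itself, quadratic in hw for global MSA but linear in hw for W-MSA once M is fixed (M = 7 by default).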
14. 14
Swin Transformer
Swin Transformer Blocks
W-MSA
The input feature map is divided into non-overlapping
windows (four in the figure), and attention is computed
for the patches within each window.
SW-MSA
The approach introduces connections
between neighboring non-overlapping
windows in the previous layer.
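A minimal sketch of the W-MSA windowing, assuming the features already form an (H, W, C) token grid (names and sizes are illustrative); self-attention is then run independently inside each window:

import torch

def window_partition(x, window=7):
    # x: (H, W, C) token grid -> (num_windows, window*window, C) non-overlapping windows
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, window * window, c)
    return x

feat = torch.randn(56, 56, 96)      # stage-1 sized token grid (illustrative)
windows = window_partition(feat)    # (64, 49, 96): attention runs per 7x7 window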
15. 15
Swin Transformer
Shifted Window based Self-Attention
Self-attention is applied within local groups of patches, referred to as windows. In layer l (left), a regular window
partitioning scheme is adopted, and self-attention is computed within each window.
In the next layer, the windows are shifted, resulting in a new window configuration, and self-attention is applied again. This
creates connections between windows while maintaining the computational efficiency of the
windowed architecture.
16. 16
Swin Transformer
Cyclic-Shift
The paper proposes an efficient batch-computation approach in which the feature map is cyclic-shifted toward the top-left.
After this shift, a batched window may be composed of several sub-windows that are not adjacent in the
feature map, so a masking mechanism is employed to limit self-attention computation to within each
sub-window.
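A rough sketch of the cyclic shift using torch.roll; the attention mask that keeps tokens from non-adjacent sub-windows apart is omitted here for brevity:

import torch

window = 7
shift = window // 2                 # shift the window grid by half a window
feat = torch.randn(56, 56, 96)      # (H, W, C) token grid, sizes illustrative

# Cyclic shift toward the top-left, so shifted windows can be batched
# exactly like regular windows.
shifted = torch.roll(feat, shifts=(-shift, -shift), dims=(0, 1))

# ... window_partition + masked self-attention on `shifted` ...

# Reverse the shift afterwards to restore the original layout.
restored = torch.roll(shifted, shifts=(shift, shift), dims=(0, 1))
assert torch.equal(restored, feat)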
18. 18
Swin Transformer
Variants
The paper also introduces Swin-T, Swin-S, and Swin-L, versions of about 0.25×, 0.5×, and 2× the
model size and computational complexity of the base model Swin-B.
The channel dimension C goes from 96 to 128 and 192, and the number of blocks in the third stage changes from 6 to 18,
while the other stages keep 2 blocks.
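For reference, the variant configurations reported in the paper, summarized as a small Python dict (the dict itself is just our summary, not code from the repository):

# C = channel dimension of the first stage; depths = Swin blocks per stage
swin_variants = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},   # ~0.25x Swin-B
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},  # ~0.5x  Swin-B
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},  # base model
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},  # ~2x    Swin-B
}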
20. 20
Swin Transformer
Ablation study
The paper shows the results of an ablation study on the shifted-window approach. The speed of the padding and
cyclic-shift methods is also compared.
25. 25
Swin Transformer
Conclusion
Using a similar architecture for both NLP and computer vision could significantly accelerate the research
process.
This paper presents Swin Transformer, a new vision Transformer which produces a hierarchical feature
representation and has linear computational complexity with respect to input image size.
As a key element of Swin Transformer, the shifted window based self-attention is shown to be effective and
efficient on vision problems.
28. Sources
Original Paper
Paper (https://arxiv.org/pdf/2103.14030.pdf)
GitHub (https://github.com/microsoft/Swin-Transformer)
YouTube
AI Bites (https://www.youtube.com/watch?v=tFYxJZBAbE8)
What's AI (https://www.youtube.com/watch?v=QcCJJOLCeJQ&t=4s)
Blog
Hello Joo World! (https://velog.io/@hangjoo_-)
Deep.log (https://velog.io/@yhyj1001)
Presenter: Seonok Kim (sokim0991@korea.ac.kr)