Jaemin Jeong Seminar 2
General-purpose backbone for computer vision.
Hierarchical Transformer whose representation is
computed with shifted windows.
Swin Transformer
Jaemin Jeong Seminar 3
Architecture
https://hangjo-o.tistory.com/69
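Since the architecture diagram itself is not reproduced here, the comment-only sketch below summarizes the four-stage hierarchy described in the paper (C is the stage-1 channel width, which varies per variant).

```python
# Sketch of the Swin Transformer stage layout (shapes per the paper; module
# internals omitted). Input: an H x W x 3 image.
#
# Patch Partition (4x4 patches)            -> (H/4)  x (W/4)  x 48
# Stage 1: Linear Embedding + Swin blocks  -> (H/4)  x (W/4)  x C
# Stage 2: Patch Merging    + Swin blocks  -> (H/8)  x (W/8)  x 2C
# Stage 3: Patch Merging    + Swin blocks  -> (H/16) x (W/16) x 4C
# Stage 4: Patch Merging    + Swin blocks  -> (H/32) x (W/32) x 8C
```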
Jaemin Jeong Seminar 4
Patch Merging
[Figure: patch merging — patches at positions 1, 2, 3 and 4 of every 2×2 neighborhood are grouped and concatenated channel-wise, halving the spatial resolution (see the sketch below).]
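A minimal PyTorch sketch of the patch-merging step illustrated above, assuming features in (B, H, W, C) layout with even H and W; the class name is illustrative, while the 2×2 concatenation followed by a 4C → 2C linear reduction follows the paper.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches, then reduce 4C -> 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]     # top-left patch of every 2x2 group
        x1 = x[:, 1::2, 0::2, :]     # bottom-left
        x2 = x[:, 0::2, 1::2, :]     # top-right
        x3 = x[:, 1::2, 1::2, :]     # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```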
Jaemin Jeong Seminar 5
 𝑀 × 𝑀 windows (7 × 7 by default)
 Proposes computing self-attention within these local, non-overlapping windows (see the sketch below).
Self-attention in non-overlapped windows
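A minimal sketch of the non-overlapping window partition (and its inverse), assuming (B, H, W, C) features with H and W divisible by M; the function names are illustrative. Since attention is restricted to M × M tokens per window, the cost grows linearly with the number of windows and hence with image size.

```python
import torch

def window_partition(x, M):
    """Split (B, H, W, C) features into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    # (num_windows * B, M*M, C): each window is an independent attention sequence
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition: back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // M) * (W // M))
    x = windows.view(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)
```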
Jaemin Jeong Seminar 6
Shifted window partitioning in successive blocks
 The window-based self-attention module lacks connections across windows, which limits its modeling power.
 W-MSA: self-attention with the regular (non-overlapping) window partitioning
 SW-MSA: self-attention with windows shifted by (⌊M/2⌋, ⌊M/2⌋) pixels; alternating W-MSA and SW-MSA in successive blocks introduces cross-window connections (see the block formulation below)
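For reference, the consecutive-block formulation from the paper, where zˡ is the output of block l and LN denotes layer normalization:

```latex
\hat{z}^{l}   = \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \qquad
z^{l}         = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}, \\
\hat{z}^{l+1} = \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, \qquad
z^{l+1}       = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.
```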
Jaemin Jeong Seminar 7
Swin Transformer
Jaemin Jeong Seminar 8
 After shifting, some windows are smaller than M × M (and the number of windows grows).
 Padding them back to M × M would increase computation.
 Cyclic shift toward the top-left keeps the number of M × M windows unchanged; an attention mask restricts each token to others from the same original sub-window (see the sketch below).
Efficient batch computation for shifted configuration
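A minimal sketch of the cyclic shift, assuming (B, H, W, C) features; torch.roll moves the map toward the top-left so the regular window partition can be reused without padding, and the masked-attention step the paper pairs with it is only indicated in the comments.

```python
import torch

M = 7                 # window size used in the paper
shift = M // 2        # shifted windows are displaced by floor(M/2) pixels

def cyclic_shift(x):
    """Roll (B, H, W, C) features toward the top-left.

    After the roll, a regular M x M partition yields the same number of
    windows as the unshifted case, so no padding is needed; an attention
    mask then keeps tokens from different original sub-windows apart.
    """
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

def reverse_cyclic_shift(x):
    """Undo the roll after window attention."""
    return torch.roll(x, shifts=(shift, shift), dims=(1, 2))
```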
Jaemin Jeong Seminar 9
Jaemin Jeong Seminar 10
 Including a relative position bias 𝐵 ∈ ℝ^(𝑀²×𝑀²):
Attention(𝑄, 𝐾, 𝑉) = SoftMax(𝑄𝐾ᵀ / √𝑑 + 𝐵) 𝑉
Relative position bias
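A single-head sketch of how 𝐵 can be parameterized, following the paper's idea of indexing a smaller learnable table of (2M−1) × (2M−1) relative offsets; variable names are illustrative.

```python
import torch
import torch.nn as nn

M = 7  # window size

# Learnable bias table: one entry per possible relative offset (single head here).
relative_position_bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2))

# Pre-compute, for every pair of tokens in an M x M window, which table entry to use.
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords_flat = coords.flatten(1)                                       # (2, M*M)
relative_coords = coords_flat[:, :, None] - coords_flat[:, None, :]   # (2, M*M, M*M)
relative_coords = relative_coords.permute(1, 2, 0)                    # (M*M, M*M, 2)
relative_coords[:, :, 0] += M - 1            # shift offsets to start from 0
relative_coords[:, :, 1] += M - 1
relative_coords[:, :, 0] *= 2 * M - 1        # flatten (dy, dx) into a single index
relative_position_index = relative_coords.sum(-1)                     # (M*M, M*M)

# B as used in the attention formula above:
B = relative_position_bias_table[relative_position_index.view(-1)].view(M * M, M * M)
# attn = softmax(q @ k.transpose(-2, -1) / d ** 0.5 + B) @ v
```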
Jaemin Jeong Seminar 11
Swin Transformer
 Swin-B is the base model; Swin-T, Swin-S and Swin-L are versions of about 0.25×, 0.5× and 2× its model size and computational complexity, respectively.
 Window size M = 7, query dimension per head d = 32 (per-variant configurations are listed below).
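For reference, the per-variant settings reported in the paper (stage-1 channel width C and the number of blocks in each of the four stages); the dictionary layout below is only for illustration.

```python
# Variant hyper-parameters from the paper (window size M = 7, head dim d = 32).
swin_variants = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}
```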
Jaemin Jeong Seminar 12
Experiments - ImageNet
ImageNet-22K (used for pre-training): 14.2 million images and 22K classes; classification accuracy is reported on ImageNet-1K.
Jaemin Jeong Seminar 13
Experiments - Object Detection
Jaemin Jeong Seminar 14
Semantic Segmentation
Jaemin Jeong Seminar 15
Ablation Study
Jaemin Jeong Seminar 16
Proposed the Swin Transformer, a new vision Transformer.
It produces a hierarchical feature representation and has linear
computational complexity with respect to input image size.
Shifted-window-based self-attention is shown to be effective
and efficient on vision problems.
Conclusion
