Paper introduction: Is Space-Time Attention All You Need for Video Understanding? (slides by Toru Tamaki)

2. ◼A Transformer for video classification: TimeSformer
• Video classification without convolutions
• Built entirely on Self-Attention
◼Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021]
• Applies the Transformer to image classification
• Embedding: the image is split into patches, each linearly projected to a token
• Positional embeddings are added to the tokens
• Transformer Encoder: a stack of blocks, each containing
• Self-Attention
• MLP
• Classification head on the output token
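The patch-embedding step above can be sketched in plain NumPy (a minimal illustration with a random projection matrix; the dimensions follow ViT-Base, and `W_embed` stands in for the learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 16, 768                       # patch size and embedding dim (ViT-Base)
img = rng.standard_normal((224, 224, 3))

# Split the image into non-overlapping P x P patches and flatten each one.
H = W = 224 // P                     # 14 x 14 patch grid
patches = img.reshape(H, P, W, P, 3).transpose(0, 2, 1, 3, 4).reshape(H * W, P * P * 3)

# Linear projection of each flattened patch to a D-dim token
# (equivalent to a 2D conv with kernel size P and stride P).
W_embed = rng.standard_normal((P * P * 3, D)) * 0.02
tokens = patches @ W_embed

print(tokens.shape)                  # (196, 768)
```

The equivalence to a strided 2D convolution is why the slides describe the embedding as "2D Conv".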
3. ◼ViViT [Arnab+, ICCV2021]
• Embedding with 3D Conv: tubelets spanning space and time
◼This paper (TimeSformer)
• Keeps ViT's frame-wise 2D Conv embedding
(Figure: ViViT embeds with 3D Conv, TimeSformer with 2D Conv; both feed the resulting tokens into a Transformer Encoder with Self-Attention)
4. ◼TimeSformer
• ViT extended to video
• Embedding: frame-wise 2D Conv, as in ViT
• Each block applies Time Attention, then Space Attention
• Attention is factorized over time and space
• 12 blocks, each with residual connections
(Figure: 2D-Conv embedding → Transformer Encoder of 12 blocks; inside each block, Time Attn followed by Space Attn with residual (+) connections)
5. Self-Attention Architectures
◼Five Self-Attention designs are compared
• Space Attention (S)
• Attn only over patches within the same frame
• Joint Space-Time Attention (ST)
• Attn over all patches across all frames
• Divided Space-Time Attention (S+T)
• Temporal Attn and spatial Attn applied separately in sequence
• Sparse Local Global Attention (L+G)
• Local-neighborhood Attn combined with strided global Attn
• Axial Attention (T+W+H)
• Attn applied along the time, width, and height axes in turn
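Divided Space-Time Attention can be sketched as follows (a simplified NumPy illustration: single head, identity Q/K/V projections, no CLS token, and no residual connections, all of which the real model has):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Batched single-head self-attention with identity Q/K/V (illustrative).
    # x: (B, L, D) -> (B, L, D)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_space_time_attention(tokens, T, N):
    # tokens: (T*N, D) patch tokens for T frames of N patches each
    D = tokens.shape[-1]
    x = tokens.reshape(T, N, D)
    # Time attention: each spatial location attends across the T frames.
    xt = attention(x.transpose(1, 0, 2))      # (N, T, D)
    x = xt.transpose(1, 0, 2)                 # (T, N, D)
    # Space attention: each frame's N patches attend within the frame.
    x = attention(x)                          # (T, N, D)
    return x.reshape(T * N, D)

tokens = np.random.default_rng(0).standard_normal((8 * 196, 64))
out = divided_space_time_attention(tokens, T=8, N=196)
print(out.shape)                              # (1568, 64)
```

The key point is the reshape: the same token tensor is viewed as N sequences of length T for the temporal step, then as T sequences of length N for the spatial step.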
6. ◼Datasets
• Kinetics-400 (K400) [Kay+, arXiv2017]
• Kinetics-600 (K600) [Carreira+, arXiv2018]
• Something-Something-v2 (SSv2) [Goyal+, ICCV2017]
• Diving-48 [Li+, ECCV2018]
◼Input clips
• Resolution 224 × 224
• 8 frames
• Sampled at a rate of 1/32
◼Models
• TimeSformer
• TimeSformer-HR
◼Pretraining
• ImageNet-21k (I21K)
• ImageNet-1k (I1K)
◼Training
• 15 epochs
• Optimizer: SGD
• Momentum 0.9
• Weight decay 0.0001
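One SGD update with the momentum and weight decay listed above looks like this (a minimal sketch; the learning rate here is an arbitrary illustrative value, not the paper's schedule):

```python
import numpy as np

lr, momentum, weight_decay = 0.005, 0.9, 1e-4   # lr is illustrative only

w = np.ones(3)                  # toy parameter vector
v = np.zeros_like(w)            # momentum buffer
grad = np.array([0.1, -0.2, 0.3])

grad = grad + weight_decay * w  # L2-style weight decay folded into the gradient
v = momentum * v + grad         # momentum accumulation
w = w - lr * v                  # parameter update
print(w)
```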
7. ◼Experiments
1. Analysis of Self-Attention Schemes
2. Comparison to 3D CNNs
3. Varying the Number of Tokens
4. The Importance of Positional Embeddings
5. Comparison to the State-of-the-Art
8. 1. Analysis of Self-Attention Schemes
✓Compare the five Self-Attention designs
• Space Attention (S)
• Joint Space-Time Attention (ST)
• Divided Space-Time Attention (S+T)
• Sparse Local Global Attention (L+G)
• Axial Attention (T+W+H)
✓Compare ST and S+T while scaling the input
• Resolution: 224, 336, 448, 560
• Frames: 8, 32, 64, 96
◼Setup
• Datasets: K400, SSv2
• Pretrained on I21K
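Why divided attention scales better than joint attention can be seen by counting attended query-key pairs per layer (a rough sketch; real cost also depends on heads and embedding dims):

```python
def joint_pairs(T, N):
    # Joint Space-Time (ST): every one of the T*N tokens attends to all tokens.
    L = T * N
    return L * L

def divided_pairs(T, N):
    # Divided (S+T): time step has N sequences of length T,
    # space step has T sequences of length N -> T*N*(T + N) pairs.
    return N * T * T + T * N * N

T, N = 8, 14 * 14                 # default: 8 frames, 14x14 patch grid
print(joint_pairs(T, N))          # 2458624
print(divided_pairs(T, N))        # 319872
```

The gap widens quickly as resolution or clip length grows, which is why only S+T remains practical at 560-pixel or 96-frame inputs.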
10. 2. Comparison to 3D CNNs
✓Compare against 3D CNNs in terms of
• Number of parameters
• Inference cost
• Training time
• Accuracy
• Effect of pretraining
• I21K vs. I1K
◼Models
• TimeSformer
• I3D R50 [Wang+, CVPR2018]
• SlowFast R50 [Feichtenhofer+, ICCV2019]
◼Dataset
• K400
✓Effect of the pretraining dataset
• I21K vs. I1K
◼TimeSformer variants (frames, resolution)
• TimeSformer
• 8 frames at 224 × 224
• TimeSformer-HR
• 16 frames at 448 × 448
• TimeSformer-L
• 96 frames at 224 × 224
◼Datasets
• K400, SSv2
12. 3. Varying the Number of Tokens
✓Vary the input size
• Resolution: 224 (default), 336, 448, 560
• Frames: 8 (default), 32, 64, 96
◼Patch size
• 16 × 16
◼Resulting token grid (frames × H patches × W patches):

Frames |     224      |     336      |     448      |     560
   8   | 8 × 14 × 14  | 8 × 21 × 21  | 8 × 28 × 28  | 8 × 35 × 35
  32   | 32 × 14 × 14 | 32 × 21 × 21 | 32 × 28 × 28 | 32 × 35 × 35
  64   | 64 × 14 × 14 | 64 × 21 × 21 | 64 × 28 × 28 | 64 × 35 × 35
  96   | 96 × 14 × 14 | 96 × 21 × 21 | 96 × 28 × 28 | 96 × 35 × 35
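The table entries follow directly from the 16 × 16 patch size, which a few lines of Python reproduce:

```python
def token_grid(frames, resolution, patch=16):
    """Token grid (T, H', W') for square input frames and square patches."""
    side = resolution // patch
    return (frames, side, side)

def num_tokens(frames, resolution, patch=16):
    t, h, w = token_grid(frames, resolution, patch)
    return t * h * w

print(token_grid(8, 224))    # (8, 14, 14)  -- the default setting
print(num_tokens(96, 560))   # 96 * 35 * 35 = 117600 tokens at the largest input
```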
14. The Importance of Positional Embeddings
◼Compare variants of the positional embedding
• No positional embedding
• Space-only positional embedding
• Space-time positional embedding
◼Setup
• Datasets: K400, SSv2
• Pretrained on I21K
(Figure: TimeSformer architecture diagram, as on slide 4: 2D-Conv embedding, Transformer Encoder with Time Attention and Space Attention, residual (+) connections)
16. Comparison to the State-of-the-Art
✓Compared SOTA methods
• R(2+1)D [Tran+, arXiv2018]
• bLVNet [Fan+, 2019]
• TSM [Lin+, ICCV2019]
• S3D-G [Xie+, ECCV2018]
• Oct-I3D+NL [Chen+, ICCV2019]
• D3D [Stroud+, WACV2020]
• I3D+NL [Wang+, CVPR2018]
• Ip-CSN-152 [Tran+, ICCV2019]
• CorrNet [Wang+, CVPR2020]
• LGD-3D-101 [Qiu+, CVPR2019]
• SlowFast [Feichtenhofer+, ICCV2019]
• X3D-XXL [Feichtenhofer+, CVPR2020]
◼Setup
• Two benchmark groups:
1. K400, K600
2. SSv2, Div48
• Pretraining
• I21K
◼Metrics
• Top-1 accuracy, Top-5 accuracy, TFLOPs
18. ◼Summary: TimeSformer, a Transformer for video classification
• Video classification without convolutions
• Built entirely on Self-Attention
• Uses Divided Space-Time Attention
◼Results
• Competitive with 3D CNNs at lower cost
• Achieves SOTA accuracy
• Scales to larger inputs
◼Experiments covered
• Self-Attention scheme comparison
• Comparison to 3D CNNs
• Varying the number of tokens
• Positional embeddings