Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius, Heng Wang, Lorenzo Torresani
ICML 2021
2023/5/11
◼Transformer : TimeSformer
•
• Self-Attention
◼Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021]
• Transformer
• Embedding
•
• Transformer Encoder
• Self-Attention
• MLP
• Head
◼ViViT [Arnab+, ICCV2021]
• Embedding: 3D Conv
◼TimeSformer
• Embedding: 2D Conv, as in ViT
Figure: ViViT (Embedding: 3D Conv → Transformer Encoder) and TimeSformer (Embedding: 2D Conv → Transformer Encoder); panel labels: Attention, Self-Attention
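The "Embedding: 2D Conv" step can be illustrated with a minimal sketch. It assumes a single-channel frame, one output channel, and a kernel whose size equals its stride (the patch size), and shows that such a convolution gives the same values as cutting the frame into non-overlapping patches and projecting each flattened patch linearly. The function names and the toy 4 × 4 frame are hypothetical and not taken from the slides.

```python
def conv2d_patch_embed(frame, kernel, patch):
    """Single-channel 2D convolution with kernel size = stride = patch."""
    height, width = len(frame), len(frame[0])
    out = []
    for i in range(0, height, patch):
        row = []
        for j in range(0, width, patch):
            row.append(sum(kernel[a][b] * frame[i + a][j + b]
                           for a in range(patch) for b in range(patch)))
        out.append(row)
    return out


def linear_patch_embed(frame, kernel, patch):
    """Cut non-overlapping patches, flatten each one, and take a dot product."""
    weights = [v for r in kernel for v in r]  # flattened kernel = projection vector
    height, width = len(frame), len(frame[0])
    tokens = []
    for i in range(0, height, patch):
        for j in range(0, width, patch):
            flat = [frame[i + a][j + b]
                    for a in range(patch) for b in range(patch)]
            tokens.append(sum(x * w for x, w in zip(flat, weights)))
    return tokens


# Toy 4 x 4 frame and 2 x 2 kernel (assumed values for illustration only).
frame = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
kernel = [[1, 0],
          [0, 1]]

conv_out = conv2d_patch_embed(frame, kernel, patch=2)
lin_out = linear_patch_embed(frame, kernel, patch=2)
print(conv_out)  # [[5, 9], [21, 25]]
print(lin_out)   # [5, 9, 21, 25]
```

With a 2D embedding this is applied to each frame independently, so the number of tokens grows linearly with the number of frames.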
◼TimeSformer
•
• 2D Conv
• Time Attention, Space Attention
• Attention
•
Figure: TimeSformer architecture. Embedding: 2D Conv → Transformer Encoder × 12, each block with Time Attention (Time Attn) followed by Space Attention (Space Attn)
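Self-Attention Architectures
◼Self-Attention schemes
• Space Attention (S)
  • Attn only among patches of the same frame
• Joint Space-Time Attention (ST)
  • Attn over all patches of all frames
• Divided Space-Time Attention (S+T)
  • Time Attn and Space Attn applied separately, one after the other
• Sparse Local Global Attention (L+G)
  • Local Attn over neighboring patches, then sparse global Attn
• Axial Attention (T+W+H)
  • Attn along the time, width, and height axes separately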
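To make the cost difference between the schemes concrete, the sketch below counts how many keys a single patch query is compared against under S, ST, and S+T. It assumes F frames, N patches per frame, and one classification token that every attention pass also attends to. The function name is hypothetical.

```python
def keys_per_query(scheme, num_frames, patches_per_frame):
    """Number of keys one patch query is compared against,
    counting the classification token once per attention pass."""
    f, n = num_frames, patches_per_frame
    if scheme == "S":      # space only: patches of the same frame + cls
        return n + 1
    if scheme == "ST":     # joint: every patch of every frame + cls
        return n * f + 1
    if scheme == "S+T":    # divided: time pass (F + 1), then space pass (N + 1)
        return (f + 1) + (n + 1)
    raise ValueError(f"unknown scheme: {scheme}")


# 8 frames of 224 x 224 pixels with 16 x 16 patches -> 14 * 14 = 196 patches per frame
for scheme in ("S", "ST", "S+T"):
    print(scheme, keys_per_query(scheme, num_frames=8, patches_per_frame=196))
# S 197
# ST 1569
# S+T 206
```

The joint scheme grows with the product N × F, while the divided scheme grows with the sum N + F, which is why the gap widens as resolution or clip length increases.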
◼Datasets
• Kinetics-400 (K400) [Kay+, arXiv2017]
• Kinetics-600 (K600) [Carreira+, arXiv2018]
• Something-Something-v2 (SSv2) [Goyal+, ICCV2017]
• Diving-48 [Li+, ECCV2018]
◼Input clip
• Resolution: 224 × 224
• Frames: 8
• Frame sampling rate: 1/32
◼Models
• TimeSformer
• TimeSformer-HR
◼Pretraining
• ImageNet-21k (I21K)
• ImageNet-1k (I1K)
◼Training
• Epochs: 15
• Optimizer: SGD
• Momentum: 0.9
• Weight decay: 0.0001
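As a rough illustration of what the optimizer settings above mean, here is a minimal sketch of one SGD step with momentum 0.9 and weight decay 0.0001. It follows the convention used by common frameworks such as PyTorch's `torch.optim.SGD`, where weight decay is added to the gradient before the momentum update. The learning rate and the toy values are assumptions; the slide does not state them.

```python
def sgd_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay on flat lists of floats."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # fold the L2 penalty into the gradient
        v = momentum * v + g       # accumulate the momentum buffer
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


# Toy values; lr=0.1 is a hypothetical learning rate chosen for illustration.
weights, velocity = [1.0], [0.0]
weights, velocity = sgd_step(weights, grads=[0.5], velocity=velocity, lr=0.1)
print(weights, velocity)
```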
1. Analysis of Self-Attention Schemes
2. Comparison to 3D CNNs
3. Varying the Number of Tokens
4. The Importance of Positional Embeddings
5. Comparison to the State-of-the-Art
1. Analysis of Self-Attention Schemes
✓Self-Attention
• Space Attention (S)
• Joint Space-Time Attention (ST)
• Divided Space-Time Attention (S+T)
• Sparse Local Global Attention (L+G)
• Axial Attention (T+W+H)
✓ST vs. S+T
• Resolution: 224, 336, 448, 560
• Frames: 8, 32, 64, 96
◼Settings
• Datasets: K400, SSv2
• Pretraining: I21K
◼Self-Attention
• Divided Space-Time
• Space Time Attention
◼ST vs. S+T
• S+T (Divided)
2. Comparison to 3D CNNs
✓3D CNN
•
•
•
•
•
• I21K, I1K
◼Models
• TimeSformer
• I3D R50 [Wang+, CVPR2018]
• SlowFast R50 [Feichtenhofer+, ICCV2019]
◼Dataset
• K400
✓Pretraining
• I21K vs. I1K
◼Models (frames × height × width)
• TimeSformer: 8 × 224 × 224
• TimeSformer-HR: 16 × 448 × 448
• TimeSformer-L: 96 × 224 × 224
◼Datasets
• K400, SSv2
◼3D CNN
• TimeSformer
• TimeSformer
• I21K
◼
• TimeSformer
I21K
3. Varying the Number of Tokens
✓Varied settings
• Resolution: 224 (default), 336, 448, 560
• Frames: 8 (default), 32, 64, 96
◼Number of tokens (frames × height × width)
• Patch size: 16 × 16

Frames \ Resolution   224            336            448            560
8                     8 × 14 × 14    8 × 21 × 21    8 × 28 × 28    8 × 35 × 35
32                    32 × 14 × 14   32 × 21 × 21   32 × 28 × 28   32 × 35 × 35
64                    64 × 14 × 14   64 × 21 × 21   64 × 28 × 28   64 × 35 × 35
96                    96 × 14 × 14   96 × 21 × 21   96 × 28 × 28   96 × 35 × 35
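The token grid follows directly from the 16 × 16 patch size: each spatial side is the resolution divided by 16, repeated for every frame. A small sketch that reproduces these entries and the resulting total token counts (function names are hypothetical):

```python
def token_grid(num_frames, resolution, patch_size=16):
    """Return (frames, patches along height, patches along width)."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return num_frames, side, side


def num_tokens(num_frames, resolution, patch_size=16):
    """Total number of patch tokens in one clip."""
    f, h, w = token_grid(num_frames, resolution, patch_size)
    return f * h * w


for frames in (8, 32, 64, 96):
    for resolution in (224, 336, 448, 560):
        f, h, w = token_grid(frames, resolution)
        print(f"{frames:>2} frames @ {resolution}: "
              f"{f} x {h} x {w} = {num_tokens(frames, resolution)} tokens")
# The default setting, 8 frames at 224 x 224, gives 8 * 14 * 14 = 1568 tokens.
```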
◼
•
•
◼
•
4. The Importance of Positional Embeddings
◼
•
•
•
•
◼Settings
• Datasets: K400, SSv2
• Pretraining: I21K
Figure: TimeSformer architecture (Embedding: 2D Conv; Transformer Encoder with Time Attention and Space Attention)
◼Space-Time positional embedding
• Accuracy goes up
5. Comparison to the State-of-the-Art
✓SOTA
• R(2+1)D [Tran+, arXiv2018]
• bLVNet [Fan+, 2019]
• TSM [Lin+, ICCV2019]
• S3D-G [Xie+, ECCV2018]
• Oct-I3D+NL [Chen+, ICCV2019]
• D3D [Stroud+, WACV2020]
• I3D+NL [Wang+, CVPR2018]
• Ip-CSN-152 [Tran+, ICCV2019]
• CorrNet [Wang+, CVPR2020]
• LGD-3D-101 [Qiu+, CVPR2019]
• SlowFast [Feichtenhofer+, ICCV2019]
• X3D-XXL [Feichtenhofer+, CVPR2020]
◼Settings
• Datasets
1. K400, K600
2. SSv2, Div48
• Pretraining: I21K
◼Metrics
• Top-1, top-5, TFLOPs
Tables: results on K400, K600, SSv2, Div48
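For reference, top-1 and top-5 accuracy are both instances of top-k accuracy: a prediction counts as correct when the true label is among the k highest-scoring classes. A minimal sketch, with hypothetical toy scores over three classes (so k = 2 stands in for top-5):

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy class scores for three samples (assumed values for illustration only).
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]

print(topk_accuracy(scores, labels, k=1))  # 1 of 3 samples correct
print(topk_accuracy(scores, labels, k=2))  # 2 of 3 samples correct
```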
◼Transformer : TimeSformer
•
• Self-Attention
• Divided Space-Time Attention
◼
•
• SOTA
•
◼
• Self-Attention
• 3D CNN
• Token
• Positional embedding
