SlideShare a Scribd company logo
Is Space-Time Attention
All You Need for
Video Understanding?
Gedas Bertasius, Heng Wang, Lorenzo Torresani,
ICML2021
2023/5/11
◼Transformer : TimeSformer
•
• Self-Attention
◼Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021]
• Transformer
• Embeding
•
• Transformer Encoder
• Self-Attention
• MLP
• Head
◼ViViT [Arnab+, ICCV2021]
• Embedding 3D Conv
◼ (TimeSformer)
• ViT 2D Conv
Embedding:
3D Conv
Embedding:
2D Conv
Transformer
Encoder
Transformer
Encoder
.
.
.
.
.
.
.
.
.
.
..
.
..
Attention
Self-Attention
◼TimeSformer
•
• 2D Conv
• Time Attention, Space Attention
• Attention
•
Embedding:
2D Conv
.
.
. .
.
.
.
..
Transformer Encoder
Time
Attention
Space
Attention
× 12
Time Attn, Space Attn
Time Attn Space Attn
.
.
.
+
Self-Attention Architectures
◼ Self-Attention
• Space Attention (S)
• Attn
• Joint Space-Time Attention (ST)
• Attn
• Divided Space-Time Attention (S+T)
• Attn
• Sparse Local Global Attention (L+G)
• Attn
• Axial Attention (T+W+H)
•
Attn
◼
• Kinetics-400 (K400) [Kay+, arXiv2017]
• Kinetics-600 (K600) [Carreira+, arXiv2018]
• Something-Something-v2 (SSv2)
[Goyal+, ICCV2017]
• Diving-48 [Li+, ECCV2018]
◼
• 224 × 224
• 8
•
1
32
◼
• TimeSformer
• TimeSformer-HR
◼
• ImageNet-21k (I21K)
• ImageNet-1k (I1K)
◼
• 15
• Optimizer SGD
• Momentum 0.9
• Weight decay 0.0001
1. Analysis of Self-Attention Schemes
2. Comparison to 3D CNNs
3. Varying the Number of Tokens
4. The Importance of Positional Embeddings
5. Comparison to the State-of-the-Art
1. Analysis of Self-Attention Schemes
✓Self-Attention
• Space Attention (S)
• Joint Space-Time Attention (ST)
• Divided Space-Time Attention (S+T)
• Sparse Local Global Attention (L+G)
• Axial Attention (T+W+H)
✓ST S+T
• 224, 336, 448, 560
• 8, 32, 64, 96
◼
• K400, SSv2
• I21K
◼Self-Attention
• Divided Space-Time
• Space Time Attention
◼ST S+T
• S+T (Divided)
2. Comparison to 3D CNNs
✓3D CNN
•
•
•
•
•
• I21K, I1K
◼
• TimeSformer
• I3D R50 [Wang+, CVPR2018]
• SlowFast R50 [Feichtenhofer+, ICCV2019]
◼
• K400
✓
• I21K I1K
◼
• TimeSformer
• 8 224 224
• TimeSformer-HR
• 16 448 448
• TimeSformer-L
• 96 224 224
◼
• K400, SSv2
◼3D CNN
• TimeSformer
• TimeSformer
• I21K
◼
• TimeSformer
I21K
3. Varying the Number of Tokens
✓
• 224 (default), 336, 448, 560
• 8 (default), 32, 64, 96
◼
• 16 × 16
224 336 448 560
8 8 × 14 × 14 8 × 21 × 21 8 × 28 × 28 8 × 35 × 35
32 32 × 14 × 14 32 × 21 × 21 32 × 28 × 28 32 × 35 × 35
64 64 × 14 × 14 64 × 21 × 21 64 × 28 × 28 64 × 35 × 35
96 96 × 14 × 14 96 × 21 × 21 96 × 28 × 28 96 × 35 × 35
◼
•
•
◼
•
The Importance of Positional Embeddings
◼
•
•
•
•
◼
• K400, SSv2
• I21K
Embedding:
2D Conv
.
.
. .
.
.
.
..
Transformer Encoder
Time
Attention
Space
Attention
.
.
.
+
◼Space-Time
• up
Comparison to the State-of-the-Art
✓SOTA
• R(2+1)D [Tran+, arXiv2018]
• bLVNet [Fan+, 2019]
• TSM [Lin+, ICCV2019]
• S3D-G [Xie+, ECCV2018]
• Oct-I3D+NL [Chen+, ICCV2019]
• D3D [Stroud+, WACV2020]
• I3D+NL [Wang+, CVPR2018]
• Ip-CSN-152 [Tran+, ICCV2019]
• CorrNet [Wang+, CVPR2020]
• LGD-3D-101 [Qiu+, CVPR2019]
• SlowFast [Feichtenhofer+, ICCV2019]
• X3D-XXL [Feichtenhofer+, CVPR2020]
◼
•
1. K400, K600
2. SSv2, Div48
•
• I21K
◼
• Top1, top5, TFLOPs
K400 K600
SSv2 Div48
◼Transformer : TimeSformer
•
• Self-Attention
• Divided Space-Time Attention
◼
•
• SOTA
•
◼
• Self-Attention
• 3D CNN
• Token
• Positional embedding

More Related Content

Similar to 論文紹介:Is Space-Time Attention All You Need for Video Understanding?

Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Databricks
 
Challenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache SparkChallenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache Spark
Databricks
 
Will the computer world collapse in 2038?
Will the computer world collapse in 2038?Will the computer world collapse in 2038?
Will the computer world collapse in 2038?Joris Berthelot
 
Video Transformers.pptx
Video Transformers.pptxVideo Transformers.pptx
Video Transformers.pptx
Sangmin Woo
 
xray at SciPy 2015
xray at SciPy 2015xray at SciPy 2015
xray at SciPy 2015
Stephan Hoyer
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
MITSUNARI Shigeo
 
論文紹介:Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Lear...
論文紹介:Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Lear...論文紹介:Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Lear...
論文紹介:Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Lear...
Toru Tamaki
 
G1 collector and tuning and Cassandra
G1 collector and tuning and CassandraG1 collector and tuning and Cassandra
G1 collector and tuning and Cassandra
Chris Lohfink
 
"Mesh of Periodic Minimal Surfaces in CGAL."
"Mesh of Periodic Minimal Surfaces in CGAL.""Mesh of Periodic Minimal Surfaces in CGAL."
"Mesh of Periodic Minimal Surfaces in CGAL."
Vissarion Fisikopoulos
 
Realtime Analytics with Apache Cassandra
Realtime Analytics with Apache CassandraRealtime Analytics with Apache Cassandra
Realtime Analytics with Apache Cassandra
Acunu
 
Playing in Tune: How We Refactored Cube to Terabyte Scale
Playing in Tune: How We Refactored Cube to Terabyte ScalePlaying in Tune: How We Refactored Cube to Terabyte Scale
Playing in Tune: How We Refactored Cube to Terabyte ScaleMongoDB
 
Harmony intune final
Harmony intune finalHarmony intune final
Harmony intune finalMongoDB
 
Real-Time Spatiotemporal Data Utilization For Future Mobility Services: Atsus...
Real-Time Spatiotemporal Data Utilization For Future Mobility Services: Atsus...Real-Time Spatiotemporal Data Utilization For Future Mobility Services: Atsus...
Real-Time Spatiotemporal Data Utilization For Future Mobility Services: Atsus...
Redis Labs
 
H 264 in cuda presentation
H 264 in cuda presentationH 264 in cuda presentation
H 264 in cuda presentationashoknaik120
 
WebRTC Standards & Implementation Q&A - Legacy API Support Changes
WebRTC Standards & Implementation Q&A - Legacy API Support ChangesWebRTC Standards & Implementation Q&A - Legacy API Support Changes
WebRTC Standards & Implementation Q&A - Legacy API Support Changes
Amir Zmora
 
Digifab Conf - Direct Dimensions - 3D Scanning for 3D Printing, Making Realit...
Digifab Conf - Direct Dimensions - 3D Scanning for 3D Printing, Making Realit...Digifab Conf - Direct Dimensions - 3D Scanning for 3D Printing, Making Realit...
Digifab Conf - Direct Dimensions - 3D Scanning for 3D Printing, Making Realit...
Direct Dimensions, Inc.
 
Garbage First Garbage Collector: Where the Rubber Meets the Road!
Garbage First Garbage Collector: Where the Rubber Meets the Road!Garbage First Garbage Collector: Where the Rubber Meets the Road!
Garbage First Garbage Collector: Where the Rubber Meets the Road!
Monica Beckwith
 
Scaling the #2ndhalf
Scaling the #2ndhalfScaling the #2ndhalf
Scaling the #2ndhalf
Salo Shp
 

Similar to 論文紹介:Is Space-Time Attention All You Need for Video Understanding? (18)

Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Challenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache SparkChallenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache Spark
 
Will the computer world collapse in 2038?
Will the computer world collapse in 2038?Will the computer world collapse in 2038?
Will the computer world collapse in 2038?
 
Video Transformers.pptx
Video Transformers.pptxVideo Transformers.pptx
Video Transformers.pptx
 
xray at SciPy 2015
xray at SciPy 2015xray at SciPy 2015
xray at SciPy 2015
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
 
論文紹介:Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Lear...
論文紹介:Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Lear...論文紹介:Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Lear...
論文紹介:Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Lear...
 
G1 collector and tuning and Cassandra
G1 collector and tuning and CassandraG1 collector and tuning and Cassandra
G1 collector and tuning and Cassandra
 
"Mesh of Periodic Minimal Surfaces in CGAL."
"Mesh of Periodic Minimal Surfaces in CGAL.""Mesh of Periodic Minimal Surfaces in CGAL."
"Mesh of Periodic Minimal Surfaces in CGAL."
 
Realtime Analytics with Apache Cassandra
Realtime Analytics with Apache CassandraRealtime Analytics with Apache Cassandra
Realtime Analytics with Apache Cassandra
 
Playing in Tune: How We Refactored Cube to Terabyte Scale
Playing in Tune: How We Refactored Cube to Terabyte ScalePlaying in Tune: How We Refactored Cube to Terabyte Scale
Playing in Tune: How We Refactored Cube to Terabyte Scale
 
Harmony intune final
Harmony intune finalHarmony intune final
Harmony intune final
 
Real-Time Spatiotemporal Data Utilization For Future Mobility Services: Atsus...
Real-Time Spatiotemporal Data Utilization For Future Mobility Services: Atsus...Real-Time Spatiotemporal Data Utilization For Future Mobility Services: Atsus...
Real-Time Spatiotemporal Data Utilization For Future Mobility Services: Atsus...
 
H 264 in cuda presentation
H 264 in cuda presentationH 264 in cuda presentation
H 264 in cuda presentation
 
WebRTC Standards & Implementation Q&A - Legacy API Support Changes
WebRTC Standards & Implementation Q&A - Legacy API Support ChangesWebRTC Standards & Implementation Q&A - Legacy API Support Changes
WebRTC Standards & Implementation Q&A - Legacy API Support Changes
 
Digifab Conf - Direct Dimensions - 3D Scanning for 3D Printing, Making Realit...
Digifab Conf - Direct Dimensions - 3D Scanning for 3D Printing, Making Realit...Digifab Conf - Direct Dimensions - 3D Scanning for 3D Printing, Making Realit...
Digifab Conf - Direct Dimensions - 3D Scanning for 3D Printing, Making Realit...
 
Garbage First Garbage Collector: Where the Rubber Meets the Road!
Garbage First Garbage Collector: Where the Rubber Meets the Road!Garbage First Garbage Collector: Where the Rubber Meets the Road!
Garbage First Garbage Collector: Where the Rubber Meets the Road!
 
Scaling the #2ndhalf
Scaling the #2ndhalfScaling the #2ndhalf
Scaling the #2ndhalf
 

More from Toru Tamaki

論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
Toru Tamaki
 
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
Toru Tamaki
 
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
Toru Tamaki
 
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Toru Tamaki
 
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Toru Tamaki
 
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
Toru Tamaki
 
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
Toru Tamaki
 
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
Toru Tamaki
 
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
Toru Tamaki
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
Toru Tamaki
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet
Toru Tamaki
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey
Toru Tamaki
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
Toru Tamaki
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
Toru Tamaki
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation
Toru Tamaki
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
Toru Tamaki
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
Toru Tamaki
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning
Toru Tamaki
 
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
Toru Tamaki
 
論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA
Toru Tamaki
 

More from Toru Tamaki (20)

論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
 
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
 
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
 
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
 
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
 
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
 
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
 
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
 
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning
 
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
 
論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA
 

Recently uploaded

Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 

Recently uploaded (20)

Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 

論文紹介:Is Space-Time Attention All You Need for Video Understanding?

  • 1. Is Space-Time Attention All You Need for Video Understanding? Gedas Bertasius, Heng Wang, Lorenzo Torresani, ICML2021 2023/5/11
  • 2. ◼Transformer : TimeSformer • • Self-Attention ◼Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021] • Transformer • Embeding • • Transformer Encoder • Self-Attention • MLP • Head
  • 3. ◼ViViT [Arnab+, ICCV2021] • Embedding 3D Conv ◼ (TimeSformer) • ViT 2D Conv Embedding: 3D Conv Embedding: 2D Conv Transformer Encoder Transformer Encoder . . . . . . . . . . .. . .. Attention Self-Attention
  • 4. ◼TimeSformer • • 2D Conv • Time Attention, Space Attention • Attention • Embedding: 2D Conv . . . . . . . .. Transformer Encoder Time Attention Space Attention × 12 Time Attn, Space Attn Time Attn Space Attn . . . +
  • 5. Self-Attention Architectures ◼ Self-Attention • Space Attention (S) • Attn • Joint Space-Time Attention (ST) • Attn • Divided Space-Time Attention (S+T) • Attn • Sparse Local Global Attention (L+G) • Attn • Axial Attention (T+W+H) • Attn
  • 6. ◼ • Kinetics-400 (K400) [Kay+, arXiv2017] • Kinetics-600 (K600) [Carreira+, arXiv2018] • Something-Something-v2 (SSv2) [Goyal+, ICCV2017] • Diving-48 [Li+, ECCV2018] ◼ • 224 × 224 • 8 • 1 32 ◼ • TimeSformer • TimeSformer-HR ◼ • ImageNet-21k (I21K) • ImageNet-1k (I1K) ◼ • 15 • Optimizer SGD • Momentum 0.9 • Weight decay 0.0001
  • 7. 1. Analysis of Self-Attention Schemes 2. Comparison to 3D CNNs 3. Varying the Number of Tokens 4. The Importance of Positional Embeddings 5. Comparison to the State-of-the-Art
  • 8. 1. Analysis of Self-Attention Schemes ✓Self-Attention • Space Attention (S) • Joint Space-Time Attention (ST) • Divided Space-Time Attention (S+T) • Sparse Local Global Attention (L+G) • Axial Attention (T+W+H) ✓ST S+T • 224, 336, 448, 560 • 8, 32, 64, 96 ◼ • K400, SSv2 • I21K
  • 9. ◼Self-Attention • Divided Space-Time • Space Time Attention ◼ST S+T • S+T (Divided)
  • 10. 2. Comparison to 3D CNNs ✓3D CNN • • • • • • I21K, I1K ◼ • TimeSformer • I3D R50 [Wang+, CVPR2018] • SlowFast R50 [Feichtenhofer+, ICCV2019] ◼ • K400 ✓ • I21K I1K ◼ • TimeSformer • 8 224 224 • TimeSformer-HR • 16 448 448 • TimeSformer-L • 96 224 224 ◼ • K400, SSv2
  • 11. ◼3D CNN • TimeSformer • TimeSformer • I21K ◼ • TimeSformer I21K
  • 12. 3. Varying the Number of Tokens ✓ • 224 (default), 336, 448, 560 • 8 (default), 32, 64, 96 ◼ • 16 × 16 224 336 448 560 8 8 × 14 × 14 8 × 21 × 21 8 × 28 × 28 8 × 35 × 35 32 32 × 14 × 14 32 × 21 × 21 32 × 28 × 28 32 × 35 × 35 64 64 × 14 × 14 64 × 21 × 21 64 × 28 × 28 64 × 35 × 35 96 96 × 14 × 14 96 × 21 × 21 96 × 28 × 28 96 × 35 × 35
  • 14. The Importance of Positional Embeddings ◼ • • • • ◼ • K400, SSv2 • I21K Embedding: 2D Conv . . . . . . . .. Transformer Encoder Time Attention Space Attention . . . +
  • 16. Comparison to the State-of-the-Art ✓SOTA • R(2+1)D [Tran+, arXiv2018] • bLVNet [Fan+, 2019] • TSM [Lin+, ICCV2019] • S3D-G [Xie+, ECCV2018] • Oct-I3D+NL [Chen+, ICCV2019] • D3D [Stroud+, WACV2020] • I3D+NL [Wang+, CVPR2018] • Ip-CSN-152 [Tran+, ICCV2019] • CorrNet [Wang+, CVPR2020] • LGD-3D-101 [Qiu+, CVPR2019] • SlowFast [Feichtenhofer+, ICCV2019] • X3D-XXL [Feichtenhofer+, CVPR2020] ◼ • 1. K400, K600 2. SSv2, Div48 • • I21K ◼ • Top1, top5, TFLOPs
  • 18. ◼Transformer : TimeSformer • • Self-Attention • Divided Space-Time Attention ◼ • • SOTA • ◼ • Self-Attention • 3D CNN • Token • Positional embedding