Paper introduction: Is Space-Time Attention All You Need for Video Understanding? (slides by Toru Tamaki)

2. ◼A Transformer for video classification: TimeSformer
• Video classification without convolutions
• Built entirely on Self-Attention
◼Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021]
• Applies the Transformer to image classification
• Embedding: the image is split into patches, each linearly projected to a token
• Positional embeddings are added to the tokens
• Transformer Encoder: a stack of blocks, each containing
• Self-Attention
• MLP
• Classification head on the output token
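The patch-embedding step above can be sketched in plain NumPy (a minimal illustration with a random projection matrix; the dimensions follow ViT-Base, and `W_embed` stands in for the learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 16, 768                       # patch size and embedding dim (ViT-Base)
img = rng.standard_normal((224, 224, 3))

# Split the image into non-overlapping P x P patches and flatten each one.
H = W = 224 // P                     # 14 x 14 patch grid
patches = img.reshape(H, P, W, P, 3).transpose(0, 2, 1, 3, 4).reshape(H * W, P * P * 3)

# Linear projection of each flattened patch to a D-dim token
# (equivalent to a 2D conv with kernel size P and stride P).
W_embed = rng.standard_normal((P * P * 3, D)) * 0.02
tokens = patches @ W_embed

print(tokens.shape)                  # (196, 768)
```

The equivalence to a strided 2D convolution is why the slides describe the embedding as "2D Conv".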
3. ◼ViViT [Arnab+, ICCV2021]
• Embedding with 3D Conv: tubelets spanning space and time
◼This paper (TimeSformer)
• Keeps ViT's frame-wise 2D Conv embedding
(Figure: ViViT embeds with 3D Conv, TimeSformer with 2D Conv; both feed the resulting tokens into a Transformer Encoder with Self-Attention)
4. ◼TimeSformer
• ViT extended to video
• Embedding: frame-wise 2D Conv, as in ViT
• Each block applies Time Attention, then Space Attention
• Attention is factorized over time and space
• 12 blocks, each with residual connections
(Figure: 2D-Conv embedding → Transformer Encoder of 12 blocks; inside each block, Time Attn followed by Space Attn with residual (+) connections)
5. Self-Attention Architectures
◼Five Self-Attention designs are compared
• Space Attention (S)
• Attn only over patches within the same frame
• Joint Space-Time Attention (ST)
• Attn over all patches across all frames
• Divided Space-Time Attention (S+T)
• Temporal Attn and spatial Attn applied separately in sequence
• Sparse Local Global Attention (L+G)
• Local-neighborhood Attn combined with strided global Attn
• Axial Attention (T+W+H)
• Attn applied along the time, width, and height axes in turn
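Divided Space-Time Attention can be sketched as follows (a simplified NumPy illustration: single head, identity Q/K/V projections, no CLS token, and no residual connections, all of which the real model has):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Batched single-head self-attention with identity Q/K/V (illustrative).
    # x: (B, L, D) -> (B, L, D)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_space_time_attention(tokens, T, N):
    # tokens: (T*N, D) patch tokens for T frames of N patches each
    D = tokens.shape[-1]
    x = tokens.reshape(T, N, D)
    # Time attention: each spatial location attends across the T frames.
    xt = attention(x.transpose(1, 0, 2))      # (N, T, D)
    x = xt.transpose(1, 0, 2)                 # (T, N, D)
    # Space attention: each frame's N patches attend within the frame.
    x = attention(x)                          # (T, N, D)
    return x.reshape(T * N, D)

tokens = np.random.default_rng(0).standard_normal((8 * 196, 64))
out = divided_space_time_attention(tokens, T=8, N=196)
print(out.shape)                              # (1568, 64)
```

The key point is the reshape: the same token tensor is viewed as N sequences of length T for the temporal step, then as T sequences of length N for the spatial step.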
6. ◼Datasets
• Kinetics-400 (K400) [Kay+, arXiv2017]
• Kinetics-600 (K600) [Carreira+, arXiv2018]
• Something-Something-v2 (SSv2) [Goyal+, ICCV2017]
• Diving-48 [Li+, ECCV2018]
◼Input clips
• Resolution 224 × 224
• 8 frames
• Sampled at a rate of 1/32
◼Models
• TimeSformer
• TimeSformer-HR
◼Pretraining
• ImageNet-21k (I21K)
• ImageNet-1k (I1K)
◼Training
• 15 epochs
• Optimizer: SGD
• Momentum 0.9
• Weight decay 0.0001
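One SGD update with the momentum and weight decay listed above looks like this (a minimal sketch; the learning rate here is an arbitrary illustrative value, not the paper's schedule):

```python
import numpy as np

lr, momentum, weight_decay = 0.005, 0.9, 1e-4   # lr is illustrative only

w = np.ones(3)                  # toy parameter vector
v = np.zeros_like(w)            # momentum buffer
grad = np.array([0.1, -0.2, 0.3])

grad = grad + weight_decay * w  # L2-style weight decay folded into the gradient
v = momentum * v + grad         # momentum accumulation
w = w - lr * v                  # parameter update
print(w)
```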
7. ◼Experiments
1. Analysis of Self-Attention Schemes
2. Comparison to 3D CNNs
3. Varying the Number of Tokens
4. The Importance of Positional Embeddings
5. Comparison to the State-of-the-Art
8. 1. Analysis of Self-Attention Schemes
✓Compare the five Self-Attention designs
• Space Attention (S)
• Joint Space-Time Attention (ST)
• Divided Space-Time Attention (S+T)
• Sparse Local Global Attention (L+G)
• Axial Attention (T+W+H)
✓Compare ST and S+T while scaling the input
• Resolution: 224, 336, 448, 560
• Frames: 8, 32, 64, 96
◼Setup
• Datasets: K400, SSv2
• Pretrained on I21K
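Why divided attention scales better than joint attention can be seen by counting attended query-key pairs per layer (a rough sketch; real cost also depends on heads and embedding dims):

```python
def joint_pairs(T, N):
    # Joint Space-Time (ST): every one of the T*N tokens attends to all tokens.
    L = T * N
    return L * L

def divided_pairs(T, N):
    # Divided (S+T): time step has N sequences of length T,
    # space step has T sequences of length N -> T*N*(T + N) pairs.
    return N * T * T + T * N * N

T, N = 8, 14 * 14                 # default: 8 frames, 14x14 patch grid
print(joint_pairs(T, N))          # 2458624
print(divided_pairs(T, N))        # 319872
```

The gap widens quickly as resolution or clip length grows, which is why only S+T remains practical at 560-pixel or 96-frame inputs.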
10. 2. Comparison to 3D CNNs
✓Compare against 3D CNNs in terms of
• Number of parameters
• Inference cost
• Training time
• Accuracy
• Effect of pretraining
• I21K vs. I1K
◼Models
• TimeSformer
• I3D R50 [Wang+, CVPR2018]
• SlowFast R50 [Feichtenhofer+, ICCV2019]
◼Dataset
• K400
✓Effect of the pretraining dataset
• I21K vs. I1K
◼TimeSformer variants (frames, resolution)
• TimeSformer
• 8 frames at 224 × 224
• TimeSformer-HR
• 16 frames at 448 × 448
• TimeSformer-L
• 96 frames at 224 × 224
◼Datasets
• K400, SSv2
12. 3. Varying the Number of Tokens
✓Vary the input size
• Resolution: 224 (default), 336, 448, 560
• Frames: 8 (default), 32, 64, 96
◼Patch size
• 16 × 16
◼Resulting token grid (frames × H patches × W patches):

Frames |     224      |     336      |     448      |     560
   8   | 8 × 14 × 14  | 8 × 21 × 21  | 8 × 28 × 28  | 8 × 35 × 35
  32   | 32 × 14 × 14 | 32 × 21 × 21 | 32 × 28 × 28 | 32 × 35 × 35
  64   | 64 × 14 × 14 | 64 × 21 × 21 | 64 × 28 × 28 | 64 × 35 × 35
  96   | 96 × 14 × 14 | 96 × 21 × 21 | 96 × 28 × 28 | 96 × 35 × 35
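The table entries follow directly from the 16 × 16 patch size, which a few lines of Python reproduce:

```python
def token_grid(frames, resolution, patch=16):
    """Token grid (T, H', W') for square input frames and square patches."""
    side = resolution // patch
    return (frames, side, side)

def num_tokens(frames, resolution, patch=16):
    t, h, w = token_grid(frames, resolution, patch)
    return t * h * w

print(token_grid(8, 224))    # (8, 14, 14)  -- the default setting
print(num_tokens(96, 560))   # 96 * 35 * 35 = 117600 tokens at the largest input
```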
14. The Importance of Positional Embeddings
◼Compare variants of the positional embedding
• No positional embedding
• Space-only positional embedding
• Space-time positional embedding
◼Setup
• Datasets: K400, SSv2
• Pretrained on I21K
(Figure: TimeSformer architecture diagram, as on slide 4: 2D-Conv embedding, Transformer Encoder with Time Attention and Space Attention, residual (+) connections)
16. Comparison to the State-of-the-Art
✓Compared SOTA methods
• R(2+1)D [Tran+, arXiv2018]
• bLVNet [Fan+, 2019]
• TSM [Lin+, ICCV2019]
• S3D-G [Xie+, ECCV2018]
• Oct-I3D+NL [Chen+, ICCV2019]
• D3D [Stroud+, WACV2020]
• I3D+NL [Wang+, CVPR2018]
• Ip-CSN-152 [Tran+, ICCV2019]
• CorrNet [Wang+, CVPR2020]
• LGD-3D-101 [Qiu+, CVPR2019]
• SlowFast [Feichtenhofer+, ICCV2019]
• X3D-XXL [Feichtenhofer+, CVPR2020]
◼Setup
• Two benchmark groups:
1. K400, K600
2. SSv2, Div48
• Pretraining
• I21K
◼Metrics
• Top-1 accuracy, Top-5 accuracy, TFLOPs
18. ◼Summary: TimeSformer, a Transformer for video classification
• Video classification without convolutions
• Built entirely on Self-Attention
• Uses Divided Space-Time Attention
◼Results
• Competitive with 3D CNNs at lower cost
• Achieves SOTA accuracy
• Scales to larger inputs
◼Experiments covered
• Self-Attention scheme comparison
• Comparison to 3D CNNs
• Varying the number of tokens
• Positional embeddings