論文紹介：Is Space-Time Attention All You Need for Video Understanding?


Gedas Bertasius, Heng Wang, Lorenzo Torresani, "Is Space-Time Attention All You Need for Video Understanding?" ICML2021 https://proceedings.mlr.press/v139/bertasius21a.html


Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius, Heng Wang, Lorenzo Torresani, ICML2021
2023/5/11

◼Transformer : TimeSformer
•
• Self-Attention
◼Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021]
• Transformer
• Embedding
•
• Transformer Encoder
• Self-Attention
• MLP
• Head

◼ViViT [Arnab+, ICCV2021]
• Embedding 3D Conv
◼ (TimeSformer)
• ViT 2D Conv
[Figure: ViViT (Embedding: 3D Conv, then Transformer Encoder) compared with TimeSformer (Embedding: 2D Conv, then Transformer Encoder); panel labels: Attention, Self-Attention]
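The per-frame (2D) embedding mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes single-channel frames and relies on the fact that a 2D Conv whose kernel size and stride both equal the patch size is equivalent (bias aside) to a linear projection of each flattened patch. The function names and the `weight` values are hypothetical.

```python
def patchify_frame(frame, patch):
    """Split one H x W frame (a list of rows) into flattened patch x patch patches."""
    height, width = len(frame), len(frame[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [frame[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            patches.append(flat)
    return patches


def embed_video_per_frame(video, patch, weight):
    """Project every flattened patch of every frame with `weight`.

    `weight` has shape (patch * patch, dim). It stands in for a 2D Conv
    whose kernel size and stride both equal the patch size (assumption:
    one input channel, no bias).
    """
    dim = len(weight[0])
    tokens = []
    for frame in video:
        for flat in patchify_frame(frame, patch):
            tokens.append([sum(flat[i] * weight[i][d] for i in range(len(flat)))
                           for d in range(dim)])
    return tokens


# Toy example: 2 frames of 4 x 4 pixels, 2 x 2 patches, embedding dim 3.
video = [[[float(f * 16 + y * 4 + x) for x in range(4)] for y in range(4)]
         for f in range(2)]
weight = [[1.0, 0.0, 0.5] for _ in range(4)]  # hypothetical projection weights
tokens = embed_video_per_frame(video, patch=2, weight=weight)
print(len(tokens), len(tokens[0]))  # one token per patch per frame
```

Each frame is tokenized independently, so the number of tokens grows linearly with the number of frames.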

◼TimeSformer
•
• 2D Conv
• Time Attention, Space Attention
• Attention
•
[Figure: TimeSformer architecture. Embedding: 2D Conv, followed by a Transformer Encoder (Time Attention, then Space Attention) × 12]
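To make the Time Attention / Space Attention split concrete, the sketch below lists which tokens a query token attends to in each step. It is an illustration under simplifying assumptions: tokens are indexed as (frame, patch), the classification token is ignored, and all function names are hypothetical.

```python
def time_keys(t, p, num_frames, num_patches):
    """Time Attention: the same patch position p in every frame."""
    return [(tt, p) for tt in range(num_frames)]


def space_keys(t, p, num_frames, num_patches):
    """Space Attention: every patch position within the same frame t."""
    return [(t, pp) for pp in range(num_patches)]


def reachable_after_block(t, p, num_frames, num_patches):
    """Input tokens that can influence token (t, p) after one block.

    Space Attention at (t, p) reads the Time Attention outputs at (t, pp),
    and each of those outputs has already read the inputs at (tt, pp).
    """
    reached = set()
    for (tt, pp) in space_keys(t, p, num_frames, num_patches):
        reached.update(time_keys(tt, pp, num_frames, num_patches))
    return reached


F, N = 8, 196  # 8 frames, 14 x 14 patches per frame
print(len(time_keys(0, 0, F, N)))              # keys per query in Time Attention
print(len(space_keys(0, 0, F, N)))             # keys per query in Space Attention
print(len(reachable_after_block(0, 0, F, N)))  # tokens reachable after one block
```

Although each step only looks along one axis, the two steps together let every token draw on the whole clip within a single block.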

Self-Attention Architectures
◼ Self-Attention
• Space Attention (S)
• Attn
• Joint Space-Time Attention (ST)
• Attn
• Divided Space-Time Attention (S+T)
• Attn
• Sparse Local Global Attention (L+G)
• Attn
• Axial Attention (T+W+H)
• Attn
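The cost difference between the first three schemes can be illustrated by counting the query-key comparisons per patch. The counts below follow the figures given in the paper, with N patches per frame, F frames, and the classification token included; the function name is hypothetical, and the L+G and T+W+H schemes are left out because their counts depend on details not listed here.

```python
def comparisons_per_patch(num_patches, num_frames):
    """Query-key comparisons per patch, classification token included."""
    return {
        "S": num_patches + 1,                      # Space Attention only
        "ST": num_patches * num_frames + 1,        # Joint Space-Time Attention
        "S+T": (num_frames + 1) + (num_patches + 1),  # Time step, then Space step
    }


# 8 frames of 224 x 224 pixels with 16 x 16 patches -> 196 patches per frame.
counts = comparisons_per_patch(num_patches=196, num_frames=8)
for scheme, value in counts.items():
    print(scheme, value)
```

With these settings the divided scheme needs roughly as many comparisons as Space Attention alone, while joint attention grows with the product of patches and frames.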

◼Datasets
• Kinetics-400 (K400) [Kay+, arXiv2017]
• Kinetics-600 (K600) [Carreira+, arXiv2018]
• Something-Something-v2 (SSv2) [Goyal+, ICCV2017]
• Diving-48 [Li+, ECCV2018]
◼Input clips
• Resolution: 224 × 224
• Number of frames: 8
• Frame sampling rate: 1/32
◼Models
• TimeSformer
• TimeSformer-HR
◼Pretraining
• ImageNet-21k (I21K)
• ImageNet-1k (I1K)
◼Training
• Epochs: 15
• Optimizer: SGD
• Momentum: 0.9
• Weight decay: 0.0001
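As a rough sketch of what the optimizer settings above mean, the following shows one SGD update with momentum 0.9 and weight decay 0.0001. The update rule (weight decay added to the gradient, then a momentum buffer) is one common convention and is an assumption here, as are the learning rate of 0.005 and the toy parameter values; none of these come from the slides.

```python
def sgd_momentum_step(params, grads, velocity, lr,
                      momentum=0.9, weight_decay=0.0001):
    """One SGD step with momentum and L2-style weight decay.

    Assumed convention: g <- g + wd * w; v <- mu * v + g; w <- w - lr * v.
    """
    new_params, new_velocity = [], []
    for w, g, v in zip(params, grads, velocity):
        g = g + weight_decay * w   # weight decay folded into the gradient
        v = momentum * v + g       # momentum buffer
        new_velocity.append(v)
        new_params.append(w - lr * v)
    return new_params, new_velocity


# Toy example; lr=0.005 is an assumed value, not taken from the slides.
params, velocity = [1.0], [0.0]
params, velocity = sgd_momentum_step(params, [0.5], velocity, lr=0.005)
print(params, velocity)
```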

1. Analysis of Self-Attention Schemes
2. Comparison to 3D CNNs
3. Varying the Number of Tokens
4. The Importance of Positional Embeddings
5. Comparison to the State-of-the-Art

1. Analysis of Self-Attention Schemes
✓Self-Attention
• Space Attention (S)
• Joint Space-Time Attention (ST)
• Divided Space-Time Attention (S+T)
• Sparse Local Global Attention (L+G)
• Axial Attention (T+W+H)
✓ST vs. S+T
• Spatial resolution: 224, 336, 448, 560
• Number of frames: 8, 32, 64, 96
◼
• K400, SSv2
• I21K

◼Self-Attention
• Divided Space-Time
• Space Time Attention
◼ST vs. S+T
• S+T (Divided)

2. Comparison to 3D CNNs
✓3D CNN
•
•
•
•
•
• I21K, I1K
◼
• TimeSformer
• I3D R50 [Wang+, CVPR2018]
• SlowFast R50 [Feichtenhofer+, ICCV2019]
◼
• K400
✓
• I21K vs. I1K
◼
• TimeSformer
• 8 × 224 × 224
• TimeSformer-HR
• 16 × 448 × 448
• TimeSformer-L
• 96 × 224 × 224
◼
• K400, SSv2

◼3D CNN
• TimeSformer
• TimeSformer
• I21K
◼
• TimeSformer
I21K

3. Varying the Number of Tokens
✓
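• Spatial resolution: 224 (default), 336, 448, 560
• Number of frames: 8 (default), 32, 64, 96
◼
• Patch size: 16 × 16
Token grid (frames × patches × patches) by number of frames (rows) and spatial resolution (columns):
Frames   224            336            448            560
8        8 × 14 × 14    8 × 21 × 21    8 × 28 × 28    8 × 35 × 35
32       32 × 14 × 14   32 × 21 × 21   32 × 28 × 28   32 × 35 × 35
64       64 × 14 × 14   64 × 21 × 21   64 × 28 × 28   64 × 35 × 35
96       96 × 14 × 14   96 × 21 × 21   96 × 28 × 28   96 × 35 × 35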
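The token grid above follows directly from the 16 × 16 patch size: each side of the frame is divided by 16, and the result is repeated for every frame. A minimal sketch (function names are hypothetical; the classification token is not counted):

```python
def token_grid(num_frames, resolution, patch=16):
    """Token grid as (frames, patches high, patches wide)."""
    side = resolution // patch
    return (num_frames, side, side)


def num_tokens(num_frames, resolution, patch=16):
    """Total number of patch tokens in one clip."""
    frames, high, wide = token_grid(num_frames, resolution, patch)
    return frames * high * wide


for frames in (8, 32, 64, 96):
    row = [num_tokens(frames, res) for res in (224, 336, 448, 560)]
    print(frames, row)
```

The default setting (8 frames at 224 × 224) gives 8 × 14 × 14 = 1568 tokens, and the largest cell in the grid (96 frames at 560 × 560) gives 96 × 35 × 35 = 117600.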

4. The Importance of Positional Embeddings
◼
•
•
•
•
◼
• K400, SSv2
• I21K
[Figure: TimeSformer architecture. Embedding: 2D Conv, followed by a Transformer Encoder (Time Attention, Space Attention)]

5. Comparison to the State-of-the-Art
✓SOTA
• R(2+1)D [Tran+, arXiv2018]
• bLVNet [Fan+, 2019]
• TSM [Lin+, ICCV2019]
• S3D-G [Xie+, ECCV2018]
• Oct-I3D+NL [Chen+, ICCV2019]
• D3D [Stroud+, WACV2020]
• I3D+NL [Wang+, CVPR2018]
• Ip-CSN-152 [Tran+, ICCV2019]
• CorrNet [Wang+, CVPR2020]
• LGD-3D-101 [Qiu+, CVPR2019]
• SlowFast [Feichtenhofer+, ICCV2019]
• X3D-XXL [Feichtenhofer+, CVPR2020]
◼
•
1. K400, K600
2. SSv2, Div48
•
• I21K
◼
• Top1, top5, TFLOPs
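For reference, the Top1 and top5 metrics are instances of top-k accuracy: a prediction counts as correct when the true class is among the k highest-scoring classes. The sketch below is a generic illustration with made-up scores, not the evaluation code used in the paper; it uses k = 2 instead of 5 because the toy example has only three classes.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Made-up class scores for three samples and three classes.
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
print(top1, top2)
```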

◼Transformer : TimeSformer
•
• Self-Attention
• Divided Space-Time Attention
◼
•
• SOTA
•
◼
• Self-Attention
• 3D CNN
• Token
• Positional embedding
