Paper introduction: Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

AJ Piergiovanni, Weicheng Kuo, Anelia Angelova, "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning" arXiv2022 https://arxiv.org/abs/2212.03229

Rethinking Video ViTs:
Sparse Video Tubes
for Joint Image and Video Learning
AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
arXiv2022
2023/6/8

◼Vision Transformer (ViT) [Dosovitskiy+, ICLR 2021]
•
•
[Figure labels: ViT, Conv2D, Conv3D, ..., ViViT [Arnab+, ICCV2021]]

◼ViViT [Arnab+, ICCV2021]
• ViT
• 3D
•
◼
•
• 3D 2D
◼
•
... ...

◼ViT TubeViT
1.
• 2D
• 3D
•
•
2. 3D
•
3.
• Video
•
•

1.
◼2D
• (1, 𝐻, 𝑊) (𝑇𝑠, 𝐻, 𝑊) 2D Conv
• 𝑇𝑠
◼3D
• (𝑇, 𝐻, 𝑊) (𝑇𝑠, 𝐻𝑠, 𝑊𝑠) (𝑡, 𝑥, 𝑦) 3D Conv
•
• 3D
◼
•
•
•

2.
◼
•
•
◼
•
◼
• (𝑡, 𝑥, 𝑦)
• sine, cosine
• 1 6
• 2D 3D
◼
•

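2.
[Figure: a token $z_i \in \mathbb{R}^d$ produced by the 3D Conv at position $(t, x, y)$ is combined (+) with a $d$-dimensional position embedding. Embedding dimensions 1 to 6 hold the $j = 1$ terms $\sin(t\,w_1)$, $\cos(t\,w_1)$, $\sin(x\,w_1)$, $\cos(x\,w_1)$, $\sin(y\,w_1)$, $\cos(y\,w_1)$; dimensions 7 to 12 hold the $j = 2$ terms, and so on up to $d$.]
• $\tau = 10000$
• $j = 1, \dots, d/6$
• $(t, x, y)$
$w_j = 1/\tau^{j}$
$p_{i,t} = [\sin(t\,w_j),\ \cos(t\,w_j)]$
$p_{i,x} = [\sin(x\,w_j),\ \cos(x\,w_j)]$
$p_{i,y} = [\sin(y\,w_j),\ \cos(y\,w_j)]$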
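The fixed sine/cosine embedding on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `tube_position_embedding` is hypothetical, the frequency is read from the slide as w_j = 1/τ^j (an assumption about the flattened notation), and the per-frequency ordering (t, then x, then y, sine before cosine) is also an assumption.

```python
import math

TAU = 10000.0  # constant given on the slide


def tube_position_embedding(t, x, y, d, tau=TAU):
    """Return a length-d list of sin/cos features for a tube at (t, x, y).

    Assumed layout: for each j = 1 .. d/6, six consecutive entries
    [sin(t*w), cos(t*w), sin(x*w), cos(x*w), sin(y*w), cos(y*w)]
    with w = 1 / tau**j.
    """
    if d % 6 != 0:
        raise ValueError("d must be divisible by 6")
    emb = []
    for j in range(1, d // 6 + 1):
        w = 1.0 / tau ** j  # frequency for this group of six dimensions
        for coord in (t, x, y):
            emb.extend([math.sin(coord * w), math.cos(coord * w)])
    return emb


# A tube at the origin gets sin(0) = 0 and cos(0) = 1 in every pair.
origin = tube_position_embedding(0, 0, 0, d=12)
```

Because the embedding depends only on the (t, x, y) position and not on the tube shape, the same function could serve tubes of any size; the result would be added elementwise to the token, as the "+" in the slide's figure suggests.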

SOTA
◼
•
•
•
• ViT-B, L, H
◼
•
• ImageNet-1k [Deng+, CVPR2009]
•
• Kinetics400, 600, 700 (K400, 600, 700) [Kay+, arXiv2017, Carreira+, arXiv2018, Carreira+, arXiv2019]
• Something Something V2 (SSv2) [Goyal+, ICCV2017]
◼
• 256
•
• K400, 600, 700 64
• SSv2 32
• 300000
• Adam
•
• K400, 600, 700 5e-5
• SSv2 2e-5
◼
• val top-1, top-5 accuracy

◼Video Tube
•
◼ViT-B, L, H
Tube   size (T, H, W)   stride (T, H, W)   offset (t, x, y)
1      (8, 8, 8)        (16, 32, 32)       (0, 0, 0)
2      (16, 4, 4)       (6, 32, 32)        (4, 8, 8)
3      (4, 12, 12)      (16, 32, 32)       (0, 16, 16)
4      (1, 16, 16)      (32, 16, 16)       (0, 0, 0)
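To get a feel for how sparse this tube configuration is, the sketch below counts the tokens each tube would produce. It rests on assumptions not stated in the slides: a 32×224×224 input clip, unpadded strided convolution, and an offset that simply shifts where the first kernel starts. The helper names are hypothetical.

```python
from math import prod

# (size, stride, offset) per tube, each ordered (T, H, W), taken from the slide's tube table.
TUBES = [
    ((8, 8, 8), (16, 32, 32), (0, 0, 0)),
    ((16, 4, 4), (6, 32, 32), (4, 8, 8)),
    ((4, 12, 12), (16, 32, 32), (0, 16, 16)),
    ((1, 16, 16), (32, 16, 16), (0, 0, 0)),
]


def tokens_per_tube(video_shape, size, stride, offset):
    """Count kernel positions of an unpadded strided conv that starts at `offset`."""
    counts = []
    for extent, k, s, o in zip(video_shape, size, stride, offset):
        usable = extent - o  # length left after skipping the offset
        counts.append(0 if usable < k else (usable - k) // s + 1)
    return prod(counts)


def count_tokens(video_shape, tubes=TUBES):
    """Token count for every tube in the configuration."""
    return [tokens_per_tube(video_shape, *tube) for tube in tubes]


VIDEO_SHAPE = (32, 224, 224)  # assumed clip size (T, H, W), not from the slides
per_tube = count_tokens(VIDEO_SHAPE)
total = sum(per_tube)

# Dense per-frame 16x16 patching of the same clip, for comparison.
dense = tokens_per_tube(VIDEO_SHAPE, (1, 16, 16), (1, 16, 16), (0, 0, 0))
```

Under these assumptions the four tubes give 98, 147, 98, and 196 tokens (539 in total), against 6272 for dense per-frame 16×16 patches. A different padding or offset convention would change the exact numbers.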

◼
• Kinetics600
• ImageNet-1k Kinetics600
• ImageNet-1k Kinetics600
• 2D ImageNet-1k Kinetics600
• ImageNet-1k 3D Kinetics600
◼
• ViT-L
◼
•
• head
• head+ 4
• head+ 8
◼
• ViT-H
•
• ImageNet pretrained weight
◼
• Kinetics600

◼
• ImageNet-1k Kinetics600
◼
•
• Head+ 8
48%

◼ViT TubeViT
• 3D
•
•
◼
• Kinetics400, 600, 700, SSv2 SOTA
• Token
