2022-04-21
Sangmin Woo
Computational Intelligence Lab.
School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
Video Transformers
VTN: Video Transformer Network
Spatial Backbone (ViT/ResNet/DeiT) + Temporal Transformer (Longformer)
Neimark, Daniel, et al. "Video Transformer Network." arXiv, Feb. 2021.
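The factorized design can be sketched in a few lines of NumPy. This is a toy, not the actual VTN implementation: it uses identity Q/K/V projections, a single head, and mean-pooling in place of a learned [CLS] token, but it shows the two-stage data flow (per-frame spatial attention, then temporal attention over one token per frame).

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x):
    """Minimal single-head self-attention over the first axis of x (n, d)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # identity Q/K/V keeps the sketch short
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# VTN-style two-stage pipeline (shapes only; the real model uses ViT + Longformer):
F, P, D = 8, 196, 64                        # frames, patches per frame, embed dim
video = rng.standard_normal((F, P, D))

# Stage 1: the spatial encoder runs per frame; keep one token per frame
# (mean-pooling here stands in for a [CLS] token).
frame_tokens = np.stack([self_attention(video[f]).mean(axis=0) for f in range(F)])

# Stage 2: the temporal transformer attends across the F frame tokens.
clip_feature = self_attention(frame_tokens).mean(axis=0)
print(clip_feature.shape)  # (64,)
```

The key point is that attention never mixes space and time in one step: stage 1 sees P tokens at a time, stage 2 sees F tokens.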
TimeSformer
Attention schemes ranked by accuracy: Axial (T+W+H) < Local-Global < Space < Joint Space-Time < Space + Time
Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?" arXiv, Feb. 2021.
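The winning "Space + Time" (divided) attention can be illustrated with a NumPy toy: within a block, each patch first attends along the time axis (same spatial location across all frames), then within its own frame over space. Identity projections, a single head, and no MLP/residuals; shapes are illustrative.

```python
import numpy as np

F, P, D = 4, 9, 8  # frames, patches per frame, embed dim (toy sizes)
x = np.random.default_rng(1).standard_normal((F, P, D))

def attend(x):
    """Minimal softmax self-attention over the first axis of x (n, d)."""
    s = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

# Step 1: temporal attention, patch p attends over the F frames at location p.
time_out = np.stack([attend(x[:, p]) for p in range(P)], axis=1)   # (F, P, D)

# Step 2: spatial attention, frame f attends over its own P patches.
space_out = np.stack([attend(time_out[f]) for f in range(F)])      # (F, P, D)
```

Each token thus attends over F + P tokens per block rather than the F x P tokens of joint space-time attention.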
STAM: Space Time Attention Model
Spatial Encoder (ViT-B, 12 layers) + Temporal Transformer (6 layers, 8 heads)
Sharir, Gilad, et al. "An Image is Worth 16x16 Words, What is a Video Worth?" arXiv, Mar. 2021.
ViViT: Video Vision Transformer
1. Joint Space-Time (best on Kinetics-400)
2. Factorized Encoder (best on Epic Kitchens; Space -> Time)
3. Factorized Self-Attention (Space-Time -> Space-Time)
4. Factorized Dot-Product (half of the heads attend over space, half over time)
Arnab, Anurag, et al. "ViViT: A Video Vision Transformer." arXiv, Mar. 2021.
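ViViT's tubelet embedding can be sketched with pure reshapes: the video is split into non-overlapping t x h x w blocks and each block becomes one token. Sizes below are illustrative, and a real model follows this with a learned linear projection (equivalently a strided 3D convolution) to the embedding dimension.

```python
import numpy as np

# Toy video: T frames of H x W x C pixels; tubelet size t x h x w.
T, H, W, C = 8, 32, 32, 3
t, h, w = 2, 16, 16
video = np.random.default_rng(2).standard_normal((T, H, W, C))

# Split each axis into (blocks, within-block), group the block axes together,
# then flatten each t*h*w*C block into a single token.
tubelets = (video
            .reshape(T // t, t, H // h, h, W // w, w, C)
            .transpose(0, 2, 4, 1, 3, 5, 6)
            .reshape(-1, t * h * w * C))
print(tubelets.shape)  # (16, 1536): (T/t * H/h * W/w) tokens
```

Compared with per-frame 2D patches, tubelets fuse short-range temporal information already at the embedding stage, which is why frames-per-tubelet trades off against the number of tokens (the 32-4 vs. 64-8 vs. 128-16 comparison later in the deck).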
VidTr: Video Transformer
Space < Joint < Time + Space ≈ Space + Time
Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv, Apr. 2021.
Takeaways from Recent Transformers
Efficient Transformers
- Deformable DETR [1]
- DeiT: Data-efficient image Transformers; distillation token [3]
Video Transformers
- Spatial Encoder* + Temporal Transformer [4, 6] (* ResNet / ViT [2] / DeiT [3])
- Axial < Local-Global < Space < Joint Space-Time < Space + Time [5]
- Joint Space-Time ≈ Space + Time; tubelet embedding [7]
- Space < Joint < Time + Space ≈ Space + Time [8]
- Video Swin Transformer; shifted window [9]
[1] Zhu, Xizhou, et al. "Deformable DETR: Deformable Transformers for End-to-End Object Detection." ICLR 2021.
[2] Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021.
[3] Touvron, Hugo, et al. "Training Data-efficient Image Transformers & Distillation through Attention." ICML 2021.
[4] Neimark, Daniel, et al. "Video Transformer Network." arXiv, Feb. 2021.
[5] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?" arXiv, Feb. 2021.
[6] Sharir, Gilad, et al. "An Image is Worth 16x16 Words, What is a Video Worth?" arXiv, Mar. 2021.
[7] Arnab, Anurag, et al. "ViViT: A Video Vision Transformer." arXiv, Mar. 2021.
[8] Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv, Apr. 2021.
[9] Liu, Ze, et al. "Video Swin Transformer." arXiv, Jun. 2021.
Common Practice
During inference, due to the added temporal dimension, 3D networks are limited by memory and runtime to clips of a small spatial scale and a low number of frames.
In I3D, the authors use the whole video during inference, averaging predictions temporally.
More recent studies that achieved state-of-the-art results process numerous, but relatively short, clips during inference.
In Non-local, inference is done by sampling 10 clips evenly from the full-length video and averaging the softmax scores to obtain the final prediction.
SlowFast follows the same practice and introduces the term "view" – a temporal clip with a spatial crop. SlowFast uses 10 temporal clips with 3 spatial crops at inference time; thus, 30 different views are averaged for the final prediction.
X3D follows the same practice but, in addition, uses larger spatial scales to achieve its best results on 30 different views.
This common practice of multi-view inference is somewhat counterintuitive, especially when handling long videos. A more intuitive way is to "look" at the entire video context before deciding on the action, rather than viewing only small portions of it.
[1] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?" arXiv, Feb. 2021.
[2] Neimark, Daniel, et al. "Video Transformer Network." arXiv, Feb. 2021.
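The multi-view protocol itself is simple: run the model once per view, average the per-view softmax scores, then take the argmax. A plain-Python toy (the logits, class count, and helper names are made up for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def multi_view_predict(view_logits):
    """Average per-view softmax scores (e.g., SlowFast: 10 clips x 3 crops
    = 30 views), then return the argmax class index."""
    probs = [softmax(l) for l in view_logits]
    n = len(probs)
    avg = [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]
    return max(range(len(avg)), key=avg.__getitem__)

# Three toy views over three classes; two views favor class 0, one favors class 1.
views = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2], [0.3, 2.2, 0.1]]
print(multi_view_predict(views))  # -> 0
```

Averaging probabilities (rather than logits or hard votes) is what the papers above describe; it lets confident views outweigh ambiguous ones.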
Design Choices
TimeSformer [1]
- 16 or 32 frames / clip (2.56 or 5.12 seconds)
- Spatial dimension: 224x224
- Multi-view inference: sample 10 clips uniformly & crop 3 views per clip -> avg. softmax scores of 30 predictions (i.e., 10x3 views)
- Full-video inference: read all frames & sub-sample to 250 frames uniformly
VTN (Video Transformer Network) [2]
- 8 frames / clip (frame sample rate: 1/16)
- Spatial dimension: 224x224
- Patch size: 16x16
- Inference: sample center clip & crop 3 views -> avg. softmax scores of 3 predictions
STAM (Space Time Attention Model) [3]
- Frames/video: 16 (79.3) < 32 (79.9) < 64 (80.5)
- Spatial dimension: 224x224
[1] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?" arXiv, Feb. 2021.
[2] Neimark, Daniel, et al. "Video Transformer Network." arXiv, Feb. 2021.
[3] Sharir, Gilad, et al. "An Image is Worth 16x16 Words, What is a Video Worth?" arXiv, Mar. 2021.
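The uniform sampling that recurs in these protocols ("sample 10 clips uniformly", "sub-sample to 250 frames uniformly") amounts to picking evenly spaced indices over the video. A hypothetical helper, not any paper's exact code (implementations differ in how they center or round the indices):

```python
def uniform_sample(num_frames, n):
    """Pick n frame indices evenly spaced over [0, num_frames),
    taking the midpoint of each of the n equal segments."""
    step = num_frames / n
    return [min(int(step * i + step / 2), num_frames - 1) for i in range(n)]

print(uniform_sample(100, 10))  # [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
```

The same midpoint trick also yields the temporal start of each clip when sampling 10 clips for multi-view inference.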
Design Choices
ViViT [1]
- 32 frames / clip (stride: 2; sampling 8 frames; tubelet length: 4)
- Crop size ∈ [224, 320]: 224 (80.3/58.9) vs. 288 (80.7/147.6) vs. 320 (81.0/238.8)
- Frames-tubelet length: 32-4 > 64-8 > 128-16
- Inference: 4 views
VidTr [2]
- Cubic patch (4x16x16) vs. square patch (1x16x16): cubic (73.1) < square (75.5)
- Patch size 16x16 vs. 32x32: 16 (77.7) > 32 (71.2)
- Inference: 10x3 views
Video Swin Transformer [3]
- 32 frames / clip (temporal stride: 2)
- Spatial dimension: 224x224
- 3D token size: 16x56x56
- Inference: 4x3 views
[1] Arnab, Anurag, et al. "ViViT: A Video Vision Transformer." arXiv, Mar. 2021.
[2] Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv, Apr. 2021.
[3] Liu, Ze, et al. "Video Swin Transformer." arXiv, Jun. 2021.
How to Design Video Transformer?
Factorized Space-Time Transformers (Space first -> Time next)
- Most Video Transformers follow this approach
- 2-step: Spatial Transformer + Temporal Transformer
- Spatial Backbone: ResNet-101? ViT?
- Input: image patches -> Spatial Trans. -> output feat. -> Temporal Trans.
- Number of frames: 16 or 32
- Computationally tractable (P: #patches, F: #frames): P² + F²
Joint Space-Time Transformer
- 1-step: single Transformer (joint space-time attention)
- Input: flattened video patches
- Number of frames: ?
- Computationally expensive: P² × F²
DETR-style (C3D + Transformer)
- C3D backbone -> output feature -> flatten -> Transformer
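The tractability gap between factorized and joint designs can be made concrete by counting attended token pairs. A rough arithmetic sketch with illustrative values (P = 196 patches for a 224x224 image with 16x16 patches, F = 32 frames); the factorized count expands the P² + F² shorthand into F·P² spatial pairs plus P·F² temporal pairs:

```python
# Attention cost grows with the square of the sequence length.
P, F = 196, 32  # patches per frame, frames

joint = (P * F) ** 2               # joint space-time: one sequence of P*F tokens
factorized = F * P**2 + P * F**2   # per-frame spatial + per-location temporal

print(f"joint:      {joint:,}")       # 39,337,984 token pairs
print(f"factorized: {factorized:,}")  # 1,430,016 token pairs
print(f"ratio:      {joint / factorized:.0f}x")
```

At these sizes the joint design attends over roughly 28x more token pairs per layer, which is why joint space-time models are restricted to short clips while factorized ones scale to more frames.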
Thank You!
Sangmin Woo
sangminwoo.github.io
smwoo95@kaist.ac.kr
Q&A
Appendix: Vision Transformers
Sorted by date: https://github.com/DirtyHarryLYL/Transformer-in-Vision
Sorted by task: https://github.com/Yangzhangcst/Transformer-in-Computer-Vision
Sorted by venue: https://github.com/dk-liang/Awesome-Visual-Transformer
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 

Video Transformers.pptx

[3] Touvron, Hugo, et al. "Training data-efficient image transformers & distillation through attention." ICML 2021
[4] Neimark, Daniel, et al. "Video transformer network." arXiv 21.02
[5] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?" arXiv 21.02
[6] Sharir, Gilad, et al. "An Image is Worth 16x16 Words, What is a Video Worth?" arXiv 21.03
[7] Arnab, Anurag, et al. "ViViT: A video vision transformer." arXiv 21.03
[8] Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv 21.04
[9] Liu, Ze, et al. "Video swin transformer." arXiv 21.06
8
Common Practice
During inference, the added temporal dimension limits 3D networks, in memory and runtime, to clips of small spatial scale and few frames.
In I3D, the authors use the whole video during inference, averaging predictions temporally. More recent studies that achieved state-of-the-art results process numerous, but relatively short, clips during inference.
In Non-local, inference is done by sampling 10 clips evenly from the full-length video and averaging the softmax scores to obtain the final prediction. SlowFast follows the same practice and introduces the term “view” – a temporal clip with a spatial crop. SlowFast uses 10 temporal clips with 3 spatial crops at inference time; thus, 30 different views are averaged for the final prediction. X3D follows the same practice but, in addition, uses larger spatial scales to achieve its best results on 30 different views.
This common practice of multi-view inference is somewhat counterintuitive, especially when handling long videos. A more intuitive way is to “look” at the entire video context before deciding on the action, rather than viewing only small portions of it.
[1] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?" arXiv 21.02
[2] Neimark, Daniel, et al. "Video transformer network." arXiv 21.02
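The multi-view averaging described above can be sketched in a few lines. This is an illustrative sketch, not any paper's exact code; the view count (30 = 10 temporal clips x 3 spatial crops, as in SlowFast) and class count (400, Kinetics-400-sized) are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_view_predict(view_logits):
    """Average per-view softmax scores, then take the argmax.

    view_logits: (n_views, n_classes) array, one logit row per view
    (e.g. 10 temporal clips x 3 spatial crops = 30 views).
    """
    scores = softmax(view_logits, axis=-1)    # (n_views, n_classes)
    return int(scores.mean(axis=0).argmax())  # final class index

# Illustrative: 30 random "views" over 400 classes.
rng = np.random.default_rng(0)
pred = multi_view_predict(rng.normal(size=(30, 400)))
```

Averaging soft scores rather than taking a hard vote per view lets confident views outweigh uncertain ones in the final prediction.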
9
Design Choices
TimeSformer [1]
 16 or 32 frames / clip (2.56 or 5.12 seconds)
 Spatial dimension: 224x224
 Multi-view inference: sample 10 clips uniformly & crop 3 views per clip → avg. softmax scores of 30 predictions (i.e., 10x3 views)
 Full-video inference: read all frames & uniformly sub-sample to 250 frames
VTN (Video Transformer Network) [2]
 8 frames / clip (frame sample rate: 1/16)
 Spatial dimension: 224x224
 Patch size: 16x16
 Inference: sample center clip & crop 3 views → avg. softmax scores of 3 predictions
STAM (Space Time Attention Model) [3]
 16 vs. 32 vs. 64 frames / video: 16 (79.3) < 32 (79.9) < 64 (80.5)
 Spatial dimension: 224x224
[1] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?" arXiv 21.02
[2] Neimark, Daniel, et al. "Video transformer network." arXiv 21.02
[3] Sharir, Gilad, et al. "An Image is Worth 16x16 Words, What is a Video Worth?" arXiv 21.03
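TimeSformer's full-video inference above relies on uniform frame sub-sampling. A minimal sketch of one common way to do this (evenly spread indices including the first and last frame); the 1000-frame video length is a hypothetical example:

```python
import numpy as np

def uniform_sample_indices(n_total, n_sample):
    """Uniformly spread n_sample frame indices over a video of n_total
    frames, always including the first and last frame (e.g. sub-sampling
    a full-length video down to 250 frames for inference)."""
    return np.linspace(0, n_total - 1, n_sample).round().astype(int)

# Hypothetical 1000-frame video sub-sampled to 250 frames.
idx = uniform_sample_indices(1000, 250)
```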
10
Design Choices
ViViT [1]
 32 frames / clip (stride: 2; sampling 8 frames; tubelet length: 4)
 Crop size ∈ [224, 320]: 224 (80.3 / 58.9) vs. 288 (80.7 / 147.6) vs. 320 (81.0 / 238.8)
 Frames–tubelet length 32-4 vs. 64-8 vs. 128-16: 32-4 > 64-8 > 128-16
 Inference: 4 views
VidTr [2]
 Cubic patch (4x16x16) vs. square patch (1x16x16): cubic (73.1) < square (75.5)
 Patch size 16x16 vs. 32x32: 16 (77.7) > 32 (71.2)
 Inference: 10x3 views
Video Swin Transformer [3]
 32 frames / clip (temporal stride: 2)
 Spatial dimension: 224x224
 3D token size: 16x56x56
 Inference: 4x3 views
[1] Arnab, Anurag, et al. "ViViT: A video vision transformer." arXiv 21.03
[2] Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv 21.04
[3] Liu, Ze, et al. "Video swin transformer." arXiv 21.06
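ViViT's tubelet embedding splits the clip into non-overlapping t x p x p space-time blocks and linearly projects each to a token. A minimal numpy sketch, assuming non-overlapping tubelets, channels-last input, and a random matrix standing in for the learned projection (t=4, p=16 match the slide's 32-frame / tubelet-length-4 setting; d=768 is an assumed ViT-B-like token width):

```python
import numpy as np

def tubelet_embed(video, t=4, p=16, d=768, seed=0):
    """Split a video of shape (T, H, W, C) into non-overlapping
    t x p x p tubelets and project each to a d-dim token."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    # (T/t, t, H/p, p, W/p, p, C) -> (n_tubelets, t*p*p*C)
    v = video.reshape(T // t, t, H // p, p, W // p, p, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * p * p * C)
    rng = np.random.default_rng(seed)
    # Random stand-in for the learned linear projection.
    proj = rng.normal(scale=0.02, size=(t * p * p * C, d))
    return v @ proj  # (n_tubelets, d)

# 32 frames at 224x224: (32/4) * (224/16) * (224/16) = 1568 tokens.
tokens = tubelet_embed(np.zeros((32, 224, 224, 3)), t=4, p=16, d=768)
```

Compared with per-frame 2D patches, tubelets fuse short-range motion into the embedding itself, which is why a 1x16x16 "square patch" and a 4x16x16 "cubic patch" (as in the VidTr ablation above) are genuinely different design choices.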
11
How to Design a Video Transformer?
Factorized Space-Time Transformers (space first → time next)
- Most video Transformers follow this approach
- 2-step: Spatial Transformer + Temporal Transformer
- Spatial backbone: ResNet-101? ViT?
- Input: image patches → Spatial Transformer → output features → Temporal Transformer
- Number of frames: 16, 32
- Computationally tractable (P: #patches, F: #frames): P² + F²
Joint Space-Time Transformer
- 1-step: a single Transformer with joint space-time attention
- Input: flattened video patches
- Number of frames: ?
- Computationally expensive: P² × F²
DETR-style (C3D + Transformer)
- C3D backbone → output features → flatten → Transformer
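The tractable-vs-expensive contrast above can be made concrete by counting query-key pairs per video. This is a rough back-of-the-envelope model (ignoring heads, channel dimensions, and CLS tokens), not a measured cost; the slide's P² + F² vs. P² × F² is the same comparison stated per token group:

```python
def attn_cost(P, F, mode):
    """Rough pairwise-attention cost (number of query-key pairs)
    per video. P: patches per frame, F: frames."""
    if mode == "joint":       # every space-time token attends to all others
        return (P * F) ** 2
    if mode == "factorized":  # spatial attn per frame + temporal attn per patch
        return F * P ** 2 + P * F ** 2
    raise ValueError(mode)

# 14x14 = 196 patches per frame, 16 frames:
joint = attn_cost(196, 16, "joint")        # 9,834,496 pairs
fact = attn_cost(196, 16, "factorized")    # 664,832 pairs (~15x cheaper)
```

The gap widens with both P and F, which is why joint space-time attention is the first thing factored away as clips get longer.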
13
Appendix: Vision Transformers
Sorted by date: https://github.com/DirtyHarryLYL/Transformer-in-Vision
Sorted by task: https://github.com/Yangzhangcst/Transformer-in-Computer-Vision
Sorted by venue: https://github.com/dk-liang/Awesome-Visual-Transformer
