SlideShare a Scribd company logo
2022-04-21
Sangmin Woo
Computational Intelligence Lab.
School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
Video Transformers
2
Video Transformers
VTN: Video Transformer Network
Spatial Backbones (ViT/ResNet/DeiT) + Temporal Transformer
(Longformer)
Neimark, Daniel, et al. "Video transformer network." arXiv 21.02
3
Video Transformers
TimeSformer
Axial (T+W+H) < Local-Global < Space < Joint Space-Time < Space +
Time
Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?." arXiv 21.02.
4
Video Transformers
STAM: Space Time Attention Model
Spatial Encoder (ViT-B: 12-layer) + Temporal Transformer (6-layer, 8-
head)
Sharir, Gilad, et al. “An Image is Worth 16x16 Words, What is a Video Worth?” arxiv 21.03
5
Video Transformers
ViViT: Video Vision Transformer
1. Joint Space-Time (Best on Kinetics 400)
2. Factorized Encoder (Best on Epic Kitchens; Space -> Time)
3. Factorized Self-Attention (Space-Time -> Space-Time)
4. Factorized Dot-Product (Half Head: Space, Half Head: Time)
Arnab, Anurag, et al. "Vivit: A video vision transformer." arXiv 21.03.
6
Video Transformers
VidTr: Video Transformer
Space < Joint < Time + Space ≈ Space + Time
Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv 21.04
7
Takeaways from Recent Transformers
Efficient Transformers
 Deformable DETR [1]
 DeiT: Data-efficient image Transformers; Distillation Token [3]
Video Transformers
 Spatial Encoder* + Temporal Transformer [4, 6]
* ResNet / ViT [2] / DeiT [3]
 Axial < Local-Global< Space < Joint Space-Time < Space + Time
[5]
 Joint Space-Time ≈ Space + Time; Tubelet Embedding [7]
 Space < Joint < Time + Space ≈ Space + Time [8]
 Video Swin Transformer; Shifted Window [9]
[1] Zhu, Xizhou, et al. "Deformable DETR: Deformable transformers for end-to-end object detection." ICLR 2021
[2] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021
[3] Touvron, Hugo, et al. "Training data-efficient image transformers & distillation through attention." ICML 2021
[4] Neimark, Daniel, et al. "Video transformer network." arXiv 21.02
[5] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?." arXiv 21.02.
[6] Sharir, Gilad, et al. “An Image is Worth 16x16 Words, What is a Video Worth?” arxiv 21.03
[7] Arnab, Anurag, et al. "Vivit: A video vision transformer." arXiv 21.03.
[8] Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv 21.04
[9] Liu, Ze, et al. "Video swin transformer." arXiv 21.06
8
Common Practice
During inference, due to the addition of a temporal dimension, 3D networks are
limited by memory and runtime to clips of a small spatial scale and a low number
of frames.
In I3D, the authors use the whole video during inference, averaging predictions
temporally.
More recent studies that achieved state-of-the-art results processed numerous,
but relatively short, clips during inference.
In Non-local, inference is done by sampling 10 clips evenly from the full-length
video and average the softmax scores to achieve the final prediction.
SlowFast follows the same practice and introduces the term “view” – a temporal
clip with a spatial crop. SlowFast uses 10 temporal clips with 3 spatial crops at
inference time; thus, 30 different views are averaged for the final prediction.
X3D follows the same practice, but in addition, it uses larger spatial scales to
achieve its best results on 30 different views.
This common practice of multi-view inference is somewhat counterintuitive,
especially when handling long videos. A more intuitive way is to “look” at the entire
video context before deciding on the action, rather than viewing only small
portions of it.
[1] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?." arXiv 21.02.
[2] Neimark, Daniel, et al. "Video transformer network." arXiv 21.02
9
Design Choices
TimeSformer [1]
 16, 32 frames / clip (2.56 or 5.12 seconds)
 Spatial dimension: 224x224
 Multi-view inference: sample 10 clips uniformly & crop 3 views per clip →
avg. softmax scores of 30 predictions (i.e., 10x3 views)
 Full video inference: read all frames & sub-sample to 250 frames uniformly
VTN (Video Transformer Network) [2]
 8 frames / clip (frame sample rate: 1/16)
 Spatial dimension: 224x224
 Patch size: 16x16
 Inference: sample center clip & crop 3 views → avg. softmax scores of 3
predictions
STAM (Space Time Attention Model) [3]
 16 frames/video vs. 32 vs. 64: 16 (79.3) < 32 (79.9) < 64 (80.5)
 Spatial dimension: 224x224
[1] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?." arXiv 21.02.
[2] Neimark, Daniel, et al. "Video transformer network." arXiv 21.02
[3] Sharir, Gilad, et al. “An Image is Worth 16x16 Words, What is a Video Worth?” arxiv 21.03
10
Design Choices
ViViT [1]
 32 frames / clip (stride: 2; sampling 8 frames; tubelet length=4)
 Crop size ∈ [224, 320]: 224 (80.3/58.9) vs. 288 (80.7/147.6) vs. 320
(81.0/238.8)
 32 frame-tubelet length 4 vs. 64-8 vs. 128-16: 32-4 > 64-8 > 128-16
 Inference: 4 views
VidTr [2]
 Cubic patch (4x16x16) vs. Square patch (1x16x16): Cubic (73.1) < Square
(75.5)
 Patch size 16x16 vs. 32x32: 16 (77.7) > 32 (71.2)
 Inference: 10x3 views
Video Swin Transformer [3]
 32 frames / clip (temporal stride: 2)
 Spatial dimension: 224x224
 3D token size: 16x56x56
 Inference: 4x3 views
[1] Arnab, Anurag, et al. "Vivit: A video vision transformer." arXiv 21.03.
[2] Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv 21.04
[3] Liu, Ze, et al. "Video swin transformer." arXiv 21.06
11
How to Design Video Transformer?
Factorized Space-Time Transformers (Space first -> Time next)
- Most of the Video Transformers follow this approach
- 2-step: Spatial Transformer + Temporal Transformer
- Spatial Backbone: ResNet101? ViT?
- Input: Image patches -> Spatial Trans. -> Output Feat. -> Temporal Trans.
- Number of frames: 16 , 32
- Computationally tractable (P: #patches, F: #frames): P2
+ F2
Joint Space-Time Transformer
- 1-step: Single Transformer (Joint Space-Time Attention)
- Input: Flattened video patches
- Number of frames: ?
- Computationally expensive: P2
× F2
DETR-style (C3D + Transformer)
- C3D backbone -> Output Feature -> Flatten -> Transformer
Thank You!
Sangmin Woo
sangminwoo.github.i
o
smwoo95@kaist.ac.k
Q&A
13
Appendix: Vision Transformers
Sorted by Date
https://github.com/DirtyHarryLYL/Transformer-in-Vision
Sorted by Task
https://github.com/Yangzhangcst/Transformer-in-
Computer-Vision
Sorted by Venue
https://github.com/dk-liang/Awesome-Visual-Transformer

More Related Content

What's hot

[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection
[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection
[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection
Taegyun Jeon
 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential Equations
Sungchul Kim
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
Mark Chang
 
Canny edge detection
Canny edge detectionCanny edge detection
Canny edge detection
ahmedkhaledfayez
 
Single Image Super Resolution Overview
Single Image Super Resolution OverviewSingle Image Super Resolution Overview
Single Image Super Resolution Overview
LEE HOSEONG
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
Sungchul Kim
 
Exploring Methods to Improve Edge Detection with Canny Algorithm
Exploring Methods to Improve Edge Detection with Canny AlgorithmExploring Methods to Improve Edge Detection with Canny Algorithm
Exploring Methods to Improve Edge Detection with Canny Algorithm
Prasad Thakur
 
Jpeg dct
Jpeg dctJpeg dct
Jpeg dct
darshan2518
 
Super Resolution
Super ResolutionSuper Resolution
Super Resolution
alokahuti
 
MobileNet - PR044
MobileNet - PR044MobileNet - PR044
MobileNet - PR044
Jinwon Lee
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
changedaeoh
 
Swin transformer
Swin transformerSwin transformer
Swin transformer
JAEMINJEONG5
 
Image denoising
Image denoisingImage denoising
Image denoising
Abdur Rehman
 
Cnn method
Cnn methodCnn method
Cnn method
AmirSajedi1
 
Diffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesisDiffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesis
BeerenSahu
 
You Only Look Once: Unified, Real-Time Object Detection
You Only Look Once: Unified, Real-Time Object DetectionYou Only Look Once: Unified, Real-Time Object Detection
You Only Look Once: Unified, Real-Time Object Detection
DADAJONJURAKUZIEV
 
Basic Generative Adversarial Networks
Basic Generative Adversarial NetworksBasic Generative Adversarial Networks
Basic Generative Adversarial Networks
Dong Heon Cho
 
Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN)Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN)
Prakhar Rastogi
 
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation..."Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
Edge AI and Vision Alliance
 
Introduction to object detection
Introduction to object detectionIntroduction to object detection
Introduction to object detection
Brodmann17
 

What's hot (20)

[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection
[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection
[PR12] You Only Look Once (YOLO): Unified Real-Time Object Detection
 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential Equations
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
Canny edge detection
Canny edge detectionCanny edge detection
Canny edge detection
 
Single Image Super Resolution Overview
Single Image Super Resolution OverviewSingle Image Super Resolution Overview
Single Image Super Resolution Overview
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
 
Exploring Methods to Improve Edge Detection with Canny Algorithm
Exploring Methods to Improve Edge Detection with Canny AlgorithmExploring Methods to Improve Edge Detection with Canny Algorithm
Exploring Methods to Improve Edge Detection with Canny Algorithm
 
Jpeg dct
Jpeg dctJpeg dct
Jpeg dct
 
Super Resolution
Super ResolutionSuper Resolution
Super Resolution
 
MobileNet - PR044
MobileNet - PR044MobileNet - PR044
MobileNet - PR044
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
 
Swin transformer
Swin transformerSwin transformer
Swin transformer
 
Image denoising
Image denoisingImage denoising
Image denoising
 
Cnn method
Cnn methodCnn method
Cnn method
 
Diffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesisDiffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesis
 
You Only Look Once: Unified, Real-Time Object Detection
You Only Look Once: Unified, Real-Time Object DetectionYou Only Look Once: Unified, Real-Time Object Detection
You Only Look Once: Unified, Real-Time Object Detection
 
Basic Generative Adversarial Networks
Basic Generative Adversarial NetworksBasic Generative Adversarial Networks
Basic Generative Adversarial Networks
 
Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN)Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN)
 
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation..."Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
 
Introduction to object detection
Introduction to object detectionIntroduction to object detection
Introduction to object detection
 

Similar to Video Transformers.pptx

Dual Band Watermarking using 2-D DWT and 2-Level SVD for Robust Watermarking ...
Dual Band Watermarking using 2-D DWT and 2-Level SVD for Robust Watermarking ...Dual Band Watermarking using 2-D DWT and 2-Level SVD for Robust Watermarking ...
Dual Band Watermarking using 2-D DWT and 2-Level SVD for Robust Watermarking ...
International Journal of Science and Research (IJSR)
 
Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...
Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...
Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...
IJCSIS Research Publications
 
A Video Watermarking Scheme to Hinder Camcorder Piracy
A Video Watermarking Scheme to Hinder Camcorder PiracyA Video Watermarking Scheme to Hinder Camcorder Piracy
A Video Watermarking Scheme to Hinder Camcorder Piracy
IOSR Journals
 
Sem vaibhav belkhude
Sem vaibhav belkhudeSem vaibhav belkhude
Sem vaibhav belkhude
Vaibhav Belkhude
 
538 207-219
538 207-219538 207-219
538 207-219
idescitation
 
Fixation Prediction for 360° Video Streaming in Head-Mounted Virtual Reality
Fixation Prediction for 360° Video Streaming in Head-Mounted Virtual RealityFixation Prediction for 360° Video Streaming in Head-Mounted Virtual Reality
Fixation Prediction for 360° Video Streaming in Head-Mounted Virtual Reality
Wen-Chih Lo
 
Paper discussion:Video-to-Video Synthesis (NIPS 2018)
Paper discussion:Video-to-Video Synthesis (NIPS 2018)Paper discussion:Video-to-Video Synthesis (NIPS 2018)
Paper discussion:Video-to-Video Synthesis (NIPS 2018)
Motaz Sabri
 
Secured Video Watermarking Based On DWT
Secured Video Watermarking Based On DWTSecured Video Watermarking Based On DWT
Secured Video Watermarking Based On DWT
Editor IJMTER
 
Paper id 36201508
Paper id 36201508Paper id 36201508
Paper id 36201508
IJRAT
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
Vignesh V Menon
 
Machine Learning approaches at video compression
Machine Learning approaches at video compression Machine Learning approaches at video compression
Machine Learning approaches at video compression
Roberto Iacoviello
 
Hybrid Approach for Robust Digital Video Watermarking
Hybrid Approach for Robust Digital Video WatermarkingHybrid Approach for Robust Digital Video Watermarking
Hybrid Approach for Robust Digital Video Watermarking
IJSRD
 
"Demystifying Deep Neural Networks," a Presentation from BDTI
"Demystifying Deep Neural Networks," a Presentation from BDTI"Demystifying Deep Neural Networks," a Presentation from BDTI
"Demystifying Deep Neural Networks," a Presentation from BDTI
Edge AI and Vision Alliance
 
A Hybrid DWT-SVD Method for Digital Video Watermarking Using Random Frame Sel...
A Hybrid DWT-SVD Method for Digital Video Watermarking Using Random Frame Sel...A Hybrid DWT-SVD Method for Digital Video Watermarking Using Random Frame Sel...
A Hybrid DWT-SVD Method for Digital Video Watermarking Using Random Frame Sel...
researchinventy
 
C0161018
C0161018C0161018
C0161018
IOSR Journals
 
C0161018
C0161018C0161018
C0161018
IOSR Journals
 
Video Compression Algorithm Based on Frame Difference Approaches
Video Compression Algorithm Based on Frame Difference Approaches Video Compression Algorithm Based on Frame Difference Approaches
Video Compression Algorithm Based on Frame Difference Approaches
ijsc
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
Edge AI and Vision Alliance
 

Similar to Video Transformers.pptx (20)

Dual Band Watermarking using 2-D DWT and 2-Level SVD for Robust Watermarking ...
Dual Band Watermarking using 2-D DWT and 2-Level SVD for Robust Watermarking ...Dual Band Watermarking using 2-D DWT and 2-Level SVD for Robust Watermarking ...
Dual Band Watermarking using 2-D DWT and 2-Level SVD for Robust Watermarking ...
 
Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...
Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...
Fractal Compression of an AVI Video File using DWT and Particle Swarm Optimiz...
 
A Video Watermarking Scheme to Hinder Camcorder Piracy
A Video Watermarking Scheme to Hinder Camcorder PiracyA Video Watermarking Scheme to Hinder Camcorder Piracy
A Video Watermarking Scheme to Hinder Camcorder Piracy
 
Sem vaibhav belkhude
Sem vaibhav belkhudeSem vaibhav belkhude
Sem vaibhav belkhude
 
538 207-219
538 207-219538 207-219
538 207-219
 
Fixation Prediction for 360° Video Streaming in Head-Mounted Virtual Reality
Fixation Prediction for 360° Video Streaming in Head-Mounted Virtual RealityFixation Prediction for 360° Video Streaming in Head-Mounted Virtual Reality
Fixation Prediction for 360° Video Streaming in Head-Mounted Virtual Reality
 
Paper discussion:Video-to-Video Synthesis (NIPS 2018)
Paper discussion:Video-to-Video Synthesis (NIPS 2018)Paper discussion:Video-to-Video Synthesis (NIPS 2018)
Paper discussion:Video-to-Video Synthesis (NIPS 2018)
 
Secured Video Watermarking Based On DWT
Secured Video Watermarking Based On DWTSecured Video Watermarking Based On DWT
Secured Video Watermarking Based On DWT
 
Paper id 36201508
Paper id 36201508Paper id 36201508
Paper id 36201508
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
 
Machine Learning approaches at video compression
Machine Learning approaches at video compression Machine Learning approaches at video compression
Machine Learning approaches at video compression
 
Hybrid Approach for Robust Digital Video Watermarking
Hybrid Approach for Robust Digital Video WatermarkingHybrid Approach for Robust Digital Video Watermarking
Hybrid Approach for Robust Digital Video Watermarking
 
"Demystifying Deep Neural Networks," a Presentation from BDTI
"Demystifying Deep Neural Networks," a Presentation from BDTI"Demystifying Deep Neural Networks," a Presentation from BDTI
"Demystifying Deep Neural Networks," a Presentation from BDTI
 
A Hybrid DWT-SVD Method for Digital Video Watermarking Using Random Frame Sel...
A Hybrid DWT-SVD Method for Digital Video Watermarking Using Random Frame Sel...A Hybrid DWT-SVD Method for Digital Video Watermarking Using Random Frame Sel...
A Hybrid DWT-SVD Method for Digital Video Watermarking Using Random Frame Sel...
 
C0161018
C0161018C0161018
C0161018
 
C0161018
C0161018C0161018
C0161018
 
Video Compression Algorithm Based on Frame Difference Approaches
Video Compression Algorithm Based on Frame Difference Approaches Video Compression Algorithm Based on Frame Difference Approaches
Video Compression Algorithm Based on Frame Difference Approaches
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
 

More from Sangmin Woo

Multimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptxMultimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptx
Sangmin Woo
 
Masked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptxMasked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptx
Sangmin Woo
 
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptxAn Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
Sangmin Woo
 
Visual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptxVisual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptx
Sangmin Woo
 
Video Grounding.pptx
Video Grounding.pptxVideo Grounding.pptx
Video Grounding.pptx
Sangmin Woo
 
Action Recognition Datasets.pptx
Action Recognition Datasets.pptxAction Recognition Datasets.pptx
Action Recognition Datasets.pptx
Sangmin Woo
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
Sangmin Woo
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Sangmin Woo
 
Towards Efficient Transformers
Towards Efficient TransformersTowards Efficient Transformers
Towards Efficient Transformers
Sangmin Woo
 
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsAction Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Sangmin Woo
 
Neural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global contextNeural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global context
Sangmin Woo
 
Attentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsAttentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene Graphs
Sangmin Woo
 
Graph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationGraph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph Generation
Sangmin Woo
 

More from Sangmin Woo (13)

Multimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptxMultimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptx
 
Masked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptxMasked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptx
 
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptxAn Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
 
Visual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptxVisual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptx
 
Video Grounding.pptx
Video Grounding.pptxVideo Grounding.pptx
Video Grounding.pptx
 
Action Recognition Datasets.pptx
Action Recognition Datasets.pptxAction Recognition Datasets.pptx
Action Recognition Datasets.pptx
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
 
Towards Efficient Transformers
Towards Efficient TransformersTowards Efficient Transformers
Towards Efficient Transformers
 
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsAction Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
 
Neural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global contextNeural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global context
 
Attentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsAttentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene Graphs
 
Graph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationGraph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph Generation
 

Recently uploaded

Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 

Recently uploaded (20)

Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 

Video Transformers.pptx

  • 1. 2022-04-21 Sangmin Woo Computational Intelligence Lab. School of Electrical Engineering Korea Advanced Institute of Science and Technology (KAIST) Video Transformers
  • 2. 2 Video Transformers VTN: Video Transformer Network Spatial Backbones (ViT/ResNet/DeiT) + Temporal Transformer (Longformer) Neimark, Daniel, et al. "Video transformer network." arXiv 21.02
  • 3. 3 Video Transformers TimeSformer Axial (T+W+H) < Local-Global < Space < Joint Space-Time < Space + Time Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?." arXiv 21.02.
  • 4. 4 Video Transformers STAM: Space Time Attention Model Spatial Encoder (ViT-B: 12-layer) + Temporal Transformer (6-layer, 8- head) Sharir, Gilad, et al. “An Image is Worth 16x16 Words, What is a Video Worth?” arxiv 21.03
  • 5. 5 Video Transformers ViViT: Video Vision Transformer 1. Joint Space-Time (Best on Kinetics 400) 2. Factorized Encoder (Best on Epic Kitchens; Space -> Time) 3. Factorized Self-Attention (Space-Time -> Space-Time) 4. Factorized Dot-Product (Half Head: Space, Half Head: Time) Arnab, Anurag, et al. "Vivit: A video vision transformer." arXiv 21.03.
  • 6. 6 Video Transformers VidTr: Video Transformer Space < Joint < Time + Space ≈ Space + Time Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv 21.04
  • 7. 7 Takeaways from Recent Transformers Efficient Transformers  Deformable DETR [1]  DeiT: Data-efficient image Transformers; Distillation Token [3] Video Transformers  Spatial Encoder* + Temporal Transformer [4, 6] * ResNet / ViT [2] / DeiT [3]  Axial < Local-Global< Space < Joint Space-Time < Space + Time [5]  Joint Space-Time ≈ Space + Time; Tubelet Embedding [7]  Space < Joint < Time + Space ≈ Space + Time [8]  Video Swin Transformer; Shifted Window [9] [1] Zhu, Xizhou, et al. "Deformable DETR: Deformable transformers for end-to-end object detection." ICLR 2021 [2] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021 [3] Touvron, Hugo, et al. "Training data-efficient image transformers & distillation through attention." ICML 2021 [4] Neimark, Daniel, et al. "Video transformer network." arXiv 21.02 [5] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?." arXiv 21.02. [6] Sharir, Gilad, et al. “An Image is Worth 16x16 Words, What is a Video Worth?” arxiv 21.03 [7] Arnab, Anurag, et al. "Vivit: A video vision transformer." arXiv 21.03. [8] Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv 21.04 [9] Liu, Ze, et al. "Video swin transformer." arXiv 21.06
  • 8. 8 Common Practice During inference, due to the addition of a temporal dimension, 3D networks are limited by memory and runtime to clips of a small spatial scale and a low number of frames. In I3D, the authors use the whole video during inference, averaging predictions temporally. More recent studies that achieved state-of-the-art results processed numerous, but relatively short, clips during inference. In Non-local, inference is done by sampling 10 clips evenly from the full-length video and average the softmax scores to achieve the final prediction. SlowFast follows the same practice and introduces the term “view” – a temporal clip with a spatial crop. SlowFast uses 10 temporal clips with 3 spatial crops at inference time; thus, 30 different views are averaged for the final prediction. X3D follows the same practice, but in addition, it uses larger spatial scales to achieve its best results on 30 different views. This common practice of multi-view inference is somewhat counterintuitive, especially when handling long videos. A more intuitive way is to “look” at the entire video context before deciding on the action, rather than viewing only small portions of it. [1] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?." arXiv 21.02. [2] Neimark, Daniel, et al. "Video transformer network." arXiv 21.02
  • 9. 9 Design Choices TimeSformer [1]  16, 32 frames / clip (2.56 or 5.12 seconds)  Spatial dimension: 224x224  Multi-view inference: sample 10 clips uniformly & crop 3 views per clip → avg. softmax scores of 30 predictions (i.e., 10x3 views)  Full video inference: read all frames & sub-sample to 250 frames uniformly VTN (Video Transformer Network) [2]  8 frames / clip (frame sample rate: 1/16)  Spatial dimension: 224x224  Patch size: 16x16  Inference: sample center clip & crop 3 views → avg. softmax scores of 3 predictions STAM (Space Time Attention Model) [3]  16 frames/video vs. 32 vs. 64: 16 (79.3) < 32 (79.9) < 64 (80.5)  Spatial dimension: 224x224 [1] Bertasius, Gedas, et al. "Is Space-Time Attention All You Need for Video Understanding?." arXiv 21.02. [2] Neimark, Daniel, et al. "Video transformer network." arXiv 21.02 [3] Sharir, Gilad, et al. “An Image is Worth 16x16 Words, What is a Video Worth?” arxiv 21.03
  • 10. 10 Design Choices ViViT [1]  32 frames / clip (stride: 2; sampling 8 frames; tubelet length=4)  Crop size ∈ [224, 320]: 224 (80.3/58.9) vs. 288 (80.7/147.6) vs. 320 (81.0/238.8)  32 frame-tubelet length 4 vs. 64-8 vs. 128-16: 32-4 > 64-8 > 128-16  Inference: 4 views VidTr [2]  Cubic patch (4x16x16) vs. Square patch (1x16x16): Cubic (73.1) < Square (75.5)  Patch size 16x16 vs. 32x32: 16 (77.7) > 32 (71.2)  Inference: 10x3 views Video Swin Transformer [3]  32 frames / clip (temporal stride: 2)  Spatial dimension: 224x224  3D token size: 16x56x56  Inference: 4x3 views [1] Arnab, Anurag, et al. "Vivit: A video vision transformer." arXiv 21.03. [2] Li, Xinyu, et al. "VidTr: Video Transformer Without Convolutions." arXiv 21.04 [3] Liu, Ze, et al. "Video swin transformer." arXiv 21.06
  • 11. 11 How to Design Video Transformer? Factorized Space-Time Transformers (Space first -> Time next) - Most of the Video Transformers follow this approach - 2-step: Spatial Transformer + Temporal Transformer - Spatial Backbone: ResNet101? ViT? - Input: Image patches -> Spatial Trans. -> Output Feat. -> Temporal Trans. - Number of frames: 16 , 32 - Computationally tractable (P: #patches, F: #frames): P2 + F2 Joint Space-Time Transformer - 1-step: Single Transformer (Joint Space-Time Attention) - Input: Flattened video patches - Number of frames: ? - Computationally expensive: P2 × F2 DETR-style (C3D + Transformer) - C3D backbone -> Output Feature -> Flatten -> Transformer
  • 13. 13 Appendix: Vision Transformers Sorted by Date https://github.com/DirtyHarryLYL/Transformer-in-Vision Sorted by Task https://github.com/Yangzhangcst/Transformer-in- Computer-Vision Sorted by Venue https://github.com/dk-liang/Awesome-Visual-Transformer

Editor's Notes

  1. Thank you.