SlideShare a Scribd company logo
2021.10.27.
KAIST ALIN-LAB
Sangwoo Mo
1
Goal: Video Recognition
2
• Understand what is happening in the video (extension of image recognition)
• Action recognition (i.e., classification)
• Spatio-temporal action detection
Background: Video Transformers
3
• Transformer architectures have shown remarkable success in video recognition
• Extending Vision Transformer (ViT), apply attention over 𝑇×𝐻𝑊 patch tokens
• Previous works focused on designing an efficient attention over the patch tokens
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Background: Video Transformers
4
• Transformer architectures have shown remarkable success in video recognition
• Naïve approach = Joint Attention (attention over all patches)
Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
𝑆 = 𝐻𝑊
Background: Video Transformers
5
• Transformer architectures have shown remarkable success in video recognition
• Divided Attention: Each patch attends to the spatial and temporal patches alternatively
Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
𝑆 = 𝐻𝑊
Background: Video Transformers
6
• Transformer architectures have shown remarkable success in video recognition
• Since divided attention only (temporally) attends to the same position of the patch,
it does not catch the moving trajectory of the objects
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
Background: Video Transformers
7
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: Divide attention operation in two stages
1. Compute attention map over all space-time patches (𝑠𝑡 × 𝑠!𝑡!)
then apply spatial pooling to make trajectory features (𝑠𝑡 × 𝑡!)
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
Background: Video Transformers
8
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: Divide attention operation in two stages
2. Apply temporal attention over the trajectory features (𝑠𝑡)
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
Background: Video Transformers
9
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: Divide attention operation in two stages
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
However, it still does not explicitly model the objects!
Only aggregating the effects of all possible spatio-temporal relations
Method: Object-Region Video Transformer (ORViT)
10
• Idea: The attention should be applied in object level1, in addition to the patch level
• The patch attends to the all objects and patches in all time frames2
1. Objects locations are precomputed by an off-the-shelf detector. Use a fixed number of objects depends on datasets.
2. It increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which slightly increases the computational cost.
Method: Object-Region Video Transformer (ORViT)
11
• Idea: The attention should be applied in object level1, in addition to the patch level
• The patch attends to the all objects and patches in all time frames2
• Specifically, ORViT considers three aspects of the objects:
• Objects (themselves)
• Interactions over objects
• Dynamics of objects
1. Objects locations are precomputed by an off-the-shelf detector. Use a fixed number of objects depends on datasets.
2. It increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which slightly increases the computational cost.
Method: Object-Region Video Transformer (ORViT)
12
• Idea: The attention should be applied in object level1, in addition to the patch level
• The patch attends to the all objects and patches in all time frames2
• Specifically, ORViT considers three aspects of the objects:
• Objects (themselves)
• Interactions over objects
• Dynamics of objects
1. Objects locations are precomputed by an off-the-shelf detector. Use a fixed number of objects depends on datasets.
2. It increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which slightly increases the computational cost.
Object-Region Attention
Object-Dynamics Module
Method: Object-Region Attention
13
• Object-Region Attention computes attention over both patches and objects
• Query: patches / Key & Value: patches + objects
• Object features are given by the ROIAlign (and MaxPool) of patch features
where the coordinate embedding is given by
the sum of MLP(𝐵) and learnable vector 𝑃
Method: Object-Dynamics Module
14
• Object-Dynamics Module computes attention over object locations
• Then, the dynamics features are spatially expanded by Box Position Encoder
The coordinate embedding
is given by the sum of .
MLP 𝐵
and learnable vector /
𝑃
Query & Key & Value: objects
Method: Overall ORViT Block
15
• Substitute attention blocks to the ORViT blocks
• It is important to apply the ORViT blocks in the lower layers
Results: Action Recognition
16
• ORViT significantly improves the baseline models
* Use detected boxes for Diving48 and Epic-Kitchens100. Yet, ORViT gives 8% improvement for Diving48.
Note that the box quality is
important, as shown in (a)
Results: Compositional Action Recognition
17
• ORViT is more effective for the for the following scenarios:1
• Compositional: Class = verb + noun / some test combinations are not in the training set
• Few-shot: Train on base classes, and fine-tune on few-shot novel classes
1. Indeed, ORViT better disentangles the objects (noun) and actions (verb).
SomethingElse dataset
Results: Spatio-temporal Action Detection
18
• ORViT also works well for spatio-temporal action detection
• Apply RoIAlign head on top of the spatio-temporal features
• All models use same boxes; hence, only differ from the box classification
Results: Ablation Study
19
• All proposed components contribute to the performance
• It is crucial to apply the ORViT module in lower layers (layer 2 ≫ layer 12)
• Cf. Trajectory attention performs the best
Results: Attention Maps (CLS)
20
• ORViT better attends on the salient objects of the video
• ORViT-Mformer consistently attends on the papers (main objects of the video) while
Mformer attends on the human face (salient for the scene, but not for the whole video)
* Attention map corresponding to the CLS query.
Results: Attention Maps (Objects)
21
• The attention map of each object visualizes its affecting regions
• Note that remote controllers attend on their regions, while hand has a broader map
* Attention map of each object to the patches.
22
Thank you for listening! 😀

More Related Content

What's hot

Learning Theory 101 ...and Towards Learning the Flat Minima
Learning Theory 101 ...and Towards Learning the Flat MinimaLearning Theory 101 ...and Towards Learning the Flat Minima
Learning Theory 101 ...and Towards Learning the Flat Minima
Sangwoo Mo
 
Sharpness-aware minimization (SAM)
Sharpness-aware minimization (SAM)Sharpness-aware minimization (SAM)
Sharpness-aware minimization (SAM)
Sangwoo Mo
 
Introduction to Diffusion Models
Introduction to Diffusion ModelsIntroduction to Diffusion Models
Introduction to Diffusion Models
Sangwoo Mo
 
Meta-Learning with Implicit Gradients
Meta-Learning with Implicit GradientsMeta-Learning with Implicit Gradients
Meta-Learning with Implicit Gradients
Sangwoo Mo
 
Bayesian Model-Agnostic Meta-Learning
Bayesian Model-Agnostic Meta-LearningBayesian Model-Agnostic Meta-Learning
Bayesian Model-Agnostic Meta-Learning
Sangwoo Mo
 
Emergence of Invariance and Disentangling in Deep Representations
Emergence of Invariance and Disentangling in Deep RepresentationsEmergence of Invariance and Disentangling in Deep Representations
Emergence of Invariance and Disentangling in Deep Representations
Sangwoo Mo
 
Deep Learning Theory Seminar (Chap 3, part 2)
Deep Learning Theory Seminar (Chap 3, part 2)Deep Learning Theory Seminar (Chap 3, part 2)
Deep Learning Theory Seminar (Chap 3, part 2)
Sangwoo Mo
 
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
Koh Takeuchi
 
Focal loss for dense object detection
Focal loss for dense object detectionFocal loss for dense object detection
Focal loss for dense object detection
DaeHeeKim31
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
Sangmin Woo
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with Performers
Joonhyung Lee
 
Pelee: a real time object detection system on mobile devices Paper Review
Pelee: a real time object detection system on mobile devices Paper ReviewPelee: a real time object detection system on mobile devices Paper Review
Pelee: a real time object detection system on mobile devices Paper Review
LEE HOSEONG
 
PR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningPR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation Learning
Sungchul Kim
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
Kien Le
 
Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)
Universitat Politècnica de Catalunya
 
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Universitat Politècnica de Catalunya
 
A beginner's guide to Style Transfer and recent trends
A beginner's guide to Style Transfer and recent trendsA beginner's guide to Style Transfer and recent trends
A beginner's guide to Style Transfer and recent trends
JaeJun Yoo
 
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
Taiji Suzuki
 
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
Jinwon Lee
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
Sungchul Kim
 

What's hot (20)

Learning Theory 101 ...and Towards Learning the Flat Minima
Learning Theory 101 ...and Towards Learning the Flat MinimaLearning Theory 101 ...and Towards Learning the Flat Minima
Learning Theory 101 ...and Towards Learning the Flat Minima
 
Sharpness-aware minimization (SAM)
Sharpness-aware minimization (SAM)Sharpness-aware minimization (SAM)
Sharpness-aware minimization (SAM)
 
Introduction to Diffusion Models
Introduction to Diffusion ModelsIntroduction to Diffusion Models
Introduction to Diffusion Models
 
Meta-Learning with Implicit Gradients
Meta-Learning with Implicit GradientsMeta-Learning with Implicit Gradients
Meta-Learning with Implicit Gradients
 
Bayesian Model-Agnostic Meta-Learning
Bayesian Model-Agnostic Meta-LearningBayesian Model-Agnostic Meta-Learning
Bayesian Model-Agnostic Meta-Learning
 
Emergence of Invariance and Disentangling in Deep Representations
Emergence of Invariance and Disentangling in Deep RepresentationsEmergence of Invariance and Disentangling in Deep Representations
Emergence of Invariance and Disentangling in Deep Representations
 
Deep Learning Theory Seminar (Chap 3, part 2)
Deep Learning Theory Seminar (Chap 3, part 2)Deep Learning Theory Seminar (Chap 3, part 2)
Deep Learning Theory Seminar (Chap 3, part 2)
 
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
 
Focal loss for dense object detection
Focal loss for dense object detectionFocal loss for dense object detection
Focal loss for dense object detection
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with Performers
 
Pelee: a real time object detection system on mobile devices Paper Review
Pelee: a real time object detection system on mobile devices Paper ReviewPelee: a real time object detection system on mobile devices Paper Review
Pelee: a real time object detection system on mobile devices Paper Review
 
PR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningPR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation Learning
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)
 
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
Optimizing Deep Networks (D1L6 Insight@DCU Machine Learning Workshop 2017)
 
A beginner's guide to Style Transfer and recent trends
A beginner's guide to Style Transfer and recent trendsA beginner's guide to Style Transfer and recent trends
A beginner's guide to Style Transfer and recent trends
 
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
 
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
 

Similar to Object-Region Video Transformers

Review: You Only Look One-level Feature
Review: You Only Look One-level FeatureReview: You Only Look One-level Feature
Review: You Only Look One-level Feature
Dongmin Choi
 
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
Edge AI and Vision Alliance
 
150807 Fast R-CNN
150807 Fast R-CNN150807 Fast R-CNN
150807 Fast R-CNN
Junho Cho
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
Seunghyun Hwang
 
20220811 - computer vision
20220811 - computer vision20220811 - computer vision
20220811 - computer vision
Jamie (Taka) Wang
 
Object Detection Beyond Mask R-CNN and RetinaNet I
Object Detection Beyond Mask R-CNN and RetinaNet IObject Detection Beyond Mask R-CNN and RetinaNet I
Object Detection Beyond Mask R-CNN and RetinaNet I
Wanjin Yu
 
MLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, CaptioningMLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, Captioning
Charles Deledalle
 
John W. Vinti Particle Tracker Final Presentation
John W. Vinti Particle Tracker Final PresentationJohn W. Vinti Particle Tracker Final Presentation
John W. Vinti Particle Tracker Final PresentationJohn Vinti
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Sergey Karayev
 
Object Detection - Míriam Bellver - UPC Barcelona 2018
Object Detection - Míriam Bellver - UPC Barcelona 2018Object Detection - Míriam Bellver - UPC Barcelona 2018
Object Detection - Míriam Bellver - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
Sangwoo Mo
 
Object Detection with Transformers
Object Detection with TransformersObject Detection with Transformers
Object Detection with Transformers
Databricks
 
Unsupervised Video Summarization via Attention-Driven Adversarial Learning
Unsupervised Video Summarization via Attention-Driven Adversarial LearningUnsupervised Video Summarization via Attention-Driven Adversarial Learning
Unsupervised Video Summarization via Attention-Driven Adversarial Learning
VasileiosMezaris
 
Constrained Optimization with Genetic Algorithms and Project Bonsai
Constrained Optimization with Genetic Algorithms and Project BonsaiConstrained Optimization with Genetic Algorithms and Project Bonsai
Constrained Optimization with Genetic Algorithms and Project Bonsai
Ivo Andreev
 
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
thanhdowork
 
D3L4-objects.pdf
D3L4-objects.pdfD3L4-objects.pdf
D3L4-objects.pdf
ssusere945ae
 
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Saimunur Rahman
 
"3D from 2D: Theory, Implementation, and Applications of Structure from Motio...
"3D from 2D: Theory, Implementation, and Applications of Structure from Motio..."3D from 2D: Theory, Implementation, and Applications of Structure from Motio...
"3D from 2D: Theory, Implementation, and Applications of Structure from Motio...
Edge AI and Vision Alliance
 
Object-Centric Debugging: a preview
Object-Centric Debugging: a previewObject-Centric Debugging: a preview
Object-Centric Debugging: a preview
Pharo
 

Similar to Object-Region Video Transformers (20)

Review: You Only Look One-level Feature
Review: You Only Look One-level FeatureReview: You Only Look One-level Feature
Review: You Only Look One-level Feature
 
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
 
150807 Fast R-CNN
150807 Fast R-CNN150807 Fast R-CNN
150807 Fast R-CNN
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
 
20220811 - computer vision
20220811 - computer vision20220811 - computer vision
20220811 - computer vision
 
Object Detection Beyond Mask R-CNN and RetinaNet I
Object Detection Beyond Mask R-CNN and RetinaNet IObject Detection Beyond Mask R-CNN and RetinaNet I
Object Detection Beyond Mask R-CNN and RetinaNet I
 
MLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, CaptioningMLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, Captioning
 
John W. Vinti Particle Tracker Final Presentation
John W. Vinti Particle Tracker Final PresentationJohn W. Vinti Particle Tracker Final Presentation
John W. Vinti Particle Tracker Final Presentation
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Object Detection - Míriam Bellver - UPC Barcelona 2018
Object Detection - Míriam Bellver - UPC Barcelona 2018Object Detection - Míriam Bellver - UPC Barcelona 2018
Object Detection - Míriam Bellver - UPC Barcelona 2018
 
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
 
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
 
Object Detection with Transformers
Object Detection with TransformersObject Detection with Transformers
Object Detection with Transformers
 
Unsupervised Video Summarization via Attention-Driven Adversarial Learning
Unsupervised Video Summarization via Attention-Driven Adversarial LearningUnsupervised Video Summarization via Attention-Driven Adversarial Learning
Unsupervised Video Summarization via Attention-Driven Adversarial Learning
 
Constrained Optimization with Genetic Algorithms and Project Bonsai
Constrained Optimization with Genetic Algorithms and Project BonsaiConstrained Optimization with Genetic Algorithms and Project Bonsai
Constrained Optimization with Genetic Algorithms and Project Bonsai
 
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
 
D3L4-objects.pdf
D3L4-objects.pdfD3L4-objects.pdf
D3L4-objects.pdf
 
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
 
"3D from 2D: Theory, Implementation, and Applications of Structure from Motio...
"3D from 2D: Theory, Implementation, and Applications of Structure from Motio..."3D from 2D: Theory, Implementation, and Applications of Structure from Motio...
"3D from 2D: Theory, Implementation, and Applications of Structure from Motio...
 
Object-Centric Debugging: a preview
Object-Centric Debugging: a previewObject-Centric Debugging: a preview
Object-Centric Debugging: a preview
 

More from Sangwoo Mo

Brief History of Visual Representation Learning
Brief History of Visual Representation LearningBrief History of Visual Representation Learning
Brief History of Visual Representation Learning
Sangwoo Mo
 
Learning Visual Representations from Uncurated Data
Learning Visual Representations from Uncurated DataLearning Visual Representations from Uncurated Data
Learning Visual Representations from Uncurated Data
Sangwoo Mo
 
Hyperbolic Deep Reinforcement Learning
Hyperbolic Deep Reinforcement LearningHyperbolic Deep Reinforcement Learning
Hyperbolic Deep Reinforcement Learning
Sangwoo Mo
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture Note
Sangwoo Mo
 
Generative Models for General Audiences
Generative Models for General AudiencesGenerative Models for General Audiences
Generative Models for General Audiences
Sangwoo Mo
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
Sangwoo Mo
 
Neural Processes
Neural ProcessesNeural Processes
Neural Processes
Sangwoo Mo
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)
Sangwoo Mo
 
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable...
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable...REBAR: Low-variance, unbiased gradient estimates for discrete latent variable...
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable...
Sangwoo Mo
 
Topology for Computing: Homology
Topology for Computing: HomologyTopology for Computing: Homology
Topology for Computing: Homology
Sangwoo Mo
 
Reinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based PoliciesReinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based Policies
Sangwoo Mo
 
Statistical Decision Theory
Statistical Decision TheoryStatistical Decision Theory
Statistical Decision Theory
Sangwoo Mo
 
On Unifying Deep Generative Models
On Unifying Deep Generative ModelsOn Unifying Deep Generative Models
On Unifying Deep Generative Models
Sangwoo Mo
 
Dropout as a Bayesian Approximation
Dropout as a Bayesian ApproximationDropout as a Bayesian Approximation
Dropout as a Bayesian Approximation
Sangwoo Mo
 

More from Sangwoo Mo (14)

Brief History of Visual Representation Learning
Brief History of Visual Representation LearningBrief History of Visual Representation Learning
Brief History of Visual Representation Learning
 
Learning Visual Representations from Uncurated Data
Learning Visual Representations from Uncurated DataLearning Visual Representations from Uncurated Data
Learning Visual Representations from Uncurated Data
 
Hyperbolic Deep Reinforcement Learning
Hyperbolic Deep Reinforcement LearningHyperbolic Deep Reinforcement Learning
Hyperbolic Deep Reinforcement Learning
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture Note
 
Generative Models for General Audiences
Generative Models for General AudiencesGenerative Models for General Audiences
Generative Models for General Audiences
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Neural Processes
Neural ProcessesNeural Processes
Neural Processes
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)
 
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable...
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable...REBAR: Low-variance, unbiased gradient estimates for discrete latent variable...
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable...
 
Topology for Computing: Homology
Topology for Computing: HomologyTopology for Computing: Homology
Topology for Computing: Homology
 
Reinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based PoliciesReinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based Policies
 
Statistical Decision Theory
Statistical Decision TheoryStatistical Decision Theory
Statistical Decision Theory
 
On Unifying Deep Generative Models
On Unifying Deep Generative ModelsOn Unifying Deep Generative Models
On Unifying Deep Generative Models
 
Dropout as a Bayesian Approximation
Dropout as a Bayesian ApproximationDropout as a Bayesian Approximation
Dropout as a Bayesian Approximation
 

Recently uploaded

Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 

Recently uploaded (20)

Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 

Object-Region Video Transformers

  • 2. Goal: Video Recognition 2 • Understand what is happening in the video (extension of image recognition) • Action recognition (i.e., classification) • Spatio-temporal action detection
  • 3. Background: Video Transformers 3 • Transformer architectures have shown remarkable success in video recognition • Extending Vision Transformer (ViT), apply attention over 𝑇×𝐻𝑊 patch tokens • Previous works focused on designing an efficient attention over the patch tokens Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
  • 4. Background: Video Transformers 4 • Transformer architectures have shown remarkable success in video recognition • Naïve approach = Joint Attention (attention over all patches) Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021. 𝑆 = 𝐻𝑊
  • 5. Background: Video Transformers 5 • Transformer architectures have shown remarkable success in video recognition • Divided Attention: Each patch attends to the spatial and temporal patches alternatively Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021. 𝑆 = 𝐻𝑊
  • 6. Background: Video Transformers 6 • Transformer architectures have shown remarkable success in video recognition • Since divided attention only (temporally) attends to the same position of the patch, it does not catch the moving trajectory of the objects Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
  • 7. Background: Video Transformers 7 • Transformer architectures have shown remarkable success in video recognition • Trajectory Attention: Divide attention operation in two stages 1. Compute attention map over all space-time patches (𝑠𝑡 × 𝑠!𝑡!) then apply spatial pooling to make trajectory features (𝑠𝑡 × 𝑡!) Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
  • 8. Background: Video Transformers 8 • Transformer architectures have shown remarkable success in video recognition • Trajectory Attention: Divide attention operation in two stages 2. Apply temporal attention over the trajectory features (𝑠𝑡) Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
  • 9. Background: Video Transformers 9 • Transformer architectures have shown remarkable success in video recognition • Trajectory Attention: Divide attention operation in two stages Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021. However, it still does not explicitly model the objects! Only aggregating the effects of all possible spatio-temporal relations
  • 10. Method: Object-Region Video Transformer (ORViT) 10 • Idea: The attention should be applied in object level1, in addition to the patch level • The patch attends to the all objects and patches in all time frames2 1. Objects locations are precomputed by an off-the-shelf detector. Use a fixed number of objects depends on datasets. 2. It increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which slightly increases the computational cost.
  • 11. Method: Object-Region Video Transformer (ORViT) 11 • Idea: The attention should be applied in object level1, in addition to the patch level • The patch attends to the all objects and patches in all time frames2 • Specifically, ORViT considers three aspects of the objects: • Objects (themselves) • Interactions over objects • Dynamics of objects 1. Objects locations are precomputed by an off-the-shelf detector. Use a fixed number of objects depends on datasets. 2. It increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which slightly increases the computational cost.
  • 12. Method: Object-Region Video Transformer (ORViT) 12 • Idea: The attention should be applied in object level1, in addition to the patch level • The patch attends to the all objects and patches in all time frames2 • Specifically, ORViT considers three aspects of the objects: • Objects (themselves) • Interactions over objects • Dynamics of objects 1. Objects locations are precomputed by an off-the-shelf detector. Use a fixed number of objects depends on datasets. 2. It increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which slightly increases the computational cost. Object-Region Attention Object-Dynamics Module
  • 13. Method: Object-Region Attention 13 • Object-Region Attention computes attention over both patches and objects • Query: patches / Key & Value: patches + objects • Object features are given by the ROIAlign (and MaxPool) of patch features where the coordinate embedding is given by the sum of MLP(𝐵) and learnable vector 𝑃
  • 14. Method: Object-Dynamics Module 14 • Object-Dynamics Module computes attention over object locations • Then, the dynamics features are spatially expanded by Box Position Encoder The coordinate embedding is given by the sum of . MLP 𝐵 and learnable vector / 𝑃 Query & Key & Value: objects
  • 15. Method: Overall ORViT Block 15 • Substitute attention blocks to the ORViT blocks • It is important to apply the ORViT blocks in the lower layers
  • 16. Results: Action Recognition 16 • ORViT significantly improves the baseline models * Use detected boxes for Diving48 and Epic-Kitchens100. Yet, ORViT gives 8% improvement for Diving48. Note that the box quality is important, as shown in (a)
  • 17. Results: Compositional Action Recognition 17 • ORViT is more effective for the for the following scenarios:1 • Compositional: Class = verb + noun / some test combinations are not in the training set • Few-shot: Train on base classes, and fine-tune on few-shot novel classes 1. Indeed, ORViT better disentangles the objects (noun) and actions (verb). SomethingElse dataset
  • 18. Results: Spatio-temporal Action Detection 18 • ORViT also works well for spatio-temporal action detection • Apply RoIAlign head on top of the spatio-temporal features • All models use same boxes; hence, only differ from the box classification
  • 19. Results: Ablation Study 19 • All proposed components contribute to the performance • It is crucial to apply the ORViT module in lower layers (layer 2 ≫ layer 12) • Cf. Trajectory attention performs the best
  • 20. Results: Attention Maps (CLS) 20 • ORViT better attends on the salient objects of the video • ORViT-Mformer consistently attends on the papers (main objects of the video) while Mformer attends on the human face (salient for the scene, but not for the whole video) * Attention map corresponding to the CLS query.
  • 21. Results: Attention Maps (Objects) 21 • The attention map of each object visualizes its affecting regions • Note that remote controllers attend on their regions, while hand has a broader map * Attention map of each object to the patches.
  • 22. 22 Thank you for listening! 😀