1) The document discusses object-region video transformers (ORViT) for video recognition. ORViT applies attention at both the patch and object levels.
2) ORViT considers three aspects of objects: the objects themselves, interactions between objects, and object dynamics over time.
3) Experimental results show that ORViT outperforms baseline models on action recognition, compositional action recognition, and spatio-temporal action detection. ORViT captures object-level information and dynamics better than patch-level attention alone.
2. Goal: Video Recognition
• Understand what is happening in the video (extension of image recognition)
• Action recognition (i.e., classification)
• Spatio-temporal action detection
3. Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Extending the Vision Transformer (ViT), attention is applied over 𝑇×𝐻𝑊 patch tokens
• Previous works focused on designing efficient attention over the patch tokens
Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
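To make the token layout concrete, here is a minimal PyTorch sketch (assumed sizes, not from the paper) of turning a video clip into 𝑇×𝐻𝑊 patch tokens:

```python
# Minimal sketch (assumed shapes, not the authors' code): turning a video clip
# into T*H*W patch tokens, as in ViT extended to video.
import torch
import torch.nn as nn

B, T, C, H_img, W_img, P, D = 2, 8, 3, 224, 224, 16, 768  # assumed sizes

# One linear projection per P x P patch, applied frame by frame.
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)

video = torch.randn(B, T, C, H_img, W_img)
x = patch_embed(video.flatten(0, 1))          # (B*T, D, H, W) with H = W = 14
H, W = x.shape[-2:]
tokens = x.flatten(2).transpose(1, 2)         # (B*T, H*W, D)
tokens = tokens.reshape(B, T * H * W, D)      # (B, T*H*W, D) patch tokens
print(tokens.shape)                           # torch.Size([2, 1568, 768])
```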
4. Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Naïve approach = Joint Attention (attention over all patches)
Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
(𝑆 = 𝐻𝑊 is the number of spatial patches per frame)
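As a rough sketch of the cost, joint attention treats all 𝑇𝑆 tokens as one sequence, so the attention map scales with (𝑇𝑆)² (assumed shapes, not the paper's code):

```python
# Sketch of joint space-time attention: every patch token attends to all
# T*S tokens at once (S = H*W), so cost scales with (T*S)^2.
import torch
import torch.nn as nn

B, T, S, D = 2, 8, 196, 768
tokens = torch.randn(B, T * S, D)

attn = nn.MultiheadAttention(embed_dim=D, num_heads=12, batch_first=True)
out, _ = attn(tokens, tokens, tokens)   # joint attention over all T*S tokens
print(out.shape)                        # torch.Size([2, 1568, 768])
```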
5. Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Divided Attention: Each patch attends to the spatial and temporal patches alternately
Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
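A minimal PyTorch sketch of divided attention under assumed shapes (a temporal pass, then a spatial pass; not the TimeSformer implementation):

```python
# Sketch of divided space-time attention: temporal attention across frames at
# the same spatial position, then spatial attention within each frame.
import torch
import torch.nn as nn

B, T, S, D = 2, 8, 196, 768
x = torch.randn(B, T, S, D)

time_attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)
space_attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)

# Temporal: fold space into the batch; each position attends over T frames.
xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
xt, _ = time_attn(xt, xt, xt)
x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3)

# Spatial: fold time into the batch; each frame attends over its S patches.
xs = x.reshape(B * T, S, D)
xs, _ = space_attn(xs, xs, xs)
x = xs.reshape(B, T, S, D)
print(x.shape)  # torch.Size([2, 8, 196, 768])
```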
6. Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Since divided attention temporally attends only to the patch at the same spatial position,
it does not capture the motion trajectories of objects
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
7. Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: Divide attention operation in two stages
1. Compute the attention map over all space-time patches (𝑠𝑡 × 𝑠′𝑡′),
then apply spatial pooling to make trajectory features (𝑠𝑡 × 𝑡′)
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
8. Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: Divide attention operation in two stages
2. Apply temporal attention over the trajectory features (𝑠𝑡)
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
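Putting the two stages together, here is a rough single-head sketch under assumed shapes (the actual trajectory attention is multi-head with separate second-stage projections):

```python
# Rough single-head sketch of two-stage trajectory attention
# (Patrick et al., 2021); shapes are illustrative assumptions.
import torch

B, T, S, D = 2, 8, 196, 64
q = torch.randn(B, T * S, D)   # one query per space-time patch
k = torch.randn(B, T * S, D)
v = torch.randn(B, T * S, D)

# Stage 1: attention over all space-time pairs (st x s't'), then spatial
# pooling per frame t' to form trajectory features (st x t').
probs = (q @ k.transpose(1, 2) / D ** 0.5).softmax(dim=-1)   # (B, TS, TS)
probs = probs.reshape(B, T * S, T, S)
traj = torch.einsum('bqts,btsd->bqtd', probs, v.reshape(B, T, S, D))

# Stage 2: temporal attention along each trajectory (pool over t').
t_probs = (torch.einsum('bqd,bqtd->bqt', q, traj) / D ** 0.5).softmax(dim=-1)
out = torch.einsum('bqt,bqtd->bqd', t_probs, traj)           # (B, TS, D)
print(out.shape)                                             # torch.Size([2, 1568, 64])
```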
9. Background: Video Transformers
• Transformer architectures have shown remarkable success in video recognition
• Trajectory Attention: Divide attention operation in two stages
Patrick et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
However, it still does not explicitly model the objects!
It only aggregates the effects of all possible spatio-temporal relations
10. Method: Object-Region Video Transformer (ORViT)
• Idea: The attention should be applied at the object level1, in addition to the patch level
• Each patch attends to all objects and patches in all time frames2
1. Object locations are precomputed by an off-the-shelf detector; a fixed number of objects is used, depending on the dataset.
2. It increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which slightly increases the computational cost.
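For instance, with 𝑇 = 8, 𝐻 = 𝑊 = 14 (a 224×224 input with 16×16 patches), and 𝑂 = 4 objects (illustrative numbers, not from the paper), the token count grows from 8·196 = 1568 to 1568 + 8·4 = 1600, an increase of about 2%.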
11. Method: Object-Region Video Transformer (ORViT)
• Idea: The attention should be applied at the object level1, in addition to the patch level
• Each patch attends to all objects and patches in all time frames2
• Specifically, ORViT considers three aspects of the objects:
• Objects (themselves)
• Interactions between objects
• Dynamics of objects
1. Object locations are precomputed by an off-the-shelf detector; a fixed number of objects is used, depending on the dataset.
2. It increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which slightly increases the computational cost.
12. Method: Object-Region Video Transformer (ORViT)
• Idea: The attention should be applied at the object level1, in addition to the patch level
• Each patch attends to all objects and patches in all time frames2
• Specifically, ORViT considers three aspects of the objects:
• Objects (themselves)
• Interactions between objects
• Dynamics of objects
1. Object locations are precomputed by an off-the-shelf detector; a fixed number of objects is used, depending on the dataset.
2. It increases the number of tokens to 𝑇𝐻𝑊 + 𝑇𝑂, which slightly increases the computational cost.
• The ORViT block comprises two components: Object-Region Attention and the Object-Dynamics Module
13. Method: Object-Region Attention
• Object-Region Attention computes attention over both patches and objects
• Query: patches / Key & Value: patches + objects
• Object features are given by RoIAlign (and max-pooling) of the patch features,
where the coordinate embedding is the sum of MLP(𝐵) and a learnable vector 𝑃
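A hedged PyTorch sketch of the object-token construction and the patch-to-(patch+object) attention, with assumed shapes and box format (the coordinate embedding is omitted for brevity):

```python
# Sketch of Object-Region Attention: object tokens are RoIAligned
# (then max-pooled) patch features; queries are patches, keys/values are
# patches + objects. Box format and sizes are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

B, T, H, W, D, O = 2, 8, 14, 14, 768, 4
patches = torch.randn(B * T, D, H, W)           # patch features per frame
boxes = torch.rand(B * T, O, 4) * H             # (x1, y1, x2, y2) in feature coords
boxes[..., 2:] += boxes[..., :2]                # ensure x2 >= x1, y2 >= y1

obj = roi_align(patches, list(boxes), output_size=7)   # (B*T*O, D, 7, 7)
obj = obj.amax(dim=(-2, -1))                            # max-pool -> (B*T*O, D)
obj = obj.reshape(B, T * O, D)

patch_tok = patches.flatten(2).transpose(1, 2).reshape(B, T * H * W, D)
kv = torch.cat([patch_tok, obj], dim=1)          # THW + TO keys/values
attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)
out, _ = attn(patch_tok, kv, kv)                 # queries: patches only
print(out.shape)                                 # torch.Size([2, 1568, 768])
```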
14. Method: Object-Dynamics Module
• Object-Dynamics Module computes attention over object locations
• Query & Key & Value: objects
• The coordinate embedding is given by the sum of MLP(𝐵) and a learnable vector 𝑃
• Then, the dynamics features are spatially expanded by the Box Position Encoder
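A minimal sketch of the idea, assuming normalized box coordinates and illustrative MLP sizes (not the authors' code):

```python
# Sketch of the Object-Dynamics Module: attention over coordinate embeddings
# of the O object boxes across T frames (Q, K, V are all objects).
import torch
import torch.nn as nn

B, T, O, D = 2, 8, 4, 768
boxes = torch.rand(B, T, O, 4)                      # normalized box coordinates

coord_mlp = nn.Sequential(nn.Linear(4, D), nn.ReLU(), nn.Linear(D, D))
pos = nn.Parameter(torch.zeros(1, T, O, D))         # learnable vector P

emb = coord_mlp(boxes) + pos                        # MLP(B) + P
emb = emb.reshape(B, T * O, D)
attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)
dyn, _ = attn(emb, emb, emb)                        # objects attend to objects
print(dyn.shape)                                    # torch.Size([2, 32, 768])
```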
15. Method: Overall ORViT Block
• Substitute the standard attention blocks with ORViT blocks
• It is important to apply the ORViT blocks in the lower layers
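Structurally, the substitution might look like the following sketch; `TransformerLayer` and `ORViTBlock` are hypothetical stand-ins, not the paper's code:

```python
# Structural sketch of swapping an ORViT block into a video transformer.
# The ablation suggests placing the ORViT block at a lower layer (e.g., layer 2).
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Stand-in for a standard (e.g., trajectory-attention) block."""

class ORViTBlock(nn.Module):
    """Stand-in for Object-Region Attention + Object-Dynamics Module."""

layers = nn.ModuleList([TransformerLayer() for _ in range(12)])
layers[2] = ORViTBlock()   # substitute the attention block at a lower layer
```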
16. Results: Action Recognition
• ORViT significantly improves the baseline models
* Detected boxes are used for Diving48 and Epic-Kitchens100; still, ORViT gives an 8% improvement on Diving48.
Note that box quality is important, as shown in (a)
17. Results: Compositional Action Recognition
• ORViT is more effective in the following scenarios:1
• Compositional: Class = verb + noun / some test combinations are not in the training set
• Few-shot: Train on base classes, and fine-tune on few-shot novel classes
1. Indeed, ORViT better disentangles the objects (noun) and actions (verb).
* Results on the SomethingElse dataset.
18. Results: Spatio-temporal Action Detection
• ORViT also works well for spatio-temporal action detection
• Apply RoIAlign head on top of the spatio-temporal features
• All models use the same boxes; hence, they differ only in box classification
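A hedged sketch of such a detection head, with assumed shapes and an illustrative classifier (not the paper's exact head):

```python
# Sketch of a detection head: temporally pool the spatio-temporal features,
# then RoIAlign person boxes and classify actions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

B, T, D, H, W, num_classes = 2, 8, 768, 14, 14, 80
feats = torch.randn(B, T, D, H, W).mean(dim=1)      # temporal average -> (B, D, H, W)
person_boxes = [torch.tensor([[1.0, 1.0, 8.0, 12.0]]) for _ in range(B)]

roi = roi_align(feats, person_boxes, output_size=7) # (num_boxes, D, 7, 7)
logits = nn.Linear(D, num_classes)(roi.amax(dim=(-2, -1)))
print(logits.shape)                                 # torch.Size([2, 80])
```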
19. Results: Ablation Study
• All proposed components contribute to the performance
• It is crucial to apply the ORViT module in lower layers (layer 2 ≫ layer 12)
• Cf. trajectory attention performs the best
20. Results: Attention Maps (CLS)
• ORViT attends better to the salient objects of the video
• ORViT-Mformer consistently attends to the papers (the main objects of the video), while
Mformer attends to the human face (salient for the scene, but not for the whole video)
* Attention map corresponding to the CLS query.
21. Results: Attention Maps (Objects)
• The attention map of each object visualizes the regions it influences
• Note that the remote controls attend to their own regions, while the hand has a broader map
* Attention map of each object to the patches.