Augmented reality (AR) is a field that combines real-world and computer-generated data. AR enhances our senses by overlaying graphics, sounds, and other enhancements onto real-world environments in real-time. Key components of AR systems include head-mounted displays, tracking and orientation sensors, and mobile computing devices. AR has applications in areas like maintenance, military, medicine, media, and gaming. While AR faces challenges like accurate tracking, future applications may allow expanding computer screens into real environments and replacing devices like phones with virtual equivalents.
Empathic Computing and Collaborative Immersive Analytics - Mark Billinghurst
This document discusses empathic computing and collaborative immersive analytics. It notes that while fields like scientific and information visualization are well established, little research has looked at collaborative visualization specifically. Collaborative immersive analytics combines mixed reality, visual analytics and computer-supported cooperative work. Empathic computing aims to develop systems that allow sharing experiences, emotions and perspectives using technologies like virtual and augmented reality with physiological sensors. Applying these concepts could enhance communication and understanding for collaborative immersive analytics tasks.
This document summarizes research on deep learning approaches for face recognition. It describes the DeepFace model from Facebook, which used a deep convolutional network trained on 4.4 million faces to achieve state-of-the-art accuracy on the Labeled Faces in the Wild (LFW) dataset. It also summarizes the DeepID2 and DeepID3 models from Chinese University of Hong Kong, which employed joint identification-verification training of convolutional networks and achieved performance comparable or superior to DeepFace on LFW. Evaluation metrics for face verification and identification tasks are also outlined.
This document provides an overview of a course on computer vision called CSCI 455: Intro to Computer Vision. It acknowledges that many of the course slides were modified from other similar computer vision courses. The course will cover topics like image filtering, projective geometry, stereo vision, structure from motion, face detection, object recognition, and convolutional neural networks. It highlights current applications of computer vision like biometrics, mobile apps, self-driving cars, medical imaging, and more. The document discusses challenges in computer vision like viewpoint and illumination variations, occlusion, and local ambiguity. It emphasizes that perception is an inherently ambiguous problem that requires using prior knowledge about the world.
This document discusses research on human action recognition using skeleton data. It introduces issues with skeleton-based action recognition, such as variable scales, view orientations, noise and rate/intra-action variations. It then reviews previous work on skeleton-based action recognition using hand-crafted features and deep learning models. The document proposes two ensemble deep learning models called Ensemble TS-LSTM v1 and v2 that use temporal sliding LSTMs to capture short, medium and long-term dependencies from skeleton sequences for action recognition. Experimental results on standard datasets demonstrate the models outperform previous methods.
This document provides an overview of computer vision techniques including:
1. Using pre-trained CNN models for tasks like classification and object detection. Popular models discussed include AlexNet, VGG, ResNet, YOLO, and DenseNet.
2. Basic CNN operations like convolution, pooling, dropout, and normalization. Feature extraction using CNNs and techniques like transfer learning and fine-tuning pretrained models.
3. Additional computer vision tasks covered include object detection using Haar cascades, stereo vision, pattern detection, and reconstructing images from CNN features. Frameworks like PyTorch and libraries like TensorFlow are also mentioned.
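The convolution and pooling operations mentioned in item 2 can be sketched directly in NumPy. This is a minimal illustration of the mechanics, not how frameworks like PyTorch implement them (they use vectorized, padded, strided variants):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what deep-learning 'convolution' usually computes)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Each output value is the windowed element-wise product sum.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling; trims edges that do not fit a full window."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    x = x[:h, :w].reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))
```

Feature extraction for transfer learning then amounts to running such layers with pretrained weights and training only a new classifier head on the outputs.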
Structure from motion is a computer vision technique used to recover the three-dimensional structure of a scene and the camera motion from a set of images. It can be used to build 3D models of scenes without any prior knowledge of the camera parameters or 3D locations of the scene points. Structure from motion involves detecting feature points in multiple images, matching the features between images, estimating the fundamental matrices between image pairs, and then optimizing a bundle adjustment problem to simultaneously compute the 3D structure and camera motion parameters. Some applications of structure from motion include 3D modeling, surveying, robot navigation, virtual and augmented reality, and visual effects.
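The core geometric step in the pipeline above, estimating the fundamental matrix between an image pair from matched feature points, can be sketched with the classic eight-point algorithm. This is a bare NumPy illustration under ideal, noise-free assumptions; practical systems add Hartley normalization and RANSAC for outlier rejection:

```python
import numpy as np

def estimate_fundamental(pts1, pts2):
    """Eight-point algorithm: find F such that x2^T F x1 = 0 for all matches.

    pts1, pts2: (N, 2) arrays of matched image coordinates, N >= 8.
    """
    x1, y1 = pts1[:, 0], pts1[:, 1]
    x2, y2 = pts2[:, 0], pts2[:, 1]
    # Each correspondence gives one linear constraint on the 9 entries of F.
    A = np.column_stack([x2*x1, x2*y1, x2, y2*x1, y2*y1, y2,
                         x1, y1, np.ones(len(x1))])
    # Solve Af = 0: f is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # A valid fundamental matrix is singular, so enforce rank 2.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0
    return U @ np.diag(S) @ Vt
```

Bundle adjustment then refines the camera parameters and 3D points jointly, using pairwise estimates like this one for initialization.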
The document discusses the fundamental steps in digital image processing, which include image acquisition, restoration, enhancement, representation and description, segmentation, recognition, morphological processing, color processing, compression, and problem domains. It provides examples and brief explanations of each step. For example, it states that image restoration aims to recover a degraded image using knowledge of the degradation process, while enhancement manipulates an image to be more suitable for a specific application based on human preferences. The document also presents a model of the image degradation and restoration process.
The document provides an overview of computer vision including:
- It defines computer vision as using observed image data to infer something about the world.
- It briefly discusses the history of computer vision from early projects in 1966 to David Marr establishing the foundations of modern computer vision in the 1970s.
- It lists several related fields that computer vision draws from including artificial intelligence, information engineering, neurobiology, solid-state physics, and signal processing.
- It provides examples of applications of computer vision such as self-driving vehicles, facial recognition, augmented reality, and uses in smartphones, the web, VR/AR, medical imaging, and insurance.
This document provides an introduction to multiple object tracking (MOT). It discusses the goal of MOT as detecting and linking target objects across frames. It describes common MOT approaches including using boxes or masks to represent objects. The document also categorizes MOT based on factors like whether it tracks a single or multiple classes, in 2D or 3D, using a single or multiple cameras. It reviews old and new evaluation metrics for MOT and highlights state-of-the-art methods on various MOT datasets. In conclusion, it notes that while MOT research is interesting, standardized evaluation metrics and protocols still need improvement.
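The detect-and-link idea at the heart of MOT can be illustrated with the simplest association strategy: greedily matching each existing track to the detection with the highest box overlap (IoU). This sketch is illustrative only; the methods the document surveys use far stronger association (motion models, appearance embeddings, Hungarian matching):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link(prev_tracks, detections, thresh=0.3):
    """Greedily assign each track the unused detection with best IoU above thresh."""
    links, used = {}, set()
    for tid, tbox in prev_tracks.items():
        best, best_j = thresh, None
        for j, d in enumerate(detections):
            score = iou(tbox, d)
            if j not in used and score > best:
                best, best_j = score, j
        if best_j is not None:
            links[tid] = best_j
            used.add(best_j)
    return links
```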
A completed modeling of local binary pattern operator - Win Yu
This document presents the completed local binary pattern (CLBP) operator for texture classification. CLBP generalizes and completes the local binary pattern (LBP) by using a local difference sign-magnitude transform to encode the missing texture information not captured by LBP. The CLBP operator fuses three codes - CLBP_C for the center pixel, CLBP_S for the signs of differences, and CLBP_M for the magnitudes. Experiments on the Outex texture database show CLBP achieves much better classification accuracy than LBP and other state-of-the-art methods.
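The sign/magnitude decomposition that CLBP builds on can be sketched for a single 3x3 neighborhood. This is a simplified illustration: the paper thresholds magnitudes against a global image statistic and uses rotation-invariant uniform codes, whereas here the local mean stands in for that threshold:

```python
import numpy as np

def clbp_codes(patch):
    """Sign (CLBP_S) and magnitude (CLBP_M) codes for a 3x3 patch.

    The local differences d_p between neighbors and center are split into
    signs (the classic LBP bits) and magnitudes, each binarized into a code.
    """
    center = patch[1, 1]
    # The 8 neighbors in a fixed clockwise order starting at the top-left.
    neighbors = np.array([patch[0, 0], patch[0, 1], patch[0, 2],
                          patch[1, 2], patch[2, 2], patch[2, 1],
                          patch[2, 0], patch[1, 0]], dtype=float)
    d = neighbors - center
    s_bits = (d >= 0).astype(int)                          # CLBP_S: classic LBP
    m_bits = (np.abs(d) >= np.abs(d).mean()).astype(int)   # CLBP_M (local-mean threshold)
    to_code = lambda bits: int(sum(b << i for i, b in enumerate(bits)))
    return to_code(s_bits), to_code(m_bits)
```

CLBP_C additionally binarizes the center pixel against the global mean, and the three codes are fused into joint histograms for classification.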
The document summarizes digital image processing. It discusses the origins of digital images in the 1920s for transmitting newspaper photographs via transatlantic cable. The field grew with early computer processing of images from space missions in the 1960s. The document outlines fundamental steps in digital image processing, including image acquisition, enhancement, restoration, color processing, wavelets, compression, morphology, segmentation, and representation. It provides an abstract for a seminar report on digital image processing.
This document summarizes a presentation on augmented reality. It discusses what augmented reality is and the components and technologies used, including displays, tracking and input devices. Examples of medical, manufacturing, education and military applications are provided. Recent innovations in augmented reality apps and future developments like AR glasses are outlined. The educational benefits of augmented reality are explained. In conclusion, while augmented reality is still developing, applications in areas like aircraft manufacturing show promise.
The document discusses using facial recognition for attendance tracking in a school setting. It proposes developing a system that uses real-time face detection and Principal Component Analysis to match detected faces to staff members and automatically record their attendance. This would eliminate the manual and time-consuming process of logging attendance. The system would enroll staff faces during a one-time process and then identify and update their attendance in a database system in real-time. Research shows this type of automatic attendance tracking outperforms manual systems and provides more efficient leave and interface management.
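The PCA-based matching the proposal describes follows the classic eigenfaces recipe: project enrolled face images onto a small set of principal components and identify a new face by its nearest neighbor in that subspace. A minimal NumPy sketch (the names and shapes here are illustrative, not the proposal's actual implementation):

```python
import numpy as np

def train_pca(faces, n_components):
    """Eigenfaces enrollment: PCA basis from flattened face images (one per row)."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Principal directions are the top right singular vectors of the data.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_components]
    enrolled = centered @ basis.T          # low-dimensional codes of enrolled faces
    return mean, basis, enrolled

def identify(face, mean, basis, enrolled, names):
    """Return the name of the nearest enrolled face in PCA space."""
    proj = (face - mean) @ basis.T
    dists = np.linalg.norm(enrolled - proj, axis=1)
    return names[int(np.argmin(dists))]
```

In a real attendance system the identified name would then trigger a database update; a distance threshold is also needed to reject unknown faces.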
The document discusses augmented reality (AR) and its potential applications. It begins by defining AR as enhancing one's current perception of reality by overlaying digital information. The technology aims to seamlessly blend virtual objects with the real world by tracking a user's movements and positioning graphics accordingly. Some key points:
- AR is still in the early research phase but may become widely available by the next decade in the form of glasses.
- It has applications in education, gaming, military, and more by providing contextual information about one's surroundings.
- The main components of an AR system are head-mounted displays, tracking systems, and mobile computing power.
- There are two main types of head-mounted displays.
This document discusses augmented reality technology and visual tracking methods. It covers how humans perceive reality through their senses like sight, hearing, touch, etc. and how virtual reality systems use input and output devices. There are different types of visual tracking including marker-based tracking using artificial markers, markerless tracking using natural features, and simultaneous localization and mapping which builds a model of the environment while tracking. Common tracking technologies involve optical, magnetic, ultrasonic, and inertial sensors. Optical tracking in augmented reality uses computer vision techniques like feature detection and matching.
COMP4010 Lecture 4 - VR Technology - Visual and Haptic Displays. Lecture about VR visual and haptic display technology. Taught on August 16th 2016 by Mark Billinghurst from the University of South Australia
Attendance system based on face recognition using python - Raihan Sikdar
The document discusses face recognition technology for use in an automatic attendance system. It first defines biometrics and face recognition, explaining that face recognition identifies individuals using facial features. It then covers how face recognition systems work by detecting nodal points on faces to create unique face prints. The document proposes using such a system to take student attendance in online classes during the pandemic, noting advantages like ease of use, increased security, and cost effectiveness. It provides examples of how the system would capture images, analyze features, and recognize enrolled students to record attendance automatically.
Lec13: Clustering Based Medical Image Segmentation Methods - Ulaş Bağcı
Clustering
– K-means
– FCM (fuzzy c-means)
– SMC (simple membership based clustering)
– AP (affinity propagation)
– FLAB (fuzzy locally adaptive Bayesian)
– Spectral clustering methods
Shape Modeling
– M-reps
– Active Shape Models (ASM)
– Oriented Active Shape Models (OASM)
– Application in anatomy recognition and segmentation
– Comparison of ASM and OASM
Active Contour (Snake) • Level Set • Applications
Enhancement, Noise Reduction, and Signal Processing • Medical Image Registration • Medical Image Segmentation • Medical Image Visualization • Machine Learning in Medical Imaging • Shape Modeling/Analysis of Medical Images • Deep Learning in Radiology
Fuzzy Connectivity (FC)
– Affinity functions
• Absolute FC
• Relative FC (and Iterative Relative FC)
• Successful example applications of FC in medical imaging
• Segmentation of airway and airway walls using an RFC-based method
Energy functional – data and smoothness terms
• Graph Cut (min cut / max flow)
• Applications in radiology images
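The simplest method in the clustering family above, k-means on pixel intensities, can be sketched in a few lines. This is only an illustration of the idea; the lecture's clinical methods add spatial regularization, fuzzy memberships (FCM), and more:

```python
import numpy as np

def kmeans_segment(image, k=2, iters=20):
    """Segment an image by clustering pixel intensities with k-means.

    Centers are initialized at evenly spaced intensity quantiles, then
    refined by alternating nearest-center assignment and mean updates.
    """
    pix = image.reshape(-1).astype(float)
    centers = np.quantile(pix, np.linspace(0, 1, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(pix[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            members = pix[labels == c]
            if members.size:
                centers[c] = members.mean()
    return labels.reshape(image.shape), centers
```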
Semantic Segmentation on Satellite Imagery - RAHUL BHOJWANI
This is an image semantic segmentation project targeting satellite imagery. The goal was to predict a pixel-wise segmentation map for various objects in satellite imagery, including buildings, water bodies, roads, etc. The data was taken from the Kaggle competition <https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection>.
We implemented FCN, U-Net and Segnet Deep learning architectures for this task.
The document discusses human action recognition using spatio-temporal features. It proposes using optical flow and shape-based features to form motion descriptors, which are then classified using AdaBoost. Targets are localized using background subtraction. Optical flows within localized regions are organized into a histogram to describe motion. Differential shape information is also captured. The descriptors are used to train a strong classifier with AdaBoost that can recognize actions in test videos.
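The histogram-of-flow idea described above can be sketched as binning flow vectors by direction, weighted by magnitude. This is a generic motion descriptor of that kind; the document's exact binning and weighting scheme may differ:

```python
import math

def flow_histogram(flows, bins=8):
    """Magnitude-weighted orientation histogram of optical-flow vectors.

    flows: iterable of (dx, dy) displacements within a localized region.
    Returns a normalized histogram over `bins` direction sectors.
    """
    hist = [0.0] * bins
    for dx, dy in flows:
        mag = math.hypot(dx, dy)
        if mag == 0:
            continue  # zero flow carries no direction
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[int(angle / (2 * math.pi) * bins) % bins] += mag
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

Descriptors like this, computed per frame or per region, are what the AdaBoost stage would consume as weak-learner features.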
Computer vision is the study of analyzing images and videos to understand and interpret visual content. It involves developing techniques to achieve tasks like object detection and recognition. Computer vision has many applications including optical character recognition, face detection, 3D modeling, robotics, medical imaging, and self-driving cars. OpenCV is a popular open source library for computer vision that contains over 2500 optimized algorithms and supports languages like C++, Python, and Java.
The document describes the Marching Cubes algorithm, which was developed in 1987 to construct 3D models from medical imaging data like CT scans. It works by dividing the volume into cubes and using the pixel values at the cube vertices to determine triangles that approximate the surface. There are 256 possible cases but they can be reduced to 14 basic patterns. The algorithm calculates surface normals to improve image quality and has been used to generate 3D models from various medical imaging modalities.
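The cube classification step described above is an 8-bit index: each of the cube's eight corners contributes one bit depending on whether its sampled value lies inside the surface. A minimal sketch of just that step (the triangle lookup table it indexes into is omitted):

```python
def cube_index(corner_values, isolevel):
    """Classify a cube into one of the 256 marching-cubes cases.

    corner_values: the 8 scalar samples at the cube's vertices, in a fixed
    vertex order. Each vertex below the isolevel (inside the surface) sets
    one bit; the resulting index selects a triangle pattern from a
    precomputed lookup table.
    """
    index = 0
    for bit, v in enumerate(corner_values):
        if v < isolevel:
            index |= 1 << bit
    return index
```

Indices 0 and 255 (all corners outside or all inside) produce no triangles; every other case emits a small, table-driven set of triangles whose vertices are interpolated along the crossed edges.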
18th BOAZ Big Data Conference - [Tracking 24 Hours]: Development of a User Tracking System for Fully Automated Payment in Unmanned Stores - BOAZ Bigdata
The Tracking 24 Hours team carried out the following data analysis project:
Development of a user tracking system for fully automated payment in unmanned stores
19th cohort: Min Kyung-won, Korea University, Department of Industrial and Management Engineering
19th cohort: Shin Jae-wook, Yonsei University, Department of Industrial Engineering
19th cohort: Lee Yu-bin, Seoul Women's University, Department of Software Convergence
19th cohort: Choi Ga-hee, Kongju National University, Department of Industrial and Systems Engineering
Silhouette analysis based action recognition via exploiting human poses - AVVENIRE TECHNOLOGIES
We propose a novel scheme for human action recognition that combines the advantages of both local and global representations.
We explore human silhouettes for human action representation by taking into account the correlation between sequential poses in an action.
HUMAN ACTION RECOGNITION IN VIDEOS USING STABLE FEATURES - sipij
Human action recognition is still a challenging problem, and researchers are investigating it using different techniques. We propose a robust approach for human action recognition, achieved by extracting stable spatio-temporal features in the form of a pairwise local binary pattern (P-LBP) and the scale-invariant feature transform (SIFT). These features are used to train an MLP neural network during the training stage, and the action classes are inferred from the test videos during the testing stage. The proposed features capture both the motion of individuals and its consistency, yielding higher accuracy on a challenging benchmark dataset commonly used for human action recognition. In addition, we show that our approach outperforms the individual features, i.e., using only spatial or only temporal features.
The document discusses the fundamental steps in digital image processing, which include image acquisition, restoration, enhancement, representation and description, segmentation, recognition, morphological processing, color processing, compression, and problem domains. It provides examples and brief explanations of each step. For example, it states that image restoration aims to recover an degraded image using knowledge of the degradation process, while enhancement manipulates an image to be more suitable for a specific application based on human preferences. The document also presents a model of the image degradation and restoration process.
The document provides an overview of computer vision including:
- It defines computer vision as using observed image data to infer something about the world.
- It briefly discusses the history of computer vision from early projects in 1966 to David Marr establishing the foundations of modern computer vision in the 1970s.
- It lists several related fields that computer vision draws from including artificial intelligence, information engineering, neurobiology, solid-state physics, and signal processing.
- It provides examples of applications of computer vision such as self-driving vehicles, facial recognition, augmented reality, and uses in smartphones, the web, VR/AR, medical imaging, and insurance.
This document provides an introduction to multiple object tracking (MOT). It discusses the goal of MOT as detecting and linking target objects across frames. It describes common MOT approaches including using boxes or masks to represent objects. The document also categorizes MOT based on factors like whether it tracks a single or multiple classes, in 2D or 3D, using a single or multiple cameras. It reviews old and new evaluation metrics for MOT and highlights state-of-the-art methods on various MOT datasets. In conclusion, it notes that while MOT research is interesting, standardized evaluation metrics and protocols still need improvement.
A completed modeling of local binary pattern operatorWin Yu
This document presents the completed local binary pattern (CLBP) operator for texture classification. CLBP generalizes and completes the local binary pattern (LBP) by using a local difference sign-magnitude transform to encode the missing texture information not captured by LBP. The CLBP operator fuses three codes - CLBP_C for the center pixel, CLBP_S for the signs of differences, and CLBP_M for the magnitudes. Experiments on the Outex texture database show CLBP achieves much better classification accuracy than LBP and other state-of-the-art methods.
The document summarizes digital image processing. It discusses the origins of digital images in the 1920s for transmitting newspaper photographs via transatlantic cable. The field grew with early computer processing of images from space missions in the 1960s. The document outlines fundamental steps in digital image processing, including image acquisition, enhancement, restoration, color processing, wavelets, compression, morphology, segmentation, and representation. It provides an abstract for a seminar report on digital image processing.
This document summarizes a presentation on augmented reality. It discusses what augmented reality is, the components and technologies used including displays, tracking and input devices. Examples of medical, manufacturing, education and military applications are provided. Recent innovations in augmented reality apps and future innovations like AR glasses are outlined. The educational benefits of augmented reality are explained. In conclusion, while augmented reality is still developing, applications in areas like aircraft manufacturing show promise.
The document discusses using facial recognition for attendance tracking in a school setting. It proposes developing a system that uses real-time face detection and Principal Component Analysis to match detected faces to staff members and automatically record their attendance. This would eliminate the manual and time-consuming process of logging attendance. The system would enroll staff faces during a one-time process and then identify and update their attendance in a database system in real-time. Research shows this type of automatic attendance tracking outperforms manual systems and provides more efficient leave and interface management.
The document discusses augmented reality (AR) and its potential applications. It begins by defining AR as enhancing one's current perception of reality by overlaying digital information. The technology aims to seamlessly blend virtual objects with the real world by tracking a user's movements and positioning graphics accordingly. Some key points:
- AR is still in the early research phase but may become widely available by the next decade in the form of glasses.
- It has applications in education, gaming, military, and more by providing contextual information about one's surroundings.
- The main components of an AR system are head-mounted displays, tracking systems, and mobile computing power.
- There are two main types of head-mounted
This document discusses augmented reality technology and visual tracking methods. It covers how humans perceive reality through their senses like sight, hearing, touch, etc. and how virtual reality systems use input and output devices. There are different types of visual tracking including marker-based tracking using artificial markers, markerless tracking using natural features, and simultaneous localization and mapping which builds a model of the environment while tracking. Common tracking technologies involve optical, magnetic, ultrasonic, and inertial sensors. Optical tracking in augmented reality uses computer vision techniques like feature detection and matching.
COMP4010 Lecture 4 - VR Technology - Visual and Haptic Displays. Lecture about VR visual and haptic display technology. Taught on August 16th 2016 by Mark Billinghurst from the University of South Australia
Attendance system based on face recognition using python by Raihan Sikdarraihansikdar
The document discusses face recognition technology for use in an automatic attendance system. It first defines biometrics and face recognition, explaining that face recognition identifies individuals using facial features. It then covers how face recognition systems work by detecting nodal points on faces to create unique face prints. The document proposes using such a system to take student attendance in online classes during the pandemic, noting advantages like ease of use, increased security, and cost effectiveness. It provides examples of how the system would capture images, analyze features, and recognize enrolled students to record attendance automatically.
Lec13: Clustering Based Medical Image Segmentation MethodsUlaş Bağcı
Clustering – K-means
– FCM (fuzzyc-means)
– SMC (simple membership based clustering) – AP(affinity propagation)
– FLAB(fuzzy locally adaptive Bayesian)
– Spectral Clustering Methods
ShapeModeling – M-reps – Active Shape Models (ASM) – Oriented Active Shape Models (OASM) – Application in anatomy recognition and segmentation – Comparison of ASM and OASM ActiveContour(Snake) • LevelSet • Applications Enhancement, Noise Reduction, and Signal Processing • MedicalImageRegistration • MedicalImageSegmentation • MedicalImageVisualization • Machine Learning in Medical Imaging • Shape Modeling/Analysis of Medical Images Deep Learning in Radiology Fuzzy Connectivity (FC) – Affinity functions • Absolute FC • Relative FC (and Iterative Relative FC) • Successful example applications of FC in medical imaging • Segmentation of Airway and Airway Walls using RFC based method Energy functional – Data and Smoothness terms • GraphCut – Min cut – Max Flow • ApplicationsinRadiologyImages
Semantic Segmentation on Satellite ImageryRAHUL BHOJWANI
This is an Image Semantic Segmentation project targeted on Satellite Imagery. The goal was to detect the pixel-wise segmentation map for various objects in Satellite Imagery including buildings, water bodies, roads etc. The data for this was taken from the Kaggle competition <https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection>.
We implemented FCN, U-Net and Segnet Deep learning architectures for this task.
The document discusses human action recognition using spatio-temporal features. It proposes using optical flow and shape-based features to form motion descriptors, which are then classified using Adaboost. Targets are localized using background subtraction. Optical flows within localized regions are organized into a histogram to describe motion. Differential shape information is also captured. The descriptors are used to train a strong classifier with Adaboost that can recognize actions in testing videos.
Computer vision is the study of analyzing images and videos to understand and interpret visual content. It involves developing techniques to achieve tasks like object detection and recognition. Computer vision has many applications including optical character recognition, face detection, 3D modeling, robotics, medical imaging, and self-driving cars. OpenCV is a popular open source library for computer vision that contains over 2500 optimized algorithms and supports languages like C++, Python, and Java.
The document describes the Marching Cubes algorithm, which was developed in 1987 to construct 3D models from medical imaging data like CT scans. It works by dividing the volume into cubes and using the pixel values at the cube vertices to determine triangles that approximate the surface. There are 256 possible cases but they can be reduced to 14 basic patterns. The algorithm calculates surface normals to improve image quality and has been used to generate 3D models from various medical imaging modalities.
제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [추적 24시] : 완전 자동결제를 위한 무인점포 이용자 Tracking System 개발BOAZ Bigdata
데이터 분석 프로젝트를 진행한 추적24시 팀에서는 아래와 같은 프로젝트를 진행했습니다.
완전 자동결제를 위한 무인점포 이용자 Tracking System 개발
19기 민경원 고려대학교 산업경영공학부
19기 신재욱 연세대학교 산업공학과
19기 이유빈 서울여자대학교 소프트웨어융합학과
19기 최가희 국립공주대학교 산업시스템공학과
Silhouette analysis based action recognition via exploiting human posesAVVENIRE TECHNOLOGIES
We propose a novel scheme for human action recognition that combines the advantages of both local and global representations.
We explore human silhouettes for human action representation by taking into account the correlation between sequential poses in an action.
HUMAN ACTION RECOGNITION IN VIDEOS USING STABLE FEATURES sipij
Human action recognition remains a challenging problem, and researchers are investigating it using different techniques. We propose a robust approach for human action recognition, achieved by extracting stable spatio-temporal features in the form of the pairwise local binary pattern (P-LBP) and the scale-invariant feature transform (SIFT). These features are used to train an MLP neural network during the training stage, and the action classes are inferred from the test videos during the testing stage. The proposed features match the motion of individuals well, and their consistency yields higher accuracy on a challenging dataset. The experimental evaluation is conducted on a benchmark dataset commonly used for human action recognition. In addition, we show that our approach outperforms the individual features, i.e., using only spatial or only temporal features.
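The classification stage of such a pipeline can be sketched roughly as follows; the P-LBP/SIFT feature extraction is omitted and replaced by synthetic descriptors, so this only illustrates the MLP train/infer split, not the paper's actual features:

```python
# Sketch: train an MLP on fixed-length per-video descriptors, then infer
# action labels. Synthetic data stands in for pooled P-LBP + SIFT features.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_videos, feat_dim, n_actions = 120, 64, 4

X = rng.normal(size=(n_videos, feat_dim))      # stand-in descriptors
y = rng.integers(0, n_actions, size=n_videos)  # action class per video
X += y[:, None] * 0.8                          # make classes separable

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]).shape)  # one predicted action per test video
```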
Deconstructing SfM-Net architecture and beyond
"SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates."
Alternative download:
https://www.dropbox.com/s/aezl7ro8sy2xq7j/sfm_net_v2.pdf?dl=0
Announcing the Final Examination of Mr. Paul Smith for the ... - butest
Mr. Paul Smith will defend his dissertation titled "Multizoom Activity Recognition Using Machine Learning" on November 21, 2005 at 10:00 am in room CSB 232. His dissertation presents a system for detecting events in video using a multiview approach to detect and track heads and hands across cameras. It then demonstrates a machine learning algorithm called TemporalBoost that can recognize events using activity features extracted from multiple zoom levels. Mr. Smith received his B.Sc. from the University of Central Florida in 2000 and M.Sc. from the same institution in 2002. His dissertation committee is co-chaired by Drs. Mubarak Shah and Niels da Vitoria Lobo.
Falling costs with rising quality via hardware innovations and deep learning.
Technical introduction for scanning technologies from Structure-from-Motion (SfM), Range sensing (e.g. Kinect and Matterport) to Laser scanning (e.g. LiDAR), and the associated traditional and deep learning-based processing techniques.
Note: due to the small font size and poor rendering by SlideShare, it is better to download the slides locally to your device.
Alternative download link for the PDF:
https://www.dropbox.com/s/eclyy45k3gz66ve/proptech_emergingScanningTech.pdf?dl=0
This document discusses Motaz El Saban's research experience and interests which focus on analyzing, modeling, learning from, and predicting digital media content such as text, images, and speech. Some key areas of research include real-time video stitching, annotating mobile videos, object and activity recognition from videos, and facial expression recognition using deep learning techniques. The document also outlines El Saban's educational background and provides an agenda for his upcoming presentation.
Integrated Hidden Markov Model and Kalman Filter for Online Object Tracking - ijsrd.com
A visual prior learned from generic real-world images can be used to represent the objects in a scene. Existing work presents an online tracking algorithm that transfers a visual prior learned offline to online object tracking: a complete dictionary representing the prior is learned from a collection of real-world images. This prior knowledge is generic, and the training image set contains no observations of the target object. The learned prior is transferred to construct the object representation using sparse coding and multiscale max pooling, and a linear classifier is learned online to distinguish the target from the background and to track appearance variations of both over time. Tracking is carried out within a Bayesian inference framework, with the learned classifier used to construct the observation model; a particle filter estimates the tracking result sequentially, but works inefficiently in noisy scenes, and time-shift variance is not well suited to tracking the target using prior information about the object's structure. We propose an HMM-based Kalman filter to improve online target tracking in noisy sequential image frames: the covariance vector is measured to identify noisy scenes, and discrete time steps are evaluated to separate the target object from the background. Experiments are conducted on challenging scene sequences, and the tracking algorithm is evaluated in terms of tracking success rate, center location error, number of scenes, learned object sizes, and tracking latency.
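A minimal 1-D constant-velocity Kalman filter, the standard building block such trackers extend, might look like the sketch below; the HMM-based noisy-scene detection is omitted, and all noise parameters are illustrative assumptions:

```python
# Minimal 1-D constant-velocity Kalman filter (illustrative sketch only).
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition: pos += vel
H = np.array([[1.0, 0.0]])              # we observe position only
Q = np.eye(2) * 1e-3                    # process noise (assumed)
R = np.array([[0.5]])                   # measurement noise (assumed)

x = np.array([0.0, 0.0])  # state: [position, velocity]
P = np.eye(2)             # state covariance

def kalman_step(x, P, z):
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with measurement z
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Track an object moving ~1 unit/frame with noisy position readings.
for z in [1.1, 1.9, 3.2, 4.0, 5.1]:
    x, P = kalman_step(x, P, np.array([z]))
print(round(float(x[0]), 1))  # estimated position, near the last reading
print(round(float(x[1]), 1))  # estimated velocity, near 1
```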
This document proposes methods for action recognition using context and appearance distribution features. It first extracts interest points from videos and represents each video using two types of distribution features: 1) multiple Gaussian mixture models (GMMs) that characterize the distribution of relative coordinates (context) between interest point pairs at different space-time scales, and 2) a GMM that models the distribution of local appearance features around interest points. It then introduces a new learning method called Multiple Kernel Learning with Augmented Features (AFMKL) to effectively fuse the two heterogeneous distribution features for classification. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed approach.
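The appearance-distribution idea can be sketched with scikit-learn's `GaussianMixture`; the descriptors below are synthetic stand-ins for local appearance features, and the AFMKL fusion step is not shown:

```python
# Sketch: fit a GMM over local descriptors extracted around interest
# points, then summarize a video by its average posterior responsibilities.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for local appearance descriptors (two clusters).
descriptors = np.vstack([
    rng.normal(loc=-2.0, size=(200, 8)),
    rng.normal(loc=+2.0, size=(200, 8)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(descriptors)
# Per-video representation: average soft-assignment over all descriptors.
video_feature = gmm.predict_proba(descriptors).mean(axis=0)
print(video_feature.shape)  # one weight per mixture component
```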
Sparse representation-based human action recognition using an action region-aware dictionary - Wesley De Neve
This document presents a paper on sparse representation-based human action recognition using an action region-aware dictionary. It introduces the challenges of existing action recognition methods, including the lack of a general action detection method and the varying usefulness of context information depending on the action. The paper proposes constructing a dictionary containing separate context and action region information from training videos. It then presents a method to use this dictionary to adaptively classify human actions based on whether context region information is concentrated in the true class. The paper describes experiments on the UCF Sports Action dataset to evaluate the proposed method compared to existing sparse representation approaches.
Automatic Foreground Object Detection Using Visual and Motion Saliency - IJERD Editor
This paper presents a saliency-based video object extraction (VOE) framework. The proposed framework aims to automatically extract foreground objects of interest without any user interaction or the use of any training data (i.e., not limited to any particular type of object). To separate foreground and background regions within and across video frames, the proposed method utilizes visual and motion saliency information extracted from the input video. A conditional random field is applied to effectively combine the saliency induced features, which allows us to deal with unknown pose and scale variations of the foreground object (and its articulated parts). Based on the ability to preserve both spatial continuity and temporal consistency in the proposed VOE framework, experiments on a variety of videos verify that our method is able to produce quantitatively and qualitatively satisfactory VOE results.
Image restoration techniques covered such as denoising, deblurring and super-resolution for 3D images and models.
From classical computer vision techniques to contemporary deep learning based processing for both ordered and unordered point clouds, depth maps and meshes.
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform - Fadwa Fouad
This document provides an overview of a Masters thesis that proposes algorithms for human action recognition. It begins with an introduction that discusses the importance of human action recognition, challenges in the field, and differences between actions and activities. It then presents an agenda that outlines an introduction, overview, and details of two proposed algorithms: 2DHOOF/2DPCA contour-based optical flow and human gesture recognition using Radon transform/2DPCA. The overview section describes the general structure of action recognition systems from video capture to classification. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed algorithms.
This document discusses action recognition in videos. It begins by defining action recognition and describing its applications such as surveillance, video search, and medical monitoring. Challenges of action recognition like scale variations, camera motion, and human pose differences are presented. The document reviews papers on local space-time features with SVMs and two-stream convolutional networks. It shows that local features combined with SVMs achieved the best results on a dataset of human actions. Two-stream ConvNets, which use spatial and temporal streams, became the state-of-the-art by capturing shape from frames and motion from optical flow. Future work may explore deeper ConvNets with larger datasets.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering & Science are selected through rigorous peer review to ensure originality, timeliness, relevance, and readability.
HiPEAC 2019 Workshop - Real-Time Modelling Visual Scenes with Biological Insp... - Tulipp.Eu
- Computer vision has improved with more data and processing power, but global scene understanding remains challenging.
- The document proposes a multidisciplinary approach combining CNNs and human visual cognition to better model scene understanding, with the goal of applications like autonomous vehicles.
- It describes experiments observing how humans and primates recognize scenes to inform modeling, incorporating global and local descriptors with relationships. This approach aims to advance scene understanding capabilities.
The document provides an introduction to computer vision. It discusses key topics including:
- What computer vision is and why it is useful. It uses mathematical and computational tools to extract information from images and improve human vision.
- Some basic concepts in computer vision including digital images, sampling, noise removal, segmentation, and feature extraction techniques.
- Where computer vision is used such as healthcare, autonomous vehicles, augmented/virtual reality, industry, social media, security, agriculture, and fashion.
- A brief history of computer vision including classical approaches and the revolution enabled by advances in artificial intelligence and deep learning.
This document discusses a real-time moving object detection algorithm using Speeded Up Robust Features (SURF). The algorithm first uses SURF to extract features from video frames and stitch them together to create a single background image. It then takes the difference between each video frame and the background image to detect moving objects. Experimental results show it can effectively detect moving objects in videos from both stable and moving cameras, outperforming traditional background subtraction techniques. The algorithm solves issues with modeling changing backgrounds in real-time by creating a static background image using frame registration with SURF.
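The detection step, differencing each frame against the composed background image, can be sketched as below; the SURF-based stitching that builds the background is omitted, and the threshold is an illustrative assumption:

```python
# Sketch: once a single background image exists, moving objects are found
# by thresholding the absolute per-pixel difference against each frame.
import numpy as np

background = np.zeros((40, 40), dtype=np.uint8)

frame = background.copy()
frame[10:20, 10:20] = 200  # a bright moving object enters the scene

diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
mask = (diff > 30).astype(np.uint8)  # threshold of 30 is a tunable guess

print(int(mask.sum()))  # number of pixels flagged as moving -> 100
```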
Similar to Action Genome: Action As Composition of Spatio Temporal Scene Graphs (20)
The document discusses recent developments in video transformers. It summarizes several recent works that employ spatial backbones like ViT or ResNet combined with temporal transformers for video classification. Examples mentioned include VTN, TimeSformer, STAM, and ViViT. The document also discusses common practices in video transformer inference, like using multiple clips/crops and averaging predictions. Design choices covered include number of frames, spatial dimensions, and multi-view inference techniques.
An Empirical Study of Training Self-Supervised Vision Transformers - Sangmin Woo
Chen, Xinlei, Saining Xie, and Kaiming He. "An empirical study of training self-supervised vision transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
This document discusses video grounding, which aims to localize moments in video corresponding to natural language descriptions. It defines related tasks like natural language video localization and video moment retrieval. It lists keywords for these tasks and provides examples of approaches, applications, and GitHub resources for temporal language grounding in videos.
This document summarizes several action recognition datasets for human activities. It describes both single-label datasets that classify entire videos, as well as multi-label datasets that temporally localize actions within videos. It also categorizes datasets as generic, instructional, ego-centric, compositional, multi-view, or multi-modal depending on the type of activities and data modalities included. Several prominent multi-modal datasets are highlighted, such as PKU-MMD, NTU RGB+D, MMAct, and HOMAGE, which provide video alongside additional modalities like depth, infrared, audio, and sensor data.
Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15750-15758).
[2020 ICLR] Reformer: The Efficient Transformer
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
[2020 NIPS] Big Bird: Transformers for Longer Sequences
[2021 ICLR] Rethinking Attention with Performers
Transformer Architectures in Vision
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Neural Motifs: Scene Graph Parsing with Global Context - Sangmin Woo
Zellers, Rowan, et al. "Neural Motifs: Scene Graph Parsing With Global Context." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Attentive Relational Networks for Mapping Images to Scene Graphs - Sangmin Woo
M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo: Attentive relational networks for mapping images to scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
1. 2020 IRRLAB Presentation
IRRLAB
Action Genome:
Actions as Compositions of Spatio-
temporal Scene Graphs
Sangmin Woo
2020.07.09
Jingwei Ji Ranjay Krishna Li Fei-Fei Juan Carlos Niebles
Stanford University
CVPR 2020
3. 3 / 21
Why Action Genome?
Scene Graph
• The scene graph representation provides a scaffold that allows vision models to tackle complex inference tasks by breaking scenes into their corresponding objects and visual relationships.
[1] Zellers, R., Yatskar, M., Thomson, S., & Choi, Y. (2018). Neural Motifs: Scene Graph Parsing with Global Context. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 5831-5840).
4. 4 / 21
Why Action Genome?
Versatility of Scene Graph
• A number of works have proposed deep network-based approaches for generating scene graphs, confirming their importance to the field
• e.g., Image retrieval, Visual Question Answering, Image Captioning, Image
Generation, Video Understanding
[2] Ashual, O., & Wolf, L. (2019). Specifying Object Attributes and Relations in Interactive Scene Generation. In Proceedings of the IEEE
International Conference on Computer Vision (pp. 4561-4569).
5. 5 / 21
Why Action Genome?
Scene Graph in Video?!
• However, decompositions of temporal events have not been explored much, even though representing events with structured representations could lead to more accurate and grounded action understanding.
• Meanwhile, people segment events into consistent groups and actively
encode those ongoing activities in a hierarchical part structure.
6. 6 / 21
Why Action Genome?
Benefits?
• Such decompositions can enable machines to predict future and past
scene graphs with objects and relationships as an action occurs
• we can predict that the person is about to sit on the sofa when we see
them move in front of it.
• Similarly, such decomposition can also enable machines to learn from
few examples
• we can recognize the same action when we see a different person
move towards a different chair.
7. 7 / 21
Why Action Genome?
Benefits?
• Okay, so such decompositions can provide the scaffolds to improve vision models. Then how can we correctly create representative hierarchies for a wide variety of complex actions?
8. 8 / 21
Why Action Genome?
• Action Genome!
9. 9 / 21
Action Genome
Action Genome
• A representation that decomposes actions into spatio-temporal scene
graphs.
• Object detection faced a similar challenge: large variation within any object category. Progress in 2D perception was catalyzed by taxonomies, partonomies, and ontologies.
• In the same way, Action Genome aims to improve temporal understanding through partonomy.
Data Statistics
• Built upon Charades, Action Genome provides
• 476K object bounding boxes of 35 object classes (excluding “person”)
• 1.72M relationships of 25 relationship classes
• 234K video frames
• 157 action categories
10. 10/ 21
Action Genome
Charades vs. Action Genome
• Charades provides an injective mapping from each action to a verb
• Charades' verbs are clip-level labels, such as "awaken"
• Action Genome decomposes them into frame-level human-object relationships, such as the sequence <person – lying on – bed>, <person – sitting on – bed>, and <person – not contacting – bed>.
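This decomposition can be made concrete with a toy data structure; the listing below is purely illustrative and is not Action Genome's actual annotation format:

```python
# Illustrative only: a clip-level Charades verb ("awaken") decomposed into
# the kind of frame-level <person - relationship - object> triplets that
# Action Genome annotates, one scene graph per sampled frame.
awaken_clip = [
    [("person", "lying on", "bed")],
    [("person", "sitting on", "bed")],
    [("person", "not contacting", "bed")],
]

# The action is recoverable from the *sequence* of relationships,
# not from any single frame.
relationships = [t[1] for frame in awaken_clip for t in frame]
print(relationships)
```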
11. 11 / 21
Action Genome
Action Genome
• Most perspectives on action decomposition converge on the prototypical
unit of action-object couplets.
• Action-object couplets refer to transitive actions performed on objects
(e.g. “moving a chair” or “throwing a ball”) and intransitive self-actions
(e.g. “moving towards the sofa”).
• Action Genome’s dynamic scene graph representations capture both such
types of events and as such, represent the prototypical unit.
• This representation enables the study for tasks such as:
• Spatio-temporal scene graph prediction
• Improving action recognition
• Few-shot action detection
12. 12/ 21
Action Genome
Three types of relationships in Action Genome
• Attention: which objects people are looking at.
• Spatial: how objects are laid out spatially.
• Contact: semantic relationships involving people manipulating objects.
13. 13/ 21
Action Genome
Action Genome’s annotation pipeline
• For every action, 5 frames are uniformly sampled across the action.
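Uniform sampling of 5 frames across an action's temporal span can be sketched as below (the frame indices are illustrative; Action Genome samples within each annotated action interval):

```python
# Sketch: pick k frame indices spread evenly over an action's [start, end].
import numpy as np

def sample_frames(start, end, k=5):
    return np.linspace(start, end, k).round().astype(int).tolist()

print(sample_frames(0, 100))  # [0, 25, 50, 75, 100]
```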
14. 14/ 21
Method (for Action Recognition)
Action Genome for Action Recognition
• Scene Graph Feature Banks (SGFB): accumulate the features of a long video into a fixed-size representation for action recognition.
1. SGFB generates a spatio-temporal scene graph for every frame.
2. The scene graphs are then encoded to form a feature bank.
3. The feature bank is combined with 3D CNN features using feature bank operators (FBO), which can be instantiated as mean/max pooling or non-local blocks.
4. The final aggregated feature is used to predict action labels for the video.
[3] Wu, C. Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., & Girshick, R. (2019). Long-Term Feature Banks for Detailed Video
Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 284-293).
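The four steps above can be sketched numerically with mean pooling as the FBO; all dimensions and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture:

```python
# Numerical sketch of the SGFB pipeline with mean pooling as the FBO.
import numpy as np

rng = np.random.default_rng(0)
T, d_sg, d_3d, n_actions = 32, 128, 256, 157  # assumed shapes

# 1-2. One scene-graph feature per frame, stacked into a feature bank.
scene_graph_bank = rng.normal(size=(T, d_sg))

# 3. FBO: mean pooling over time, then combine with a clip-level 3D CNN
#    feature (non-local blocks are the paper's stronger alternative).
pooled = scene_graph_bank.mean(axis=0)          # (d_sg,)
clip_feature = rng.normal(size=(d_3d,))         # from the 3D CNN backbone
fused = np.concatenate([pooled, clip_feature])  # (d_sg + d_3d,)

# 4. A linear head predicts multi-label action scores for the video.
W = rng.normal(size=(n_actions, fused.shape[0]))
scores = W @ fused
print(scores.shape)  # one score per Charades action class
```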
16. 16/ 21
Experiments
Action Recognition on Charades
• The Charades dataset contains 9,848 crowdsourced videos with an average length of 30 seconds.
• SGFB outperformed all existing methods when simultaneously predicting scene graphs and performing action recognition (44.3% mAP).
• Utilizing ground-truth scene graphs (Oracle) significantly boosts performance (60.3% mAP).
17. 17/ 21
Experiments
Few-shot action recognition
• Intuitively, predicting actions should be easier (and generalize better to rare actions) from a symbolic embedding of scene graphs than from pixels.
1) The 157 action classes are split into a base set of 137 classes and a novel set of 20 classes.
2) A backbone feature extractor (R101-I3D-NL) is first trained on all video examples of the base classes; it is shared by the baseline LFB, SGFB, and the SGFB oracle.
3) Next, each model is trained with only k examples from each novel class, where k = 1, 5, 10, for 50 epochs.
4) Finally, the models are evaluated on all examples of the novel classes in the Charades validation set.
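The base/novel split in step 1 can be sketched as below; which classes land in the novel set is arbitrary here, unlike the fixed split used in the benchmark:

```python
# Sketch of the 137/20 base/novel class split (class ids are placeholders).
import numpy as np

rng = np.random.default_rng(0)
classes = np.arange(157)                         # Charades action classes
novel = rng.choice(classes, size=20, replace=False)
base = np.setdiff1d(classes, novel)              # the remaining 137

print(len(base), len(novel))  # 137 20
```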
18. 18/ 21
Experiments
Spatio-temporal scene graph prediction
• Unlike image-based scene graph prediction, which takes only a single image as input, this task expects a video as input and can therefore utilize temporal information from neighboring frames to strengthen its predictions.
• There is significant room for improvement, especially since these existing
methods were designed to be conditioned on a single frame and do not
consider the entire video sequence as a whole.
19. 19/ 21
Experiments
Summary
1. Predicting scene graphs can benefit action recognition
• Improving the state-of-the-art on the Charades dataset from 42.5% to
44.3% and to 60.3% when using oracle scene graphs.
2. Compositional understanding of actions induces better generalization.
• In few-shot action recognition, it achieves 42.7% mAP using as few
as 10 training examples.
3. In new task of spatio-temporal scene graph prediction, there is significant
room for improvement
• Existing methods were designed to be conditioned on a single frame
and do not consider the entire video sequence as a whole.
20. 20/ 21
Future Work
Future Research Directions
• Spatio-temporal action localization
• The majority of action localization methods focus on localizing the person
performing the action but ignore the objects.
• Action Genome can enable research on localization of both actors and objects.
• Explainable action models
• Action Genome provides frame-level labels of attention in the form of objects
that a person performing the action is either looking at or interacting with.
• These labels can be used to further train explainable models.
• Video generation from spatio-temporal scene graphs
Let’s consider the action of “sitting on a sofa”. The person initially starts off next to the sofa, moves in front of it, and finally sits atop it.
There are three types of relationships in Action Genome: attention relationships report which objects people are looking at, spatial relationships indicate how objects are laid out spatially, and contact relationships are semantic relationships involving people manipulating objects.
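The "sitting on a sofa" example and the three relationship types can be captured in a small data structure. The classes and names below are illustrative, not Action Genome's actual annotation schema:

```python
from dataclasses import dataclass, field

# The three relationship types described above.
ATTENTION, SPATIAL, CONTACT = "attention", "spatial", "contact"

@dataclass
class Relationship:
    kind: str        # one of ATTENTION, SPATIAL, CONTACT
    predicate: str   # e.g. "looking_at", "in_front_of", "sitting_on"
    obj: str         # the object the person relates to

@dataclass
class FrameSceneGraph:
    frame_idx: int
    relationships: list = field(default_factory=list)

# "sitting on a sofa": the person looks at the sofa, moves in front of it,
# and finally sits atop it — one scene graph per sampled frame.
frames = [
    FrameSceneGraph(0, [Relationship(ATTENTION, "looking_at", "sofa")]),
    FrameSceneGraph(1, [Relationship(SPATIAL, "in_front_of", "sofa")]),
    FrameSceneGraph(2, [Relationship(CONTACT, "sitting_on", "sofa")]),
]
```

Representing an action as this sequence of typed person-object relationships is what makes the compositional view of actions explicit.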
Action Genome’s annotation pipeline: For every action, we uniformly sample 5 frames across the action and annotate the person performing the action along with the objects they interact with. We also annotate the pairwise relationships between the person and those objects. Here, we show a video with 4 actions labelled, resulting in 20 (= 4 × 5) frames annotated with scene graphs. The objects are grounded back in the video as bounding boxes.
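The uniform sampling in the pipeline above amounts to picking evenly spaced frame indices across each action interval. A minimal sketch (function name is hypothetical):

```python
def uniform_sample(start, end, n=5):
    """Uniformly sample n frame indices across an action interval [start, end]."""
    if n == 1:
        return [start]
    step = (end - start) / (n - 1)  # include both endpoints
    return [round(start + i * step) for i in range(n)]

# an action spanning frames 10..90 yields 5 evenly spaced frames
uniform_sample(10, 90)  # [10, 30, 50, 70, 90]
```

With 4 labelled actions this yields the 20 annotated frames mentioned in the caption.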
Such a boost in performance shows the potential of Action Genome and compositional action understanding when video-based scene graph models are utilized to improve scene graph prediction.
When trained with very few examples, compositional action understanding with additional knowledge of scene graphs should outperform methods that treat actions as a monolithic concept.
The small gap in performance between the task of PredCls and SGCls suggests that these models suffer from not being able to accurately detect the objects in the video frames. Improving object detectors designed specifically for videos could improve performance.
Amongst numerous techniques, saliency prediction has emerged as a key mechanism for interpreting machine learning models.