2020 IRRLAB Presentation
IRRLAB
Action Genome:
Actions as Compositions of Spatio-temporal Scene Graphs
Sangmin Woo
2020.07.09
Jingwei Ji, Ranjay Krishna, Li Fei-Fei, Juan Carlos Niebles
Stanford University
CVPR 2020
Contents
 Why Action Genome?
 Action Genome
 Method
 Experiments
 Future Work
Why Action Genome?
 Scene Graph
• The scene graph representation provides a scaffold that allows vision models to tackle complex inference tasks by breaking scenes into their corresponding objects and visual relationships.
[1] Zellers, R., Yatskar, M., Thomson, S., & Choi, Y. (2018). Neural Motifs: Scene Graph Parsing with Global Context. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 5831-5840).
Why Action Genome?
 Versatility of Scene Graph
• A number of works have proposed deep network-based approaches for generating scene graphs, confirming their importance to the field
• e.g., Image retrieval, Visual Question Answering, Image Captioning, Image
Generation, Video Understanding
[2] Ashual, O., & Wolf, L. (2019). Specifying Object Attributes and Relations in Interactive Scene Generation. In Proceedings of the IEEE
International Conference on Computer Vision (pp. 4561-4569).
Why Action Genome?
 Scene Graph in Video?!
• However, decompositions of temporal events have not been explored much, even though representing events with structured representations could lead to more accurate and grounded action understanding.
• Meanwhile, people segment events into consistent groups and actively
encode those ongoing activities in a hierarchical part structure.
Why Action Genome?
 Benefits?
• Such decompositions can enable machines to predict future and past scene graphs with objects and relationships as an action occurs
• e.g., we can predict that the person is about to sit on the sofa when we see them move in front of it.
• Similarly, such decompositions can also enable machines to learn from few examples
• e.g., we can recognize the same action when we see a different person move towards a different chair.
• Okay… such decompositions can provide the scaffolds to improve vision models. Then, how can we correctly create representative hierarchies for such a wide variety of complex actions?
• Action Genome!
Action Genome
 Action Genome
• A representation that decomposes actions into spatio-temporal scene
graphs.
• Object detection faced a similar challenge of large variation within any object category, and progress in 2D perception was catalyzed by taxonomies, partonomies, and ontologies.
• In the same spirit, Action Genome aims to improve temporal understanding through a partonomy of actions.
 Data Statistics
• Built upon Charades, Action Genome provides
• 476K object bounding boxes of 35 object classes (excluding “person”)
• 1.72M relationships of 25 relationship classes
• 234K video frames
• 157 action categories
Action Genome
 Charades vs. Action Genome
• Charades provides an injective mapping from each action to a verb
• Charades’ verbs are clip-level labels, such as “awaken”
• Action Genome decomposes them into frame-level human-object relationships, such as the sequence <person – lying on – bed>, <person – sitting on – bed>, and <person – not contacting – bed> (see the sketch below).
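A hypothetical Python sketch of this frame-level decomposition; the class, field names, and frame indices are illustrative assumptions rather than Action Genome's actual annotation format:

```python
# Hypothetical sketch: representing the "awaken" clip as a sequence of
# frame-level <person - relationship - bed> triplets. The dataclass and the
# frame indices are illustrative assumptions, not the dataset's file format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FrameSceneGraph:
    frame_idx: int                        # index of the sampled frame in the video
    triplets: List[Tuple[str, str, str]]  # (subject, relationship, object)

awaken = [
    FrameSceneGraph(12, [("person", "lying on", "bed")]),
    FrameSceneGraph(45, [("person", "sitting on", "bed")]),
    FrameSceneGraph(78, [("person", "not contacting", "bed")]),
]
```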
Action Genome
 Action Genome
• Most perspectives on action decomposition converge on the prototypical
unit of action-object couplets.
• Action-object couplets refer to transitive actions performed on objects
(e.g. “moving a chair” or “throwing a ball”) and intransitive self-actions
(e.g. “moving towards the sofa”).
• Action Genome’s dynamic scene graph representations capture both types of events and thus represent this prototypical unit.
• This representation enables the study of tasks such as:
• Spatio-temporal scene graph prediction
• Improving action recognition
• Few-shot action detection
Action Genome
 Three types of relationships in Action Genome
• Attention: which objects people are looking at
• Spatial: how objects are laid out spatially
• Contact: semantic relationships involving people manipulating objects (a small illustrative grouping follows below)
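As a rough illustration, the three relationship types could be organized as a simple lookup; only the three type names come from the paper, while the example predicates below are a hypothetical subset of the 25 relationship classes:

```python
# Hypothetical grouping of relationship predicates into Action Genome's three
# types; the example predicates are illustrative, not the full 25-class set.
RELATIONSHIP_TYPES = {
    "attention": ["looking at", "not looking at"],
    "spatial": ["in front of", "behind", "on the side of"],
    "contact": ["lying on", "sitting on", "holding", "not contacting"],
}

def relationship_type(predicate: str) -> str:
    """Return which of the three relationship types a predicate belongs to."""
    for rel_type, predicates in RELATIONSHIP_TYPES.items():
        if predicate in predicates:
            return rel_type
    raise KeyError(f"unknown predicate: {predicate}")

print(relationship_type("sitting on"))  # -> "contact"
```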
Action Genome
 Action Genome’s annotation pipeline
• For every action, 5 frames are uniformly sampled across the action interval and annotated with the person, the interacted objects, and their pairwise relationships (see the sampling sketch below).
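A minimal sketch of this uniform sampling step, assuming the action's start and end frames are known; the helper name and example interval are made up for illustration:

```python
# Minimal sketch: uniformly sample 5 frame indices across an action interval.
import numpy as np

def sample_frames(start_frame: int, end_frame: int, num_samples: int = 5) -> list:
    """Return num_samples frame indices spread evenly over [start_frame, end_frame]."""
    return np.linspace(start_frame, end_frame, num_samples).round().astype(int).tolist()

print(sample_frames(30, 180))  # e.g. -> [30, 68, 105, 142, 180]
```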
Method (for Action Recognition)
 Action Genome for Action Recognition
• Scene Graph Feature Bank (SGFB): accumulates features of a long video into a fixed-size representation for action recognition (a minimal sketch follows after the reference below).
1. SGFB generates a spatio-temporal scene graph for every frame.
2. Scene graphs are then encoded to form a feature bank.
3. The feature bank is combined with 3D CNN features using feature bank operators (FBO), which can be instantiated as mean/max pooling or non-local blocks.
4. The final aggregated feature is used to predict action labels for the video.
[3] Wu, C. Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., & Girshick, R. (2019). Long-Term Feature Banks for Detailed Video
Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 284-293).
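A simplified, hedged sketch of steps 2-4 above, assuming per-frame scene graphs have already been predicted; the tensor shapes, the flattened object-relationship encoding, and the mean-pooling feature bank operator are illustrative choices (the paper also instantiates the FBO as non-local blocks):

```python
# Simplified SGFB sketch (not the authors' implementation): encode per-frame
# scene graphs into a feature bank, pool it with a feature bank operator (here,
# mean pooling), fuse with a 3D CNN clip feature, and predict action logits.
import torch
import torch.nn as nn

NUM_OBJ, NUM_REL, NUM_ACTIONS, D3D = 35, 25, 157, 2048  # assumed dimensions

class SGFBHead(nn.Module):
    def __init__(self):
        super().__init__()
        # project each frame's scene-graph encoding to the 3D CNN feature size
        self.sg_proj = nn.Linear(NUM_OBJ * NUM_REL, D3D)
        self.classifier = nn.Linear(2 * D3D, NUM_ACTIONS)

    def forward(self, sg_maps: torch.Tensor, clip_feat: torch.Tensor) -> torch.Tensor:
        # sg_maps: (T, NUM_OBJ, NUM_REL) per-frame object-relationship confidences
        # clip_feat: (D3D,) 3D CNN feature of the whole clip
        bank = self.sg_proj(sg_maps.flatten(1))   # (T, D3D) scene graph feature bank
        pooled = bank.mean(dim=0)                 # FBO instantiated as mean pooling
        fused = torch.cat([pooled, clip_feat], dim=-1)
        return self.classifier(fused)             # multi-label action logits

head = SGFBHead()
logits = head(torch.rand(32, NUM_OBJ, NUM_REL), torch.rand(D3D))  # 32 sampled frames
```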
Method (for Action Recognition)
 Action Genome for Action Recognition
Experiments
 Action Recognition on Charades
• The Charades dataset contains 9,848 crowdsourced videos averaging 30 seconds in length.
• SGFB outperforms all existing methods by simultaneously predicting scene graphs while performing action recognition (44.3% mAP).
• Utilizing ground-truth scene graphs (Oracle) boosts performance significantly further (60.3% mAP).
Experiments
 Few-shot action recognition
• Intuitively, predicting actions from a symbolic embedding of scene graphs should be easier (and generalize better to rare actions) than predicting them from raw pixels.
1) 157 action classes are split into a base set of 137 classes and a novel set
of 20 classes.
2) A backbone feature extractor (R101-I3D-NL) is first trained on all video
examples of the base classes, which is shared by the baseline LFB, SGFB,
and SGFB oracle.
3) Next, each model is trained with only k examples from each novel class, where k = 1, 5, 10, for 50 epochs.
4) Finally, models are evaluated on all examples of the novel classes in the Charades validation set (a minimal sketch of this split follows below).
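A hedged sketch of this split-and-sample protocol; the function, its arguments, and the random seed are assumptions for illustration, not the released evaluation code:

```python
# Illustrative few-shot split: hold out novel classes and keep only k training
# examples per novel class. This is a sketch of the protocol, not the paper's code.
import random

def few_shot_split(all_classes, videos_by_class, num_novel=20, k=10, seed=0):
    """Return (base classes, novel classes, k training videos per novel class)."""
    rng = random.Random(seed)
    novel = set(rng.sample(sorted(all_classes), num_novel))
    base = set(all_classes) - novel
    novel_train = {
        c: rng.sample(videos_by_class[c], min(k, len(videos_by_class[c])))
        for c in novel
    }
    return base, novel, novel_train
```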
Experiments
 Spatio-temporal scene graph prediction
• Unlike image-based scene graph prediction, which has only a single image as input, this task takes a video as input and can therefore utilize temporal information from neighboring frames to strengthen its predictions (one simple possibility is sketched below).
• There is significant room for improvement, especially since the existing methods evaluated were designed to be conditioned on a single frame and do not consider the entire video sequence as a whole.
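One simple way a model could exploit neighboring frames is to smooth per-frame relationship scores over a temporal window before the final prediction; this is an illustrative baseline, not the method evaluated in the paper:

```python
# Illustrative temporal smoothing of per-frame relationship scores (T, C):
# each frame's scores are averaged with its temporal neighbors.
import torch
import torch.nn.functional as F

def temporally_smooth(scores: torch.Tensor, window: int = 5) -> torch.Tensor:
    """scores: (T, C) per-frame logits; returns a (T, C) smoothed version."""
    pad = window // 2
    x = scores.t().unsqueeze(0)                        # (1, C, T)
    x = F.pad(x, (pad, pad), mode="replicate")         # pad along the time axis
    x = F.avg_pool1d(x, kernel_size=window, stride=1)  # sliding-window average
    return x.squeeze(0).t()                            # back to (T, C)
```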
Experiments
 Summary
1. Predicting scene graphs can benefit action recognition
• SGFB improves the state of the art on the Charades dataset from 42.5% to 44.3% mAP, and to 60.3% with oracle scene graphs.
2. Compositional understanding of actions induces better generalization.
• In few-shot action recognition, it achieves 42.7% mAP using as few
as 10 training examples.
3. In the new task of spatio-temporal scene graph prediction, there is significant room for improvement
• Existing methods were designed to be conditioned on a single frame and do not consider the entire video sequence as a whole.
Future Work
 Future Research Directions
• Spatio-temporal action localization
• The majority of action localization methods focus on localizing the person
performing the action but ignore the objects.
• Action Genome can enable research on localization of both actors and objects.
• Explainable action models
• Action Genome provides frame-level labels of attention in the form of objects
that a person performing the action is either looking at or interacting with.
• These labels can be used to further train explainable models.
• Video generation from spatio-temporal scene graphs
Thank You
2020 IRRLAB Presentation
IRRLAB
shmwoo9395@{gist.ac.kr, gmail.com}
Editor's Notes

  1. Let’s consider the action of “sitting on a sofa”. The person initially starts off next to the sofa, moves in front of it, and finally sits atop it.
  2. There are three types of relationships in Action Genome: attention relationships report which objects people are looking at, spatial relationships indicate how objects are laid out spatially, and contact relationships are semantic relationships involving people manipulating objects.
  3. There are three types of relationships in Action Genome: attention relationships report which objects people are looking at, spatial relationships indicate how objects are laid out spatially, and contact relationships are semantic relationships involving people manipulating objects.
  4. Action Genome’s annotation pipeline: For every action, we uniformly sample 5 frames across the action and annotate the person performing the action along with the objects they interact with. We also annotate the pairwise relationships between the person and those objects. Here, we show a video with 4 actions labelled, resulting in 20 (= 4 × 5) frames annotated with scene graphs. The objects are grounded back in the video as bounding boxes.
  5. There are three types of relationships in Action Genome: attention relationships report which objects people are looking at, spatial relationships indicate how objects are laid out spatially, and contact relationships are semantic relationships involving people manipulating objects.
  6. Such a boost in performance shows the potential of Action Genome and compositional action understanding when video-based scene graph models are utilized to improve scene graph prediction.
  7. When trained with very few examples, compositional action understanding with additional knowledge of scene graphs should outperform methods that treat actions as a monolithic concept.
  8. The small gap in performance between the task of PredCls and SGCls suggests that these models suffer from not being able to accurately detect the objects in the video frames. Improving object detectors designed specifically for videos could improve performance.
  9. Amongst numerous techniques, saliency prediction has emerged as a key mechanism to interpret machine learning models.