SlideShare a Scribd company logo
1 of 22
2020 IRRLAB Presentation
IRRLAB
Neural Motifs:
Scene Graph Parsing with Global
Context
Sangmin Woo
2020.04.29
Rowan Zellers1 Mark Yatskar1,2 Sam Thomson3 Yejin Choi1,2
1Paul G. Allen School of Computer Science & Engineering, University of Washington
2Allen Institute for Artificial Intelligence
3School of Computer Science, Carnegie Mellon University
2 / 22
IRRLABContents
 Scene Graph Generation
 Scene Graph Analysis
 Model: Neural Motifs
 Experimental Results
• Quantitative Results
• Qualitative Results
 References
3 / 22
IRRLABScene Graph Generation
 Scene Graph Generation(SGG) (a.k.a. Scene Graph Parsing)
• The task of producing graph representations of real-world images that
provide semantic summaries of objects and their relationships.
4 / 22
IRRLABScene Graph Analysis
 Object and relation types in Visual Genome Dataset
5 / 22
IRRLABScene Graph Analysis
 Types of edges between high-level categories in Visual Genome
6 / 22
IRRLABScene Graph Analysis
 How much information is gained by knowing the identity of
different parts in a scene graphs?
• In general, the identity of edges involved in a relationship is not highly
informative of other elements of the structure while the identities of head
or tail provide significant information, both to each other and to edge labels.
7 / 22
IRRLABScene Graph Analysis
 Larger Motifs
• Higher order structure
8 / 22
IRRLABModel
 Stacked Motif Network(MOTIFNET)
• Region proposals 𝑩
 𝐵 = 𝑏1, … , 𝑏 𝑛 , 𝑏𝑖 ∈ ℝ4
• Object label 𝑶
• Relation label 𝑹
Object
Detection
Object
Label
𝑷 𝑮 𝑰 = 𝑷 𝑩 𝑰 𝑷 𝑶 𝑩, 𝑰 𝑷(𝑹|𝑩, 𝑶, 𝑰)
Relation
Label
9 / 22
IRRLABModel
 Stacked Motif Network(MOTIFNET)
10/ 22
IRRLABModel
 Bounding Boxes
• Utilized Faster R-CNN
• For each image 𝐼, the detector predicts a set of region proposals 𝐵 =
{𝑏1, … , 𝑏 𝑛}.
• For each proposal 𝑏𝑖 ∈ 𝐵, it also outputs a feature vector 𝑓𝑖, and a vector
𝑙𝑖 ∈ 𝑅|𝐶|
of (non-contextualized) object label probabilities.
11 / 22
IRRLABModel
 Objects
• Context Encoding
• Construct a contextualized representation of object prediction based
on the set of proposal regions 𝐵
• Element of 𝐵 are first organized into a linear sequence,
[(𝑏1, 𝑓1, 𝑙1), … , (𝑏 𝑛, 𝑓𝑛, 𝑙 𝑛)].
• The object context, 𝐶, is then computed using a bidirectional LSTM
• 𝐶 = [𝑐1, … , 𝑐 𝑛] contains the final LSTM layer’s hidden states for each
element in the linearization of 𝐵
• 𝑊1 is a parameter matrix that maps the distribution of predicted classes, 𝑙1
to ℝ100
.
• The biLSTM allows all elements of 𝐵 to contribute information about
potential object identities.
12/ 22
IRRLABModel
 Objects
• Decoding
• The context 𝐶 is used to sequentially decode labels for each proposal
bounding region, conditioning on previously decoded labels.
• LSTM is used to decode a category label for each contextualized
representation in 𝐶
• Hidden sates ℎ𝑖 are discarded and object class commitments 𝑜𝑖 are
used in the relation model.
13/ 22
IRRLABModel
 Relations
• Context Encoding
• Construct a contextualized representation of bounding regions 𝐵 and
objects 𝑂 using additional bi-directional LSTM layers
• Where the edge context 𝐷 = [𝑑1, … , 𝑑 𝑛] contains the states for each
bounding region at the final layer, and 𝑊2 is a parameter matrix
mapping 𝑜𝑖 into ℝ100
14/ 22
IRRLABModel
 Relations
• Decoding
• There are quadratic number of possible relations in a scene graph
• For each possible edge, say between 𝑏𝑖 and 𝑏𝑗, the probability the
edge will have label 𝑥𝑖→𝑗 is computed
• The distribution uses global context 𝐷 and a feature vector for the
union of boxes 𝑓𝑖,𝑗
• ° denotes outer product
• 𝑊ℎ and 𝑊𝑡 project the head an tail context into ℝ4096
• 𝑤 𝑜 𝑖, 𝑜 𝑗
is a bias vector specific to the head and tail labels
15/ 22
IRRLABModel
 Frequency Baselines
• To support the finding that object labels are highly predictive of edge labels,
two frequency baselines built off training set statistics are additionally
introduced
1) FREQ, uses pre-trained detector to predict object labels for each RoI
• To obtain predicate probabilities between boxes 𝑖 and 𝑗, empirical
distribution over relationships between objects 𝑜𝑖 and 𝑜𝑗 is looked up
• Intuitively, while this baseline does not look at the image to compute
𝑃(𝑥𝑖→𝑗|𝑜𝑖, 𝑜𝑗), it displays the value of conditioning on object label
predictions 𝑜
2) FREQ-OVERLAP, requires that the two boxes intersect in order for the pair
to count as a valid relation
16/ 22
IRRLABModel
 Experimental Setup
• Alternating Highway LSTMs
• To mitigate vanishing gradient problems as information flows upward,
highway connections to all LSTMs are added
• To additionally reduce the number of parameters, LSTM directions are
alternated
• Each alternating highway LSTM step can be written as follows:
• Where 𝑥𝑖 is the input, ℎ𝑖 represents the hidden state, and 𝑠 is the
direction: 𝑠 = 1 if the current layer is even, and -1 otherwise
• For MOTIFNET, 2 alternating highway LSTM layers are used for
object context, and 4 for edge context
17/ 22
IRRLABModel
 Stacked Motif Network(MOTIFNET)
18/ 22
IRRLABModel
 Experimental Setup
• RoI ordering for LSTMs
• LEFTRIGHT(default): Default option is to sort the regions left-to-right
by the central x-coordinate: it is expected that this encourages the
model to predict edges between nearby objects, which is beneficial as
objects appearing in relationships tend to be close together
• CONFIDENCE: Another option is to order bounding regions based on
the confidence of the maximum non-background prediction from the
detector: as this lets the detector commit to “easy” regions, obtaining
context for more difficult regions
• SIZE: Here, bounding boxes are sorted in descending order by size,
possibly predicting global scene information first
• RANDOM: regions are randomly ordered.
19/ 22
IRRLABModel
 Experimental Setup
• Predicate Visual Features
• To extract visual features for a predicate between boxes 𝑏𝑖, 𝑏𝑗,
detector’s features corresponding to the union box of 𝑏𝑖 and 𝑏𝑗 are
resized to 7x7x256
• Geometric relations are modeled by using 14x14x2 binary input with
one channel per box
• Two convolutional layers are applied to this and add the resulting
7x7x256 representation to the detector features
• Last, fine-tuned VGG fully connected layers are applied to obtain
4096 dimensional representation.
20/ 22
IRRLABResults
 Quantitative Results & Ablation Studies
21/ 22
IRRLABResults
 Qualitative Results
Thank You
2020 IRRLAB Presentation
IRRLAB
shmwoo9395@{gist.ac.kr, gmail.com}

More Related Content

What's hot

Cassandra - Tips And Techniques
Cassandra - Tips And TechniquesCassandra - Tips And Techniques
Cassandra - Tips And TechniquesKnoldus Inc.
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteElectronic Arts / DICE
 
Unit 1. Tổng quan Scratch
Unit 1. Tổng quan ScratchUnit 1. Tổng quan Scratch
Unit 1. Tổng quan ScratchBùi Việt Hà
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...Johan Andersson
 
ใบงานที่ 2.2
ใบงานที่ 2.2ใบงานที่ 2.2
ใบงานที่ 2.2Wachi Kook
 
Hybrid Mobile Development with Apache Cordova and Java EE 7 (JavaOne 2014)
Hybrid Mobile Development with Apache Cordova and Java EE 7 (JavaOne 2014)Hybrid Mobile Development with Apache Cordova and Java EE 7 (JavaOne 2014)
Hybrid Mobile Development with Apache Cordova and Java EE 7 (JavaOne 2014)Ryan Cuprak
 
Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...VMware Tanzu
 
Nested and Parent/Child Docs in ElasticSearch
Nested and Parent/Child Docs in ElasticSearchNested and Parent/Child Docs in ElasticSearch
Nested and Parent/Child Docs in ElasticSearchBeyondTrees
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevAltinity Ltd
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache CalciteDataWorks Summit
 
Future Directions for Compute-for-Graphics
Future Directions for Compute-for-GraphicsFuture Directions for Compute-for-Graphics
Future Directions for Compute-for-GraphicsElectronic Arts / DICE
 
Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016Markus Winand
 
Game Programming 03 - Git Flow
Game Programming 03 - Git FlowGame Programming 03 - Git Flow
Game Programming 03 - Git FlowNick Pruehs
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
Photogrammetry and Star Wars Battlefront
Photogrammetry and Star Wars BattlefrontPhotogrammetry and Star Wars Battlefront
Photogrammetry and Star Wars BattlefrontElectronic Arts / DICE
 
Cs lab04 win-form assignment
Cs lab04   win-form assignmentCs lab04   win-form assignment
Cs lab04 win-form assignmentHoangbach Nguyen
 
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Johan Andersson
 
Singer, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureSinger, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureDiscover Pinterest
 

What's hot (20)

Cassandra - Tips And Techniques
Cassandra - Tips And TechniquesCassandra - Tips And Techniques
Cassandra - Tips And Techniques
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in Frostbite
 
Unit 1. Tổng quan Scratch
Unit 1. Tổng quan ScratchUnit 1. Tổng quan Scratch
Unit 1. Tổng quan Scratch
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
 
ใบงานที่ 2.2
ใบงานที่ 2.2ใบงานที่ 2.2
ใบงานที่ 2.2
 
Hybrid Mobile Development with Apache Cordova and Java EE 7 (JavaOne 2014)
Hybrid Mobile Development with Apache Cordova and Java EE 7 (JavaOne 2014)Hybrid Mobile Development with Apache Cordova and Java EE 7 (JavaOne 2014)
Hybrid Mobile Development with Apache Cordova and Java EE 7 (JavaOne 2014)
 
Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...
 
Nested and Parent/Child Docs in ElasticSearch
Nested and Parent/Child Docs in ElasticSearchNested and Parent/Child Docs in ElasticSearch
Nested and Parent/Child Docs in ElasticSearch
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
 
การวิเคราะห์ระบบ 1
การวิเคราะห์ระบบ 1การวิเคราะห์ระบบ 1
การวิเคราะห์ระบบ 1
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
Future Directions for Compute-for-Graphics
Future Directions for Compute-for-GraphicsFuture Directions for Compute-for-Graphics
Future Directions for Compute-for-Graphics
 
Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016
 
Game Programming 03 - Git Flow
Game Programming 03 - Git FlowGame Programming 03 - Git Flow
Game Programming 03 - Git Flow
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Photogrammetry and Star Wars Battlefront
Photogrammetry and Star Wars BattlefrontPhotogrammetry and Star Wars Battlefront
Photogrammetry and Star Wars Battlefront
 
Cs lab04 win-form assignment
Cs lab04   win-form assignmentCs lab04   win-form assignment
Cs lab04 win-form assignment
 
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
 
Singer, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureSinger, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging Infrastructure
 

Similar to Neural motifs scene graph parsing with global context

Attentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsAttentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsSangmin Woo
 
Graph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationGraph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationSangmin Woo
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)Jinwon Lee
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorJinwon Lee
 
3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous drivingYu Huang
 
"Sparse Graph Attention Networks", IEEE Transactions on Knowledge and Data En...
"Sparse Graph Attention Networks", IEEE Transactions on Knowledge and Data En..."Sparse Graph Attention Networks", IEEE Transactions on Knowledge and Data En...
"Sparse Graph Attention Networks", IEEE Transactions on Knowledge and Data En...ssuser2624f71
 
201907 AutoML and Neural Architecture Search
201907 AutoML and Neural Architecture Search201907 AutoML and Neural Architecture Search
201907 AutoML and Neural Architecture SearchDaeJin Kim
 
Parallelizing Pruning-based Graph Structural Clustering
Parallelizing Pruning-based Graph Structural ClusteringParallelizing Pruning-based Graph Structural Clustering
Parallelizing Pruning-based Graph Structural Clustering煜林 车
 
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...Edge AI and Vision Alliance
 
Adversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain AdaptationAdversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain Adaptationtaeseon ryu
 
NS-CUK Seminar: S.T.Nguyen, Review on "Geom-GCN: Geometric Graph Convolutiona...
NS-CUK Seminar: S.T.Nguyen, Review on "Geom-GCN: Geometric Graph Convolutiona...NS-CUK Seminar: S.T.Nguyen, Review on "Geom-GCN: Geometric Graph Convolutiona...
NS-CUK Seminar: S.T.Nguyen, Review on "Geom-GCN: Geometric Graph Convolutiona...ssuser4b1f48
 
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...ssuser2624f71
 
Image Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyImage Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyNUPUR YADAV
 
NS-CUK Joint Journal Club : S.T.Nguyen, Review on "Graph Neural Networks for ...
NS-CUK Joint Journal Club : S.T.Nguyen, Review on "Graph Neural Networks for ...NS-CUK Joint Journal Club : S.T.Nguyen, Review on "Graph Neural Networks for ...
NS-CUK Joint Journal Club : S.T.Nguyen, Review on "Graph Neural Networks for ...ssuser4b1f48
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingYu Huang
 
LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)Yu Huang
 
Wits presentation 6_28072015
Wits presentation 6_28072015Wits presentation 6_28072015
Wits presentation 6_28072015Beatrice van Eden
 
Anchor free object detection by deep learning
Anchor free object detection by deep learningAnchor free object detection by deep learning
Anchor free object detection by deep learningYu Huang
 
240415_Thuy_Labseminar[Simple and Asymmetric Graph Contrastive Learning witho...
240415_Thuy_Labseminar[Simple and Asymmetric Graph Contrastive Learning witho...240415_Thuy_Labseminar[Simple and Asymmetric Graph Contrastive Learning witho...
240415_Thuy_Labseminar[Simple and Asymmetric Graph Contrastive Learning witho...thanhdowork
 
NS-CUK Seminar: S.T.Nguyen, Review on "Improving Graph Neural Network Express...
NS-CUK Seminar: S.T.Nguyen, Review on "Improving Graph Neural Network Express...NS-CUK Seminar: S.T.Nguyen, Review on "Improving Graph Neural Network Express...
NS-CUK Seminar: S.T.Nguyen, Review on "Improving Graph Neural Network Express...ssuser4b1f48
 

Similar to Neural motifs scene graph parsing with global context (20)

Attentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsAttentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene Graphs
 
Graph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationGraph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph Generation
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox Detector
 
3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving
 
"Sparse Graph Attention Networks", IEEE Transactions on Knowledge and Data En...
"Sparse Graph Attention Networks", IEEE Transactions on Knowledge and Data En..."Sparse Graph Attention Networks", IEEE Transactions on Knowledge and Data En...
"Sparse Graph Attention Networks", IEEE Transactions on Knowledge and Data En...
 
201907 AutoML and Neural Architecture Search
201907 AutoML and Neural Architecture Search201907 AutoML and Neural Architecture Search
201907 AutoML and Neural Architecture Search
 
Parallelizing Pruning-based Graph Structural Clustering
Parallelizing Pruning-based Graph Structural ClusteringParallelizing Pruning-based Graph Structural Clustering
Parallelizing Pruning-based Graph Structural Clustering
 
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
 
Adversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain AdaptationAdversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain Adaptation
 
NS-CUK Seminar: S.T.Nguyen, Review on "Geom-GCN: Geometric Graph Convolutiona...
NS-CUK Seminar: S.T.Nguyen, Review on "Geom-GCN: Geometric Graph Convolutiona...NS-CUK Seminar: S.T.Nguyen, Review on "Geom-GCN: Geometric Graph Convolutiona...
NS-CUK Seminar: S.T.Nguyen, Review on "Geom-GCN: Geometric Graph Convolutiona...
 
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
 
Image Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyImage Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A survey
 
NS-CUK Joint Journal Club : S.T.Nguyen, Review on "Graph Neural Networks for ...
NS-CUK Joint Journal Club : S.T.Nguyen, Review on "Graph Neural Networks for ...NS-CUK Joint Journal Club : S.T.Nguyen, Review on "Graph Neural Networks for ...
NS-CUK Joint Journal Club : S.T.Nguyen, Review on "Graph Neural Networks for ...
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object tracking
 
LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)
 
Wits presentation 6_28072015
Wits presentation 6_28072015Wits presentation 6_28072015
Wits presentation 6_28072015
 
Anchor free object detection by deep learning
Anchor free object detection by deep learningAnchor free object detection by deep learning
Anchor free object detection by deep learning
 
240415_Thuy_Labseminar[Simple and Asymmetric Graph Contrastive Learning witho...
240415_Thuy_Labseminar[Simple and Asymmetric Graph Contrastive Learning witho...240415_Thuy_Labseminar[Simple and Asymmetric Graph Contrastive Learning witho...
240415_Thuy_Labseminar[Simple and Asymmetric Graph Contrastive Learning witho...
 
NS-CUK Seminar: S.T.Nguyen, Review on "Improving Graph Neural Network Express...
NS-CUK Seminar: S.T.Nguyen, Review on "Improving Graph Neural Network Express...NS-CUK Seminar: S.T.Nguyen, Review on "Improving Graph Neural Network Express...
NS-CUK Seminar: S.T.Nguyen, Review on "Improving Graph Neural Network Express...
 

More from Sangmin Woo

Multimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptxMultimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptxSangmin Woo
 
Video Transformers.pptx
Video Transformers.pptxVideo Transformers.pptx
Video Transformers.pptxSangmin Woo
 
Masked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptxMasked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptxSangmin Woo
 
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptxAn Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptxSangmin Woo
 
Visual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptxVisual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptxSangmin Woo
 
Video Grounding.pptx
Video Grounding.pptxVideo Grounding.pptx
Video Grounding.pptxSangmin Woo
 
Action Recognition Datasets.pptx
Action Recognition Datasets.pptxAction Recognition Datasets.pptx
Action Recognition Datasets.pptxSangmin Woo
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningSangmin Woo
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Sangmin Woo
 
Towards Efficient Transformers
Towards Efficient TransformersTowards Efficient Transformers
Towards Efficient TransformersSangmin Woo
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in VisionSangmin Woo
 
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsAction Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsSangmin Woo
 

More from Sangmin Woo (12)

Multimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptxMultimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptx
 
Video Transformers.pptx
Video Transformers.pptxVideo Transformers.pptx
Video Transformers.pptx
 
Masked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptxMasked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptx
 
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptxAn Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
 
Visual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptxVisual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptx
 
Video Grounding.pptx
Video Grounding.pptxVideo Grounding.pptx
Video Grounding.pptx
 
Action Recognition Datasets.pptx
Action Recognition Datasets.pptxAction Recognition Datasets.pptx
Action Recognition Datasets.pptx
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
 
Towards Efficient Transformers
Towards Efficient TransformersTowards Efficient Transformers
Towards Efficient Transformers
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
 
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsAction Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
 

Recently uploaded

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Neural motifs scene graph parsing with global context

  • 1. 2020 IRRLAB Presentation IRRLAB Neural Motifs: Scene Graph Parsing with Global Context Sangmin Woo 2020.04.29 Rowan Zellers1 Mark Yatskar1,2 Sam Thomson3 Yejin Choi1,2 1Paul G. Allen School of Computer Science & Engineering, University of Washington 2Allen Institute for Artificial Intelligence 3School of Computer Science, Carnegie Mellon University
  • 2. 2 / 22 IRRLABContents  Scene Graph Generation  Scene Graph Analysis  Model: Neural Motifs  Experimental Results • Quantitative Results • Qualitative Results  References
  • 3. 3 / 22 IRRLABScene Graph Generation  Scene Graph Generation(SGG) (a.k.a. Scene Graph Parsing) • The task of producing graph representations of real-world images that provide semantic summaries of objects and their relationships.
  • 4. 4 / 22 IRRLABScene Graph Analysis  Object and relation types in Visual Genome Dataset
  • 5. 5 / 22 IRRLABScene Graph Analysis  Types of edges between high-level categories in Visual Genome
  • 6. 6 / 22 IRRLABScene Graph Analysis  How much information is gained by knowing the identity of different parts in a scene graphs? • In general, the identity of edges involved in a relationship is not highly informative of other elements of the structure while the identities of head or tail provide significant information, both to each other and to edge labels.
  • 7. 7 / 22 IRRLABScene Graph Analysis  Larger Motifs • Higher order structure
  • 8. 8 / 22 IRRLABModel  Stacked Motif Network(MOTIFNET) • Region proposals 𝑩  𝐵 = 𝑏1, … , 𝑏 𝑛 , 𝑏𝑖 ∈ ℝ4 • Object label 𝑶 • Relation label 𝑹 Object Detection Object Label 𝑷 𝑮 𝑰 = 𝑷 𝑩 𝑰 𝑷 𝑶 𝑩, 𝑰 𝑷(𝑹|𝑩, 𝑶, 𝑰) Relation Label
  • 9. 9 / 22 IRRLABModel  Stacked Motif Network(MOTIFNET)
  • 10. 10/ 22 IRRLABModel  Bounding Boxes • Utilized Faster R-CNN • For each image 𝐼, the detector predicts a set of region proposals 𝐵 = {𝑏1, … , 𝑏 𝑛}. • For each proposal 𝑏𝑖 ∈ 𝐵, it also outputs a feature vector 𝑓𝑖, and a vector 𝑙𝑖 ∈ 𝑅|𝐶| of (non-contextualized) object label probabilities.
  • 11. 11 / 22 IRRLABModel  Objects • Context Encoding • Construct a contextualized representation of object prediction based on the set of proposal regions 𝐵 • Element of 𝐵 are first organized into a linear sequence, [(𝑏1, 𝑓1, 𝑙1), … , (𝑏 𝑛, 𝑓𝑛, 𝑙 𝑛)]. • The object context, 𝐶, is then computed using a bidirectional LSTM • 𝐶 = [𝑐1, … , 𝑐 𝑛] contains the final LSTM layer’s hidden states for each element in the linearization of 𝐵 • 𝑊1 is a parameter matrix that maps the distribution of predicted classes, 𝑙1 to ℝ100 . • The biLSTM allows all elements of 𝐵 to contribute information about potential object identities.
  • 12. 12/ 22 IRRLABModel  Objects • Decoding • The context 𝐶 is used to sequentially decode labels for each proposal bounding region, conditioning on previously decoded labels. • LSTM is used to decode a category label for each contextualized representation in 𝐶 • Hidden sates ℎ𝑖 are discarded and object class commitments 𝑜𝑖 are used in the relation model.
  • 13. 13/ 22 IRRLABModel  Relations • Context Encoding • Construct a contextualized representation of bounding regions 𝐵 and objects 𝑂 using additional bi-directional LSTM layers • Where the edge context 𝐷 = [𝑑1, … , 𝑑 𝑛] contains the states for each bounding region at the final layer, and 𝑊2 is a parameter matrix mapping 𝑜𝑖 into ℝ100
  • 14. 14/ 22 IRRLABModel  Relations • Decoding • There are quadratic number of possible relations in a scene graph • For each possible edge, say between 𝑏𝑖 and 𝑏𝑗, the probability the edge will have label 𝑥𝑖→𝑗 is computed • The distribution uses global context 𝐷 and a feature vector for the union of boxes 𝑓𝑖,𝑗 • ° denotes outer product • 𝑊ℎ and 𝑊𝑡 project the head an tail context into ℝ4096 • 𝑤 𝑜 𝑖, 𝑜 𝑗 is a bias vector specific to the head and tail labels
  • 15. 15/ 22 IRRLABModel  Frequency Baselines • To support the finding that object labels are highly predictive of edge labels, two frequency baselines built off training set statistics are additionally introduced 1) FREQ, uses pre-trained detector to predict object labels for each RoI • To obtain predicate probabilities between boxes 𝑖 and 𝑗, empirical distribution over relationships between objects 𝑜𝑖 and 𝑜𝑗 is looked up • Intuitively, while this baseline does not look at the image to compute 𝑃(𝑥𝑖→𝑗|𝑜𝑖, 𝑜𝑗), it displays the value of conditioning on object label predictions 𝑜 2) FREQ-OVERLAP, requires that the two boxes intersect in order for the pair to count as a valid relation
  • 16. 16/ 22 IRRLABModel  Experimental Setup • Alternating Highway LSTMs • To mitigate vanishing gradient problems as information flows upward, highway connections to all LSTMs are added • To additionally reduce the number of parameters, LSTM directions are alternated • Each alternating highway LSTM step can be written as follows: • Where 𝑥𝑖 is the input, ℎ𝑖 represents the hidden state, and 𝑠 is the direction: 𝑠 = 1 if the current layer is even, and -1 otherwise • For MOTIFNET, 2 alternating highway LSTM layers are used for object context, and 4 for edge context
  • 17. 17/ 22 IRRLABModel  Stacked Motif Network(MOTIFNET)
  • 18. 18/ 22 IRRLABModel  Experimental Setup • RoI ordering for LSTMs • LEFTRIGHT(default): Default option is to sort the regions left-to-right by the central x-coordinate: it is expected that this encourages the model to predict edges between nearby objects, which is beneficial as objects appearing in relationships tend to be close together • CONFIDENCE: Another option is to order bounding regions based on the confidence of the maximum non-background prediction from the detector: as this lets the detector commit to “easy” regions, obtaining context for more difficult regions • SIZE: Here, bounding boxes are sorted in descending order by size, possibly predicting global scene information first • RANDOM: regions are randomly ordered.
  • 19. 19/ 22 IRRLABModel  Experimental Setup • Predicate Visual Features • To extract visual features for a predicate between boxes 𝑏𝑖, 𝑏𝑗, detector’s features corresponding to the union box of 𝑏𝑖 and 𝑏𝑗 are resized to 7x7x256 • Geometric relations are modeled by using 14x14x2 binary input with one channel per box • Two convolutional layers are applied to this and add the resulting 7x7x256 representation to the detector features • Last, fine-tuned VGG fully connected layers are applied to obtain 4096 dimensional representation.
  • 20. 20/ 22 IRRLABResults  Quantitative Results & Ablation Studies
  • 22. Thank You 2020 IRRLAB Presentation IRRLAB shmwoo9395@{gist.ac.kr, gmail.com}

Editor's Notes

  1. 관련 연구, Scene Graph 분석, 모델, 실험 결과 순으로 발표
  2. Scene상의 Object를 node, 한 쌍의 object간의 관계를 edge로 구조화하여 표현하는 것을 Scene Graph라고 함 SGG task는 하나의 이미지가 주어지면 이러한 SG를 생성하는 task를 말함.
  3. Entity를 이루는것 중에 part type이 25%를 넘게 차지한다. Relation중에서는 Geometric과 Possessive가 90%를 넘게 차지한다.
  4. Visual genome dataset에서 entity category간의 edge가 어떻게 연결되는지에 대한 분포를 보여줌. Geometric, possessive, semantic edge가 각각 50.9, 40.9, 8.7%임. Semantic edge는 대부분 사람과 다른 사물 간에서 발생.
  5. Graph의 다른 요소들이 주어졌을 때, head, tail, edge label을 예측하는것에 대한 likelihoo를 그래프로 표현한것. X축은 top-k에서 k를, Y축은 accurac를 보여줌. 일반적으로 relationship에서 edge는 head나 tail에 비해서 informative하지 않다. Object pair에 의해 edge가 결정되는 추이를 보인다. Object pair만 주어져도 edge를 k=1의 경우에서 70%확률을 보인다.
  6. Graph의 다른 요소들이 주어졌을 때, head, tail, edge label을 예측하는것에 대한 likelihoo를 그래프로 표현한것. X축은 top-k에서 k를, Y축은 accurac를 보여줌. 일반적으로 relationship에서 edge는 head나 tail에 비해서 informative하지 않다. Object pair에 의해 edge가 결정되는 추이를 보인다. Object pair만 주어져도 edge를 k=1의 경우에서 70%확률을 보인다.
  7. 전체적인 과정을 수식화한 것. 이미지 I가 주어졌을 때, Scene Graph G를 만드는것이 목적. 이를 세 단계로 분할하면, 첫 번째는 object detection, object label 예측, relation label 예측으로 분리됨. Region proposal은 Bounding box의 B를 따서 B로 표시함. N은 한 이미지 내에서 여러 object가 있을 수 있기 때문에, n개의 object를 의미함. Bounding box는 두점의 좌표로 표시할 수 있기 때문에 4 dimension에 해당함. Object label은 O로 표시되고 마찬가지로, n개의 bounding box하나에 해당하는 label이 있을 수 있음. Relation label R로 표시되고, 이거는 n개의 bounding box 간의 연결 가능한 개수로 보기때문에 n(n-1)개가 나옴.
  8. Scene graph parsing을 세 단계로 나누어 해결하려함. 첫 번째는 bounding box를 찾는것으로 backbone vgg16와 faster r-cnn을 이용하여 object proposa을 찾아냄. 다음은 찾아낸 object region에 대해서 bi-LSTM을 이용하여 context를 encoding하고, LSTM을 이용하여 decoding하여 object label을 예측함. Object context와 예측된 label을 같이 input으로 사용해서 edge context를 encoding 함. 이렇게 Encode된 contex를 가지고 relationship label을 예측함.
  9. 첫번째 step은 Bounding box를 과정인데 여기서 faster r-cnn을 사용. 각각의 image I에 대해서, detector은 region proposal B를 먼저 예측함. 각각의 proposal b_i는 feature vector f_i와 object 레이블 확률분포 l_i를 가짐.
  10. Object label을 찾는 과정에서는 먼저 앞서 찾은 bounding box B에 대해서 Context를 encoding하는 과정을 거침. 먼저 B는 linear sequence로 재구성됨. 여기서 b는 bounding box, f는 visual feature, l은 object label probability distributio임. Object context C는 bi-LSTM을 통해 계산됨. C는 각 element에 대한 최종 LSTM 레이어의 hidden state임. W는 100차원의 weight matri임.
  11. 앞서 encode한 context를 decode하는 단계로 여기서도 마찬가지로 LSTM이 이용됨. Context c와 이전 state의 object class를 input으로 LSTM에 넣었을 때 h라는 hidden state가 나옴. W는 weight matri로, h와 곱해져서 argmax를 하면 가장 큰 object class probabilit가 나옴. 그러면 해당하는 object class가 다음 relation model의 inpute으로 전달됨.
  12. Relation에 대한 context를 encoding 하는 단계. bounding box B와 object O를 bi-LSTM을 이용하여 encoding 함. Edge context D는 최종 레이어에서의 각 bounding regio에 대한 state를 가짐. W_2는 100차원의 weight matrix임.
  13. Scene graph에서는 지수승의 relation 조합이 있을 수 있음. 각각의 가능한 edge 조합에 대해서 edge가 연결이 되어있을 확률을 계산함 D는 앞서 relation encoding에서 계산된 Global context, f_ij는 union box에 대한 feature vector를 의미. W_h와 W_t는 head와 tail을 각각 4096차원으로 projec하는 weight matrix. W_oi,oj는 bias vector
  14. Object label이 주어졌을 때, edge label을 예측하는 성능이 괜찮음을 보이기 위해 두 frequency baseline을 제시함 이는 training set의 통계에 따라 만들어지는데, 예를 들어서 training set에 등장하는 man과 bicycl의 관계가 ridin이 가장 많이 등장하면 Test set에 대해서도 ridin으로 예측하는 방식임. 첫번째 baseline FREQ 각 RoI에 대해서는 pretrained detector를 사용하고, box i와 j간의 relatio에 대한 확률을 계산하기 위해서 해당하는 object o_i와 o_j 사이에서 등장하는 relationship distribution을 확인함. 즉, image는 전혀 사용되지 않고, 단지 object label에 대해서만 conditionin됨. FREQ-overla은 FREQ와 동일한데, 두 bounding box가 겹치는 부분 즉, intersection이 있을때만 유효한 relation으로 간주함.
  15. Object label이 주어졌을 때, edge label을 예측하는 성능이 괜찮음을 보이기 위해 두 frequency baseline을 제시함 이는 training set의 통계에 따라 만들어지는데, 예를 들어서 training set에 등장하는 man과 bicycl의 관계가 ridin이 가장 많이 등장하면 Test set에 대해서도 ridin으로 예측하는 방식임. 첫번째 baseline FREQ 각 RoI에 대해서는 pretrained detector를 사용하고, box i와 j간의 relatio에 대한 확률을 계산하기 위해서 해당하는 object o_i와 o_j 사이에서 등장하는 relationship distribution을 확인함. 즉, image는 전혀 사용되지 않고, 단지 object label에 대해서만 conditionin됨. FREQ-overla은 FREQ와 동일한데, 두 bounding box가 겹치는 부분 즉, intersection이 있을때만 유효한 relation으로 간주함.
  16. Scene graph parsing을 세 단계로 나누어 해결하려함. 첫 번째는 bounding box를 찾는것으로 backbone vgg16와 faster r-cnn을 이용하여 object proposa을 찾아냄. 다음은 찾아낸 object region에 대해서 bi-LSTM을 이용하여 context를 encoding하고, LSTM을 이용하여 decoding하여 object label을 예측함. Object context와 예측된 label을 같이 input으로 사용해서 edge context를 encoding 함. 이렇게 Encode된 contex를 가지고 relationship label을 예측함.
  17. Figure에서 LSTM의 input으로 들어가는 RoI의 순서를 어떻게 정할지를 실험함. LEFTRIGHT는 RoI를 x좌표 기준으로 왼쪽에서 오른쪽으로 sort하는것. 근처에 있는 object 간의 edge를 predic하도록 모델링함. 가까운 objec일수록 relationshi이 나타날 확률이 높기 때문. CONFIDENCE는 detecto로부터 prediction confidence가 가장 높은 순으로 정렬함. 쉬운것부터 어려운순대로 모델링함 SIZE는 bounding box의 크기에 따라 정렬함. Global scene에 대한 정보 부터 모델링함 RANDOM은 무작위로 정렬됨.
  18. Object에 대한 visual featur는 앞단의 Faster r-cnn에서 곧바로 얻을 수 있는데, Predicate의 visual featur는 두 objec에 대한 featur이기 때문에 두 object의 union bounding box에서 얻어냄 이게 여기서는 7x7x256 사이즈로 resiz됨. Union bounding box 내에서의 두 objec간의 geometric relation은 14x14x2 크기의 binary input을 이용하여 모델링됨. 여기에 2개의 conv layer가 적용되면 7x7x256 크기의 representatio이 되고 이게 앞선 detector featur와 더해짐 최종적으로 VGG의 fine-tuned fc layer를 통해서 4096 크기의 representatio을 얻음.
  19. Freq-overla만으로도 기존의 SOTA인 message passin을 1.4 mean recall 만큼 증가시켰다. Motifnet-nocontext가 freq-overlap에 비해서 성능이 증가했다는것은 visual featur에 edge prediction을 위한 단서가 있다는 것을 보여줌. 또한 motifnet-nocontext와 full model(motifnet-leftright)의 비교를 통해 context 중요하다는 것을 알 수 있다.
  20. 가장 흔한 실패 케이스. 첫번째 케이스는, wearing vs. wears 와 같은 predicate ambiguity. 두번째 케이스는, detector이 object를 잘못 찾은 경우.