VQA-GNN: Reasoning with
Multimodal Knowledge via
Graph Neural Networks for
Visual Question Answering
Tien-Bach-Thanh Do
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/07/29
Yanan Wang et al.
ICCV 2023
2
Introduction
Figure 1: Overview of VQA-GNN. Given an image and QA sentence, we obtain unstructured knowledge
(e.g., QA-concept node p and QA-context node z) and structured knowledge (e.g., scene-graph and
concept-graph), and then unify them to perform bidirectional fusion for VQA
3
Method
Multimodal semantic graph
Figure 2. Reasoning procedure of VQA-GNN. We first build a multimodal semantic graph for each given
image-QA pair to unify unstructured (e.g., “node p” and “node z”) and structured (e.g. “scene-graph” and
“concept-graph”) multimodal knowledge. Then we perform inter-modal message passing with a
multimodal GNN-based bidirectional fusion method to update the representations of node z, p, vi, and ci
for k+1 iterations in 2 steps. Finally, we predict the answer with these updated node
representations. Here, “S” and “C” indicate scene-graph and concept-graph respectively. “LM_encoder”
indicates a language model used to finetune QA-context node representation, and “GNN” indicates a
relation-graph neural network for iterative message passing
4
Method
Multimodal semantic graph - Scene-graph encoding
● Given an image, apply a pretrained scene-graph generator to extract a scene graph consisting of the
top (recall@20) (subject, predicate, object) triplets, which represent the structured image context
● Apply a pretrained object-detection model to embed the set of scene-graph nodes
● Predicates define the edge types of the scene graph, yielding the set of scene-graph edges (a minimal
construction is sketched below)
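As a minimal illustration of this step, the sketch below builds a typed scene graph from detector triplets. The `embed_region` feature extractor stands in for the pretrained object detector and is an assumption for illustration, not the paper's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)       # object labels, one per region
    node_feats: list = field(default_factory=list)  # detector embeddings, one per node
    edges: list = field(default_factory=list)       # (subj_idx, predicate, obj_idx)

def build_scene_graph(triplets, embed_region):
    """triplets: top-k (subject, predicate, object) pairs from a pretrained SG generator.
    embed_region: feature extractor of a pretrained object detector (assumed given)."""
    sg, index = SceneGraph(), {}
    for subj, pred, obj in triplets:
        for name in (subj, obj):
            if name not in index:                # one node per unique object
                index[name] = len(sg.nodes)
                sg.nodes.append(name)
                sg.node_feats.append(embed_region(name))
        sg.edges.append((index[subj], pred, index[obj]))  # predicate = edge type
    return sg

# Toy usage with a stub embedder:
sg = build_scene_graph(
    [("man", "holding", "bottle"), ("bottle", "on", "table")],
    embed_region=lambda name: [0.0] * 4,  # stand-in for detector features
)
print(sg.nodes, sg.edges)
```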
5
Method
Multimodal semantic graph - QA-concept node retrieval
● Assuming that the global image context of the correct choice aligns with the local image context,
employ a pretrained sentence-BERT model to calculate the similarity between each answer choice and all
region-image descriptions in the Visual Genome dataset
● The retrieved region images capture the global image context associated with each choice
● Retrieve the top 10 results and embed them with the same object detector
● The embeddings are averaged to obtain the QA-concept node, denoted p (see the sketch below)
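A hedged sketch of this retrieval, assuming the `sentence-transformers` library; the checkpoint name `all-MiniLM-L6-v2` and the `embed_image` detector hook are illustrative placeholders, since the slide does not pin a specific sentence-BERT variant.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

def retrieve_qa_concept_node(choice, region_captions, region_images, embed_image, k=10):
    """Rank Visual Genome region descriptions by similarity to an answer choice,
    then average the detector embeddings of the top-k regions into node p."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-BERT checkpoint
    sims = util.cos_sim(model.encode(choice, convert_to_tensor=True),
                        model.encode(region_captions, convert_to_tensor=True))[0]
    top = sims.argsort(descending=True)[:k]
    feats = np.stack([embed_image(region_images[i]) for i in top.tolist()])
    return feats.mean(axis=0)  # QA-concept node p

# p = retrieve_qa_concept_node("a man buying a drink", captions, images, detector)
```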
6
Method
Multimodal semantic graph - Concept-graph retrieval
7
Method
Multimodal semantic graph - Concept-graph retrieval
● Extract concept entities from both the image and the answer choices
● Treat detected object names as potential contextual entities
● For each answer choice, ground phrases that appear as concepts in the ConceptNet KG (e.g., shop)
● Use the grounded phrases to retrieve their 1-hop neighbor nodes from the ConceptNet KG
● Use a word2vec model to compute a relevance score between concept-node candidates and answer
choices, pruning irrelevant nodes whose relevance score is below 0.6
● Combine the parsed local concept entities of the image with the retrieved subgraph
● Since ConceptNet encompasses various relation types, if a local concept entity is found adjacent to a
retrieved entity, build a new knowledge triple, e.g., (bottle, atlocation, beverage)
● Construct a concept graph that depicts the structured knowledge at the concept level
● Obtain a collection of concept-graph nodes (the retrieval loop is sketched below)
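The loop below sketches the retrieve-and-prune procedure under stated assumptions: `kg_neighbors` (a 1-hop ConceptNet lookup) and `relevance` (a word2vec cosine similarity) are hypothetical helpers the caller must supply, e.g., from a ConceptNet dump and gensim.

```python
def retrieve_concept_graph(answer_phrases, image_entities, kg_neighbors,
                           relevance, thresh=0.6):
    """kg_neighbors(concept) -> list of (head, relation, tail) 1-hop ConceptNet triples;
    relevance(node, phrase) -> word2vec similarity. Both are assumed provided."""
    triples = []
    for phrase in answer_phrases:                  # grounded answer concepts
        for head, rel, tail in kg_neighbors(phrase):
            # prune neighbors irrelevant to every answer phrase (score < thresh)
            if max(relevance(tail, p) for p in answer_phrases) >= thresh:
                triples.append((head, rel, tail))
    retrieved = {n for h, _, t in triples for n in (h, t)}
    # link local image entities adjacent to a retrieved entity into the subgraph,
    # forming new triples such as (bottle, atlocation, beverage)
    for ent in image_entities:
        for head, rel, tail in kg_neighbors(ent):
            if tail in retrieved:
                triples.append((ent, rel, tail))
    nodes = {n for h, _, t in triples for n in (h, t)}
    return nodes, triples
```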
8
Method
Multimodal semantic graph - QA-context node encoding
● Introduce a QA-context node z to inter-connect the scene-graph and the concept-graph using 3
additional relation types:
○ Question edge r(q)
○ Answer edge r(a)
○ Image edge r(e)
● r(e) links z with V(s), capturing the relationship between the QA context and relevant entities in the
scene-graph
● r(q) and r(a) link z with entities extracted from the question and answer text, capturing the relationship
between the QA context and relevant entities in the concept-graph
● Construct the multimodal semantic graph G = {S, C}, which provides a joint reasoning space: it includes
the 2 sub-graphs (scene-graph and concept-graph) and the 2 super nodes (QA-concept node and
QA-context node)
● Employ a RoBERTa LM as the encoder of the QA-context node z and finetune it together with the GNN
modules (the edge wiring is sketched below)
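A minimal sketch of how the super node z could be wired into both sub-graphs with the three relation types; the node and edge names here are illustrative, and the RoBERTa encoding of z is assumed to happen elsewhere.

```python
def connect_super_node(scene_nodes, question_concepts, answer_concepts):
    """Edges attaching the QA-context super node z to both sub-graphs.
    Node z itself is encoded with RoBERTa over the QA sentence (not shown)."""
    Z = "z"  # QA-context super node
    edges = []
    edges += [(Z, "r_e", v) for v in scene_nodes]        # image edges into the scene-graph
    edges += [(Z, "r_q", c) for c in question_concepts]  # question edges into the concept-graph
    edges += [(Z, "r_a", c) for c in answer_concepts]    # answer edges into the concept-graph
    return edges

print(connect_super_node(["man", "bottle"], ["shop"], ["beverage"]))
```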
9
Method
Multimodal GNN-based bidirectional fusion
● The Relation-GNN is built on GAT, introducing a multi-relation aware message for attention-based
message aggregation
● 5 node types T = {Z, P, S, Q, C} in the multimodal semantic graph, indicating z, p, s, q, c
● A relation edge should capture the relationship from node i to node j
● Node-type embeddings (one-hot vectors) are concatenated with the edge embedding and passed
through a 2-layer MLP to generate the multi-relation embedding
● The multi-relation aware message is a linear transformation of the multi-relation embedding
concatenated with the representation of each node i (see the PyTorch sketch below)
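The PyTorch module below is a sketch of this message computation, following the description above (and the QA-GNN lineage the paper builds on); the dimensions and layer shapes are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiRelationMessage(nn.Module):
    """Sketch: r_ij = MLP([edge_onehot ; type_onehot_i ; type_onehot_j]),
    m_ij = W [h_i ; r_ij]. Shapes are illustrative."""
    def __init__(self, n_edge_types, n_node_types, d):
        super().__init__()
        in_dim = n_edge_types + 2 * n_node_types
        # 2-layer MLP over concatenated one-hot vectors -> multi-relation embedding
        self.f_r = nn.Sequential(nn.Linear(in_dim, d), nn.ReLU(), nn.Linear(d, d))
        # linear transformation producing the message from node i to node j
        self.f_m = nn.Linear(2 * d, d)

    def forward(self, h_i, edge_onehot, type_i_onehot, type_j_onehot):
        r_ij = self.f_r(torch.cat([edge_onehot, type_i_onehot, type_j_onehot], dim=-1))
        return self.f_m(torch.cat([h_i, r_ij], dim=-1))  # multi-relation aware message
```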
10
Method
Multimodal GNN-based bidirectional fusion
● Perform message passing to update the node representations in each graph in parallel: each node
aggregates the multi-relation aware messages from its neighborhood via attention, followed by a linear
transformation, so that the structured nodes (scene-graph and concept-graph entities) and the
unstructured super nodes (z and p) are fused bidirectionally (a simplified aggregation step is sketched
below)
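A simplified aggregation step, assuming precomputed GAT-style attention scores; in the actual model these scores come from query/key projections, and the update alternates over the two graphs in the 2-step bidirectional schedule.

```python
import torch
import torch.nn.functional as F

def aggregate(h, messages, dst, att_scores):
    """One update step: each node j sums attention-weighted multi-relation messages
    from its in-edges, with a residual connection. h: (N, d); messages, att_scores:
    per-edge tensors; dst: (E,) destination node index per edge."""
    alpha = torch.zeros_like(att_scores)
    for j in dst.unique():                       # softmax over each node's in-edges
        mask = dst == j
        alpha[mask] = F.softmax(att_scores[mask], dim=0)
    out = torch.zeros_like(h)
    out.index_add_(0, dst, alpha.unsqueeze(-1) * messages)
    return h + out                               # residual update h^(k+1)
```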
11
Method
Inference and Learning
● To identify the correct answer, compute a probability for each answer choice from its multimodal
semantic knowledge: the scene-graph, the concept-graph, the QA-context node, and the QA-concept node
● logit(a): the confidence score of answer choice a (a scoring sketch follows below)
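A sketch of the scoring head, assuming the four representations named above are concatenated and mapped to a scalar logit; how the two graphs are pooled into `s_pooled`/`c_pooled` is an assumed detail.

```python
import torch
import torch.nn as nn

class AnswerScorer(nn.Module):
    """logit(a) from the concatenated QA-context z, QA-concept p, pooled scene-graph
    and pooled concept-graph representations; the answer is the arg-max over choices."""
    def __init__(self, d):
        super().__init__()
        self.head = nn.Linear(4 * d, 1)

    def forward(self, z, p, s_pooled, c_pooled):
        return self.head(torch.cat([z, p, s_pooled, c_pooled], dim=-1)).squeeze(-1)

# Probability over, e.g., 4 VCR choices:
# probs = torch.softmax(torch.stack([scorer(*reps_a) for reps_a in choices]), dim=0)
```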
12
Experiments
Setup
● Visual Commonsense Reasoning (VCR)
○ Contains 290k pairs of questions, answers, and rationales over 110k unique movie scenes
○ VCR consists of 2 tasks: VQA (Q->A) and answer justification (QA->R)
○ Each question in the dataset is provided with 4 candidate answers
○ Q->A: select the best answer; QA->R: justify the given question-answer pair by picking the best
rationale out of the 4 candidates
○ Jointly train VQA-GNN on Q->A and QA->R, with the LM encoder and the multimodal semantic graph for
Q->A, and a concept graph retrieved from the question-answer pair plus a rationale candidate for QA->R
○ Use a pretrained RoBERTa-Large model to embed the QA-context node
● GQA dataset
○ Contains 1.5M questions corresponding to 1,842 answer tokens, and 110K scene graphs
○ Defines the question as a context node (node q) that fully connects the visual and textual SGs to
structure the multimodal semantic graph
○ Node q is embedded with a pretrained RoBERTa-Large model; object nodes in the visual SG are
initialized with object features, and object nodes in the textual SG by concatenating the GloVe-based
word embeddings of the object name and its attributes (sketched below)
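One plausible reading of the textual-SG node initialization, sketched below; `glove` is assumed to be a word-to-vector lookup (e.g., gensim KeyedVectors), the 300-d size is illustrative, and averaging the attribute vectors before concatenation is an assumption.

```python
import numpy as np

def init_textual_sg_node(name, attributes, glove, dim=300):
    """Concatenate the GloVe embedding of the object name with the averaged
    embeddings of its attributes; unknown words fall back to zero vectors."""
    def vec(w):
        return glove[w] if w in glove else np.zeros(dim)
    attr = np.mean([vec(a) for a in attributes], axis=0) if attributes else np.zeros(dim)
    return np.concatenate([vec(name), attr])

# node_feat = init_textual_sg_node("bottle", ["green", "glass"], glove_vectors)
```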
13
Experiments
Evaluation on VCR dataset
14
Experiments
Effectiveness of the multimodal semantic graph
15
Experiments
Analysis of the multimodal GNN method
Figure 4. Ablation architectures. We find that our final VQA-GNN architecture with two modality-
specialized GNNs overcomes the representation gaps between modalities
16
Experiments
GQA dataset
Table 4. Accuracy scores on the GQA validation set. All models are trained under the realistic setup of not
using the annotated semantic functional programs
17
Experiments
Ablation study on the bidirectional fusion
Table 5. Ablation results on the effect of our proposed bidirectional fusion for GQA
18
Conclusion
● Proposed VQA-GNN, which unifies unstructured and structured multimodal knowledge to perform joint
reasoning over the scene
● Outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting strong
concept-level reasoning
● Ablation studies demonstrate the efficacy of the bidirectional fusion and the multimodal GNN method in
unifying unstructured and structured multimodal knowledge
