Visual Commonsense Reasoning.pptx

2022-04-21
Sangmin Woo
Computational Intelligence Lab.
School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
Visual Commonsense Reasoning

Visual Commonsense Reasoning?
With one glance at an image, we can effortlessly imagine the world beyond the pixels.
We can infer people’s actions, goals, and mental states.
However, this remains tremendously difficult for today’s vision systems.
Visual Commonsense Reasoning!
Given a challenging question about an image, a machine must answer correctly
and then provide a rationale justifying its answer.
Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." CVPR 2019.

Visual Commonsense Reasoning?
Visual Commonsense Reasoning = Visual Question Answering + Rationale

What’s new?
New task: Visual Commonsense Reasoning (VCR)
 - Given an image, answer a question and provide a rationale justifying the answer.
New dataset: VCR dataset
 - 290K question-answer-rationale sets (derived from 110K movie scenes)
 - Humans find VCR easy (over 90% accuracy)
 - State-of-the-art vision models struggle (~45%)
 - Multiple-choice QA problems
 - Adversarial Matching: each correct answer is recycled exactly three times, as a negative answer for three other questions.
New model: R2C (Recognition to Cognition Networks)
 - R2C narrows the gap between humans and machines (~65%)

VCR Task
New task: Visual Commonsense Reasoning (VCR)
 - Q->AR: VCR is cast as a four-way multiple-choice problem.
 - Answering (Q->A): Given a question along with four answer choices, a model must first select the right answer.
 - Justification (QA->R): If its answer is correct, it is then given four rationale choices and must select the correct rationale.
 - The machine needs to understand activities, the roles of people, the mental states of people, and the likely events before and after the scene.
 - The VCR task covers these categories and more.
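The three settings above (Q->A, QA->R, Q->AR) can be scored as in the following minimal sketch. It assumes that a Q->AR prediction counts as correct only when both the answer and the rationale are right; the `Prediction` record and the `vcr_accuracies` helper are hypothetical names introduced here for illustration, not part of the VCR release.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One evaluated item (hypothetical record layout)."""
    answer_pred: int     # index 0..3 chosen among the four answers
    answer_gold: int     # index of the correct answer
    rationale_pred: int  # index 0..3 chosen among the four rationales
    rationale_gold: int  # index of the correct rationale


def vcr_accuracies(preds):
    """Return accuracy for Q->A, QA->R, and the joint Q->AR setting."""
    if not preds:
        raise ValueError("need at least one prediction")
    n = len(preds)
    answer_ok = [p.answer_pred == p.answer_gold for p in preds]
    rationale_ok = [p.rationale_pred == p.rationale_gold for p in preds]
    # Assumption: Q->AR is correct only if both parts are correct.
    joint_ok = [a and r for a, r in zip(answer_ok, rationale_ok)]
    return {
        "Q->A": sum(answer_ok) / n,
        "QA->R": sum(rationale_ok) / n,
        "Q->AR": sum(joint_ok) / n,
    }


if __name__ == "__main__":
    toy = [
        Prediction(0, 0, 1, 1),  # both right
        Prediction(2, 2, 0, 3),  # answer right, rationale wrong
        Prediction(1, 3, 2, 2),  # answer wrong, rationale right
        Prediction(0, 1, 0, 1),  # both wrong
    ]
    print(vcr_accuracies(toy))
```

Under uniform random guessing, each four-way choice is right with probability 1/4, so the joint setting would be right with probability 1/4 × 1/4 = 1/16.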

VCR Dataset Construction
Interestingness
Adversarial Matching
Dataset Collection
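The recycling property of Adversarial Matching (each correct answer reused as a negative for exactly three other questions) can be illustrated with a deliberately simplified sketch. It uses fixed cyclic shifts to pick the negatives; how the actual dataset decides which answers to pair with which questions is not covered on these slides, so this stand-in only demonstrates the counting property. The function name `recycle_answers` is hypothetical.

```python
def recycle_answers(correct_answers, num_negatives=3):
    """Build multiple-choice items by reusing other questions' correct answers.

    Simplified stand-in: question i borrows the correct answers of questions
    i+1, i+2, ..., i+num_negatives (wrapping around). Assumes the correct
    answers are distinct.
    """
    n = len(correct_answers)
    if n <= num_negatives:
        raise ValueError("need more questions than negatives per question")
    items = []
    for i, answer in enumerate(correct_answers):
        negatives = [
            correct_answers[(i + shift) % n]
            for shift in range(1, num_negatives + 1)
        ]
        # The correct answer is placed first here; a real dataset would shuffle.
        items.append({"correct": answer, "choices": [answer] + negatives})
    return items


if __name__ == "__main__":
    answers = ["a0", "a1", "a2", "a3", "a4"]  # toy placeholders
    for item in recycle_answers(answers):
        print(item)
```

Because the shifts 1..3 are all smaller than the number of questions, no answer is ever a negative for its own question, and every answer appears as a negative exactly three times.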

R2C: Recognition to Cognition Networks
Ground the meaning of the query and each response.
 - Referring to the image for the two people
Contextualize the meaning of the query, response, and image together.
 - Resolving the referent “he” and why one might be pointing in a diner
Reason about the interplay of relevant image regions, query, and response.
 - Determining the social dynamics between person1 and person4
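The ground, contextualize, and reason stages above can be mirrored in a toy, dependency-free sketch. This is not the R2C architecture: the real model uses learned neural modules, whereas this version assumes fixed dot-product attention, mean pooling, and a single linear scorer purely to show how data flows through the three stages. All function names, the token format `(text_vec, object_index_or_None)`, and the `weights` vector are assumptions made for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def ground(tokens, objects, image_vec):
    """Grounding: append a visual vector to every token vector.

    A token that refers to a detected object (e.g. person1) carries that
    object's index; other tokens fall back to a whole-image vector
    (an assumption of this sketch).
    """
    return [
        text + (objects[idx] if idx is not None else image_vec)
        for text, idx in tokens
    ]


def attend(queries, keys):
    """For each query vector, return a softmax-weighted average of the keys."""
    return [
        weighted_sum(softmax([dot(q, k) for k in keys]), keys)
        for q in queries
    ]


def score_response(query, response, objects, image_vec, weights):
    q = ground(query, objects, image_vec)
    r = ground(response, objects, image_vec)
    # Contextualization: each response token attends over the query tokens.
    attended_q = attend(r, q)
    # Reasoning (simplified): concatenate, mean-pool over tokens, linear score.
    joined = [rv + qv for rv, qv in zip(r, attended_q)]
    pooled = [sum(col) / len(joined) for col in zip(*joined)]
    return dot(pooled, weights)


def pick_response(query, responses, objects, image_vec, weights):
    """Return the index of the highest-scoring response."""
    scores = [
        score_response(query, resp, objects, image_vec, weights)
        for resp in responses
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With text vectors of size T and visual vectors of size V, each grounded token has size T + V, so `weights` must have length 2 × (T + V).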

Results
vs. Text Only baselines
vs. VQA baselines
vs. Human
Up-to-date results: https://visualcommonsense.com/leaderboard

Qualitative Examples

Thank You
Sangmin Woo
sangminwoo.github.io
smwoo95@kaist.ac.kr
sangminwoo

Appendix: VCR task

Appendix: Annotation Interface

Appendix: Model Ablations

Appendix: Qualitative Examples
