Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
Visual Commonsense Reasoning?
With one glance at an image, we can effortlessly imagine the world beyond the
pixels.
We can infer people’s actions, goals, and mental states.
However, this kind of inference is tremendously difficult for today’s vision systems.
Visual Commonsense Reasoning!
Given a challenging question about an image, a machine must answer correctly
and then provide a rationale justifying its answer.
What’s new?
New task: Visual Commonsense Reasoning (VCR)
Given an image, answer a question and provide a rationale justifying the answer.
New dataset: VCR dataset
290K question, answer, and rationale triples (derived from 110K movie scenes)
Humans find VCR easy (over 90% accuracy)
State-of-the-art vision models struggle (~45%)
Multiple choice QA problems
Adversarial Matching: each question's correct answer is recycled exactly three
times, as a negative (incorrect) answer for three other questions.
New model: R2C (Recognition to Cognition Networks)
R2C narrows the gap between humans and machines (~65%)
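The recycling constraint behind Adversarial Matching can be illustrated with a small sketch. This is not the matching procedure from the paper: a fixed cyclic shift stands in for the matching step, so the sketch only shows how every correct answer ends up reused exactly three times as a negative. The function name `recycle_answers` and the toy answers are hypothetical.

```python
from collections import Counter


def recycle_answers(correct_answers, num_negatives=3):
    """Build four-way choice sets by reusing other questions' correct answers.

    Assumption: a cyclic shift replaces the real matching step. Question i
    borrows the correct answers of questions i+1, i+2, and i+3 (mod n) as
    its negatives, so each answer is recycled exactly num_negatives times.
    """
    n = len(correct_answers)
    if n <= num_negatives:
        raise ValueError("need more questions than negatives per question")
    choice_sets = []
    for i in range(n):
        negatives = [
            correct_answers[(i + shift) % n]
            for shift in range(1, num_negatives + 1)
        ]
        # Index 0 holds the correct answer; a real dataset would shuffle.
        choice_sets.append([correct_answers[i]] + negatives)
    return choice_sets


# Toy correct answers, one per question (hypothetical data).
answers = ["a0", "a1", "a2", "a3", "a4"]
choice_sets = recycle_answers(answers)

# Count how often each answer is used as a negative across all questions.
negative_counts = Counter(
    answer for choices in choice_sets for answer in choices[1:]
)
```

In this sketch every entry of `negative_counts` equals 3, which mirrors the "exactly three times" constraint stated above.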
VCR Task
New task: Visual Commonsense Reasoning (VCR)
Q->AR: VCR is cast as a four-way multiple-choice problem.
Answering (Q ->A): Given a question along with four answer choices, a
model must first select the right answer.
Justification (QA->R): If its answer was correct, then it is provided four
rationale choices and it must select the correct rationale.
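The three evaluation modes can be sketched as a small scoring function. This is a minimal illustration, assuming each prediction and each label is an `(answer_index, rationale_index)` pair and that a Q->AR prediction counts as correct only when both parts are right. The function name `vcr_accuracy` and the toy data are hypothetical.

```python
def vcr_accuracy(predictions, labels):
    """Return (Q->A, QA->R, Q->AR) accuracy for paired predictions and labels.

    Assumption: each item is an (answer_index, rationale_index) tuple, with
    both indices in range(4) because each step is a four-way choice.
    """
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    n = len(labels)
    pairs = list(zip(predictions, labels))
    q_a = sum(pred[0] == gold[0] for pred, gold in pairs) / n
    qa_r = sum(pred[1] == gold[1] for pred, gold in pairs) / n
    # Q->AR is credited only when answer AND rationale are both correct.
    q_ar = sum(pred == gold for pred, gold in pairs) / n
    return q_a, qa_r, q_ar


# Toy example with four questions (hypothetical data).
labels = [(0, 1), (2, 3), (1, 1), (3, 0)]
predictions = [(0, 1), (2, 0), (0, 1), (3, 0)]
q_a, qa_r, q_ar = vcr_accuracy(predictions, labels)

# Random guessing on two independent four-way choices.
chance_q_ar = (1 / 4) * (1 / 4)
```

Because both choices must be right, random guessing gives 1/16 (6.25%) in the Q->AR mode, compared with 25% in each individual step.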
The machine needs to understand activities, the roles of people, the
mental states of people, and the likely events before and after the scene.
The VCR task covers these categories and more.
VCR Dataset Construction
Interestingness
Adversarial Matching
Dataset Collection
R2C: Recognition to Cognition Networks
Ground the meaning of the query and each response.
Referring to the image for the two people
Contextualize the meaning of the query, response, and image together.
Resolving referent “he” and why one might be pointing in a diner
Reason about the interplay of relevant image regions, query, and response.
Determine social dynamics between person1 and person4
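The grounding step can be pictured with a small data-structure sketch. It is not the R2C implementation: it only shows how a query that mixes words with references to detected objects might be resolved into (text, region) pairs before any neural modeling. The token format, the 1-based display names such as `person4`, and the function name `ground_tokens` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple, Union

# Assumed token format: a plain word, or a list of indices into `objects`.
Token = Union[str, List[int]]


def ground_tokens(
    tokens: List[Token], objects: List[str]
) -> List[Tuple[str, Optional[int]]]:
    """Pair every token with the image region it refers to.

    A region index of None means the token has no explicit referent; a model
    could fall back to whole-image features there (an assumption here).
    """
    grounded: List[Tuple[str, Optional[int]]] = []
    for token in tokens:
        if isinstance(token, list):
            for index in token:
                # Display name is label + 1-based index, e.g. "person4".
                grounded.append((f"{objects[index]}{index + 1}", index))
        else:
            grounded.append((token, None))
    return grounded


# Hypothetical detections and query.
objects = ["person", "person", "cup", "person"]
query: List[Token] = ["Why", "is", [3], "pointing", "at", [0], "?"]
grounded = ground_tokens(query, objects)
```

Here the references `[3]` and `[0]` resolve to `person4` and `person1`, each tied to its detected region, while the ordinary words carry no region.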
Results
vs. Text Only baselines
vs. VQA baselines
vs. Human
Up-to-date results:
https://visualcommonsense.com/leaderboard