2022-04-21
Sangmin Woo
Computational Intelligence Lab.
School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
Visual Commonsense Reasoning
Visual Commonsense Reasoning?
With one glance at an image, we can effortlessly imagine the world beyond the
pixels.
We can infer people’s actions, goals, and mental states.
However, this kind of inference is tremendously difficult for today’s vision systems.
Visual Commonsense Reasoning!
Given a challenging question about an image, a machine must answer correctly
and then provide a rationale justifying its answer.
Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." CVPR 2019.
Visual Commonsense Reasoning?
Visual Commonsense Reasoning = Visual Question Answering + Rationale
What’s new?
New task: Visual Commonsense Reasoning (VCR)
  - Given an image, answer a question and provide a rationale justifying the answer.
New dataset: VCR dataset
  - 290K question-answer-rationale problems (derived from 110K movie scenes)
  - Humans find VCR easy (over 90% accuracy)
  - State-of-the-art vision models struggle (~45% accuracy)
  - Multiple-choice QA problems
  - Adversarial Matching: each question's correct answer is recycled exactly three times, as a negative answer for three other questions.
New model: R2C (Recognition to Cognition Networks)
  - R2C narrows the gap between humans and machines (~65% accuracy)
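The recycling constraint in Adversarial Matching can be illustrated with a small sketch. This is not the procedure from the paper, which chooses which answers to pair with which questions more carefully; it only shows the counting property stated above, using a simple cyclic shift. The function name `recycle_answers` and the dictionary layout are hypothetical, and the answers are assumed to be distinct.

```python
from collections import Counter


def recycle_answers(correct_answers, num_negatives=3):
    """Build multiple-choice problems by reusing other questions' correct answers.

    Assumption (illustrative only): negatives are picked by cyclic shift,
    so question i borrows the correct answers of questions i+1, i+2, i+3.
    This guarantees that every correct answer is reused exactly
    `num_negatives` times as a negative, once for each of three other questions.
    """
    n = len(correct_answers)
    if n <= num_negatives:
        raise ValueError("need more questions than negatives per question")

    problems = []
    for i, answer in enumerate(correct_answers):
        negatives = [
            correct_answers[(i + shift) % n]
            for shift in range(1, num_negatives + 1)
        ]
        problems.append({"correct": answer, "negatives": negatives})
    return problems


# Toy example with five questions, identified only by their correct answers.
answers = ["a0", "a1", "a2", "a3", "a4"]
problems = recycle_answers(answers)

# Count how often each answer is used as a negative across all problems.
negative_counts = Counter(
    neg for problem in problems for neg in problem["negatives"]
)
```

In this toy setup every problem has four choices (one correct, three recycled), and `negative_counts` shows each answer used three times as a negative, so every answer is correct exactly 25% of the time it appears.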
VCR Task
New task: Visual Commonsense Reasoning (VCR)
  - Q->AR: VCR is cast as a four-way multiple-choice problem.
  - Answering (Q->A): Given a question along with four answer choices, a model must first select the right answer.
  - Justification (QA->R): If its answer is correct, the model is then given four rationale choices and must select the correct rationale.
  - The machine needs to understand activities, the roles of people, the mental states of people, and the likely events before and after the scene.
  - The VCR task covers these categories and more (shown in a figure on the original slide).
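The two-stage protocol (Q->A, then QA->R) can be made concrete with a small scoring sketch. It assumes that rationale predictions are made with the correct answer given, and that the combined Q->AR score counts a problem as right only when both the answer and the rationale are right. The function name `vcr_accuracies` and the tuple layout are hypothetical, not taken from the VCR codebase.

```python
# With four choices per stage, random guessing gives 1/4 on each stage
# and 1/4 * 1/4 = 1/16 on the combined Q->AR setting.
CHANCE_SINGLE_STAGE = 1 / 4
CHANCE_Q_AR = CHANCE_SINGLE_STAGE ** 2


def vcr_accuracies(predictions, labels):
    """Compute Q->A, QA->R and Q->AR accuracy.

    Both arguments are lists of (answer_index, rationale_index) pairs,
    one pair per problem, with indices in 0..3.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")

    n = len(labels)
    pairs = list(zip(predictions, labels))
    q_a = sum(pred[0] == gold[0] for pred, gold in pairs) / n
    qa_r = sum(pred[1] == gold[1] for pred, gold in pairs) / n
    # Joint score: both the answer and the rationale must be correct.
    q_ar = sum(pred == gold for pred, gold in pairs) / n
    return {"Q->A": q_a, "QA->R": qa_r, "Q->AR": q_ar}


# Toy example: four problems, gold and predicted (answer, rationale) indices.
labels = [(0, 1), (2, 0), (1, 1), (0, 0)]
predictions = [(0, 1), (2, 3), (1, 1), (3, 0)]
scores = vcr_accuracies(predictions, labels)
# scores == {"Q->A": 0.75, "QA->R": 0.75, "Q->AR": 0.5}
```

The joint score is never higher than either single-stage score, which is why Q->AR is the hardest of the three settings.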
VCR Dataset Construction
Interestingness
Adversarial Matching
Dataset Collection
R2C: Recognition to Cognition Networks
Ground the meaning of the query and each response.
  - Referring to the image for the two people
Contextualize the meaning of the query, response, and image together.
  - Resolving the referent “he” and why one might be pointing in a diner
Reason about the interplay of relevant image regions, query, and response.
  - Determining the social dynamics between person1 and person4
Results
vs. text-only baselines
vs. VQA baselines
vs. Human
Up-to-date results:
https://visualcommonsense.com/leaderboard
Qualitative Examples
Thank You
Sangmin Woo
sangminwoo.github.io
smwoo95@kaist.ac.kr
sangminwoo
Appendix: VCR task
Appendix: Annotation Interface
Appendix: Model Ablations
Appendix: Qualitative Examples
