2022-04-21
Sangmin Woo
Computational Intelligence Lab.
School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
Visual Commonsense Reasoning
Visual Commonsense Reasoning?
With one glance at an image, we can effortlessly imagine the world beyond the
pixels.
We can infer people’s actions, goals, and mental states.
However, this kind of inference is tremendously difficult for today’s vision systems.
Visual Commonsense Reasoning!
Given a challenging question about an image, a machine must answer correctly
and then provide a rationale justifying its answer.
Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." CVPR 2019.
Visual Commonsense Reasoning?
Visual Commonsense Reasoning = Visual Question Answering + Rationale
What’s new?
New task: Visual Commonsense Reasoning (VCR)
- Given an image, answer a question and provide a rationale justifying the answer.
New dataset: VCR dataset
- 290K sets of questions, answers, and rationales (derived from 110K movie scenes)
- Humans find VCR easy (over 90% accuracy)
- State-of-the-art vision models struggle (~45%)
- Posed as multiple-choice QA problems
- Adversarial Matching: the correct answer to each question is recycled exactly three times, as a negative answer for three other questions.
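The recycling rule in Adversarial Matching can be illustrated with a small sketch. It shows only the bookkeeping (every correct answer is reused as a wrong choice for exactly three other questions) and assumes fixed cyclic offsets to pick the pairings. The paper instead chooses pairings with learned relevance and similarity scores, so the function name `recycle_answers` and the pairing rule here are hypothetical stand-ins, not the authors' procedure.

```python
from collections import Counter


def recycle_answers(correct_answers, num_negatives=3):
    """Build multiple-choice problems by reusing other questions' correct answers.

    Assumption: negatives are picked by cyclic offset, not by a learned matching.
    """
    n = len(correct_answers)
    if n <= num_negatives:
        raise ValueError("need more questions than negatives per question")
    problems = []
    for i, answer in enumerate(correct_answers):
        # Offsets 1..num_negatives never wrap back to i because n > num_negatives.
        negatives = [correct_answers[(i + k) % n] for k in range(1, num_negatives + 1)]
        problems.append({"correct": answer, "negatives": negatives})
    return problems


# Toy data: one correct answer per question.
answers = ["a0", "a1", "a2", "a3", "a4"]
problems = recycle_answers(answers)

# Count how often each answer appears as a negative across all problems.
usage = Counter(neg for p in problems for neg in p["negatives"])
```

With this construction each problem has one correct and three incorrect choices, and `usage` confirms that every answer serves as a negative the same number of times.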
New model: R2C (Recognition to Cognition Networks)
- R2C narrows the gap between humans and machines (~65%)
VCR Task
New task: Visual Commonsense Reasoning (VCR)
- Q->AR: VCR is cast as a four-way multiple-choice problem.
- Answering (Q->A): Given a question along with four answer choices, a model must first select the right answer.
- Justification (QA->R): If its answer is correct, the model is then given four rationale choices and must select the correct rationale.
- The machine needs to understand activities, the roles of people, the mental states of people, and the likely events before and after the scene.
- The VCR task covers these categories and more: (figure: VCR question categories)
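The three VCR subtasks (Q->A, QA->R, Q->AR) can be scored as in the sketch here. It assumes predictions and labels are given as choice indices 0 to 3, that QA->R is scored on every example, and that a Q->AR prediction counts only when both the answer and the rationale are right. The function name `vcr_accuracies` and the toy data are illustrative and not taken from the slides.

```python
def vcr_accuracies(answer_preds, answer_labels, rationale_preds, rationale_labels):
    """Return accuracy for each VCR subtask from per-example choice indices."""
    n = len(answer_labels)
    answer_hits = [p == y for p, y in zip(answer_preds, answer_labels)]
    rationale_hits = [p == y for p, y in zip(rationale_preds, rationale_labels)]
    # Q->AR is correct only if both the answer and the rationale are correct.
    joint_hits = [a and r for a, r in zip(answer_hits, rationale_hits)]
    return {
        "Q->A": sum(answer_hits) / n,
        "QA->R": sum(rationale_hits) / n,
        "Q->AR": sum(joint_hits) / n,
    }


# Toy example with four questions, each with four choices (indices 0..3).
scores = vcr_accuracies(
    answer_preds=[0, 1, 2, 3],
    answer_labels=[0, 1, 0, 3],
    rationale_preds=[1, 0, 2, 2],
    rationale_labels=[1, 3, 2, 2],
)
```

Under these assumptions, random guessing gives 1/4 on each four-way subtask and 1/16 on the joint Q->AR setting, which is why the joint score is the hardest of the three.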
VCR Dataset Construction
Interestingness
Adversarial Matching
Dataset Collection
R2C: Recognition to Cognition Networks
Ground the meaning of the query and each response.
- Referring to the image for the two people
Contextualize the meaning of the query, response, and image together.
- Resolving the referent “he” and why one might be pointing in a diner
Reason about the interplay of relevant image regions, query, and response.
- Determining the social dynamics between person1 and person4
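The "contextualize" step in R2C can be pictured with a toy sketch in which each response token attends over the query tokens and over the image regions, and the results are concatenated. The slides do not specify R2C's layers, so plain dot-product attention, the 2-d vectors, and the names `attend` and `contextualized` are all assumptions made for illustration rather than the model's actual implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(sources, targets):
    """For each source vector, return the attention-weighted average of targets."""
    dim = len(targets[0])
    attended = []
    for s in sources:
        # Dot-product score between this source vector and every target vector.
        scores = [sum(si * ti for si, ti in zip(s, t)) for t in targets]
        weights = softmax(scores)
        attended.append(
            [sum(w * t[d] for w, t in zip(weights, targets)) for d in range(dim)]
        )
    return attended


# Hypothetical 2-d embeddings (made-up numbers, not real features).
response_tokens = [[1.0, 0.0], [0.0, 1.0]]
query_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
image_regions = [[3.0, 0.0], [0.0, 3.0]]  # e.g. boxes for person1 and person4

attended_query = attend(response_tokens, query_tokens)
attended_regions = attend(response_tokens, image_regions)

# Concatenate each response token with its query and image context.
contextualized = [
    r + q + o for r, q, o in zip(response_tokens, attended_query, attended_regions)
]
```

In this sketch the first response token puts most of its weight on the first image region, which mirrors how a word such as “he” would be tied to one detected person before any reasoning over the combined representation takes place.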
Results
vs. Text Only baselines
vs. VQA baselines
vs. Human
Up-to-date results: https://visualcommonsense.com/leaderboard
Qualitative Examples
Thank You
Sangmin Woo
sangminwoo.github.io
smwoo95@kaist.ac.kr
sangminwoo
Appendix: VCR task
Appendix: Annotation Interface
Appendix: Model Ablations
Appendix: Qualitative Examples