Open-ended Visual Question-Answering
Issey Masuda Mora, Santiago Pascual de la Puente, Xavier Giró i Nieto
Roadmap
● Introduction
● Related Work
● Methodology
● Results
● Conclusions
● Future Work
Introduction
Visual Question-Answering
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).
Predict the answer to a given question about an image
Visual Question-Answering: Types
Image types: Real images, Abstract scenes
Answer types: Multi-Choice, Open-ended
Open-ended:
Q: Does it appear to be rainy?
A: no
Q: What is just under the tree?
A: a ball
Multi-Choice:
Q: How many slices of pizza are there?
A: 1, 2, 3, 4
Q: What is for dessert?
A: cake, ice cream, cheesecake, pie
Example
Question: What is bobbing in the water other than the boats?
Answer: buoys
Motivation
New visual Turing test
Motivation: AI research
● Multidisciplinary tasks
● Models able to perform more complex activities
● Different sub-problems tackled at once
Computer Vision + Knowledge Representation and Reasoning + Natural Language Processing
Related Work
Deep Learning
Credit: Google
VQA: Common approach
Question: What object is flying?
Visual representation: CNN
Textual representation: Word/sentence embedding + LSTM
Merge → Predict answer
Answer: Kite
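The merge-and-predict step of the common approach can be sketched as follows. This is a minimal illustration, not the model from this project: the feature vectors are hand-written stand-ins for the outputs of a CNN and an LSTM, merging is assumed to be plain concatenation, and `merge_and_predict`, `weights` and `answers` are hypothetical names.

```python
import math

def merge_and_predict(image_features, question_features, weights, answers):
    """Concatenate both representations and pick the best-scoring answer."""
    # Assumption: merging is a simple concatenation of the two vectors.
    merged = image_features + question_features
    # One linear score per candidate answer.
    scores = [sum(w * x for w, x in zip(row, merged)) for row in weights]
    # Softmax turns the scores into a probability distribution.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs

# Toy stand-ins: a real system would compute these with a CNN and an LSTM.
image_features = [0.9, 0.1]      # hypothetical visual representation
question_features = [0.2, 0.8]   # hypothetical textual representation
answers = ["kite", "dog"]
weights = [
    [1.0, 0.0, 0.0, 1.0],  # row that scores "kite"
    [0.0, 1.0, 1.0, 0.0],  # row that scores "dog"
]

prediction, probabilities = merge_and_predict(
    image_features, question_features, weights, answers
)
print(prediction)
```

Other merge operations (element-wise sum or product) fit the same structure; only the line that builds `merged` changes.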
Tools: Convolutional Neural Networks (CNN)
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
AlexNet
Tools: Word and Sentence embeddings
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
Experiments from: Socher et al. (2013b) and Collobert et al. (2011)
King - Man + Woman = Queen
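The analogy above can be reproduced on toy data. The sketch below uses hand-made 3-D vectors (axes: royalty, male, female), which are an assumption for illustration and not learned embeddings; `analogy` and `toy_vectors` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made vectors, only for illustration.
toy_vectors = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.9, 1.0, 0.0],
}

result = analogy(toy_vectors, "king", "man", "woman")
print(result)
```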
Tools: Long Short-Term Memory networks (LSTM)
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
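One time step of an LSTM cell can be sketched in a few lines. This is a single-unit cell with scalar input and state, written in the commonly used formulation with a forget gate (a later addition to the 1997 design). The parameter values and the names `lstm_step` and `params` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell (scalar input and state)."""
    def gate(name, activation):
        w, u, b = params[name]  # input weight, recurrent weight, bias
        return activation(w * x + u * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate value for the cell state

    c = f * c_prev + i * g             # updated cell state
    h = o * math.tanh(c)               # updated hidden state
    return h, c

# Hypothetical parameters, chosen by hand rather than learned.
params = {
    "forget":    (0.5, 0.5, 1.0),
    "input":     (1.0, 0.5, 0.0),
    "output":    (1.0, 0.5, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}

# Feed a short sequence; the final h summarises what the cell has seen.
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

For a question, each `x` would be a word embedding vector and the weights would be matrices; the gate equations stay the same.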
Methodology
First steps: Text-based QA
Extending text-based QA for VQA
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Substitute VGG-16 with KCNN
Liu, Z. (2015). Kernelized Deep Convolutional Neural Network for Describing Complex Images. arXiv preprint arXiv:1509.04581.
Sentence embedding and image projection
Image, Question → Answer
Results
VQA Dataset: Real Images, Open-ended questions
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual question answering. ICCV 2015.
1 (image) x 3 (questions) x 10 (answers)
Evaluation
Metric:
Script:
● Characters to lowercase
● Remove periods (except decimal points)
● Number words to digits
● Remove articles
● Add apostrophes to contractions
● Replace punctuation with spaces
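The normalization steps and the accuracy metric can be sketched as follows. The metric is the one stated in Antol et al. (2015): an answer scores min(number of humans who gave that answer / 3, 1). The lookup tables below are tiny illustrative subsets, not the tables of the official evaluation script, and `normalize_answer` and `vqa_accuracy` are hypothetical names.

```python
import re

# Illustrative subsets only; the official script uses larger tables.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3",
                "four": "4", "five": "5", "six": "6", "seven": "7",
                "eight": "8", "nine": "9", "ten": "10"}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "isnt": "isn't", "cant": "can't"}

def normalize_answer(answer):
    """Apply the normalization steps listed above to one answer string."""
    text = answer.lower()
    # Remove periods that are not followed by a digit (keeps "2.5").
    text = re.sub(r"\.(?!\d)", "", text)
    # Replace other punctuation with a space, keeping apostrophes and
    # decimal points. Done before splitting so words map cleanly below.
    text = re.sub(r"[^\w\s'.]", " ", text)
    words = []
    for word in text.split():
        word = NUMBER_WORDS.get(word, word)    # number words to digits
        word = CONTRACTIONS.get(word, word)    # add missing apostrophes
        if word not in ARTICLES:               # remove articles
            words.append(word)
    return " ".join(words)

def vqa_accuracy(predicted, human_answers):
    """min(#humans that gave the predicted answer / 3, 1)."""
    pred = normalize_answer(predicted)
    matches = sum(1 for h in human_answers if normalize_answer(h) == pred)
    return min(matches / 3.0, 1.0)

humans = ["kite"] * 2 + ["bird"] * 8   # 10 human answers for one question
print(normalize_answer("The Two dogs."))
print(vqa_accuracy("A kite", humans))
```

With this metric an answer counts as fully correct once three of the ten annotators agree with it, which tolerates disagreement among humans.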
VQA Challenge
CVPR 2016 VQA Challenge: 53.62%
Real Images Open-ended, test-standard dataset partition
Results in detail
          VALIDATION SET                     TEST SET
Model     Yes/No  Number  Other  Overall     Yes/No  Number  Other  Overall
Model 1   71.82   23.79   27.99  43.87       71.62   28.76   29.32  46.70
Model 3   75.02   28.60   29.30  46.32       -       -       -      -
Model 2   75.62   31.81   28.11  46.36       -       -       -      -
Model 5   78.15   32.79   33.91  50.32       78.15   36.20   35.26  53.03
Model 4   78.73   32.82   35.50  51.34       78.02   35.68   36.54  53.62
Results in context
Accuracy (scale from 0% to 100%):
● Humans: 83.30%
● UC Berkeley & Sony: 66.47%
● Baseline LSTM&CNN: 54.06%
● Baseline Nearest neighbor: 42.85%
● Baseline Prior per question type: 37.47%
● Baseline All yes: 29.88%
● Ours: 53.62%
Comparison with the baseline
Our model
● Single-word answers
● Generates answers
Baseline
● Multi-word answers (hardcoded)
● Classifies over the 1000 most common answers
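Classifying over the most common answers, as the baseline does, needs a fixed answer vocabulary. The sketch below shows one way to build it; it is an illustration under assumptions, not the baseline's code, and `build_answer_vocabulary` and `to_class_index` are hypothetical names.

```python
from collections import Counter

def build_answer_vocabulary(training_answers, size=1000):
    """Keep the `size` most frequent answers as classification targets."""
    counts = Counter(training_answers)
    return [answer for answer, _ in counts.most_common(size)]

def to_class_index(answer, vocabulary):
    """Return the class id of an answer, or None if it is out of vocabulary."""
    lookup = {word: index for index, word in enumerate(vocabulary)}
    return lookup.get(answer)

# Toy training answers; a real run would use the dataset's answers.
training_answers = ["yes", "yes", "no", "2", "yes", "no"]
vocabulary = build_answer_vocabulary(training_answers, size=2)
print(vocabulary)
print(to_class_index("2", vocabulary))
```

Any answer outside the vocabulary can never be predicted by a classifier of this kind, which is the limitation a generative model avoids.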
Qualitative results: I
Qualitative results: II
Deep Python Project
https://github.com/imatge-upc/vqa-2016-cvprw
Research contribution: Extended abstract
VQA workshop, CVPR 2016
Research contribution: Extended abstract - Poster
… ticket to Las Vegas
Presenting our poster and extended abstract at CVPR 2016, Las Vegas, USA
VQA Challenge statistics: Answering method
Conclusions
Conclusion
Goals accomplished
✓ Submit to the VQA Challenge, CVPR 2016
✓ First GPI project using text processing techniques
✓ Create a scalable VQA model
✓ Build a modular and reusable software package
✓ Extended abstract accepted to VQA workshop CVPR 2016
Conclusion
Personal overview
● Submission to VQA Challenge
● VQA, hot topic at CVPR 2016
● Model designed to generate answers instead of classifying them
● Question-Answer pair generation proposal
Future Work
Future work
Next steps
● Decoder for multi-word answers
● Character embedding
● Attention mechanisms
● Question-Answer pair generation
Automatic Question-Answer Pairs Generation
Thank You!
Do you have any questions?
Project resource links
● Thesis: https://imatge.upc.edu/web/sites/default/files/pub/xMasuda-Mora_0.pdf
● Web page: http://imatge-upc.github.io/vqa-2016-cvprw/
● Source code: https://github.com/imatge-upc/vqa-2016-cvprw
Motivation: First steps towards QA Generation
AI System
Question: What is the man doing?
Answer: Surf
VQA: Counterexample
Dynamic Parameter Prediction Network (DPPnet)
Noh, H., Seo, P. H., & Han, B. (2016). Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016.
Experiments: Batch Normalization
Losses I
Losses II
Losses III
VQA Challenge statistics: Image modelling
VQA Challenge statistics: Question modelling
