Open-ended Visual Question-Answering

http://imatge-upc.github.io/vqa-2016-cvprw/

This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-based Question-Answering. We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was successfully submitted to the Visual Question Answering Challenge 2016, where it achieved 53.62% accuracy on the test dataset. The developed software follows best programming practices and Python code style, providing a consistent Keras baseline for different configurations.
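
As a rough illustration of the pipeline described above, the following Keras sketch merges an LSTM encoding of the question with precomputed image features (e.g. VGG-16 fully-connected activations) and predicts a single-word answer with a softmax. All layer sizes, vocabulary sizes and names are illustrative assumptions rather than the exact thesis configuration; the thesis also explores K-CNN features and merging at the word level. A sketch of the offline VGG-16 feature extraction appears after the slide transcript.

```python
# Minimal merge-style VQA baseline in Keras (sizes and names are assumptions).
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate

VOCAB_SIZE = 10000    # question vocabulary size (assumption)
ANSWER_VOCAB = 1000   # single-word answer vocabulary size (assumption)
MAX_Q_LEN = 25        # maximum question length in tokens (assumption)
IMG_FEAT_DIM = 4096   # e.g. VGG-16 fc2 features, extracted offline
EMBED_DIM = 256
HIDDEN_DIM = 256

# Question branch: word embedding followed by an LSTM sentence encoding.
question = Input(shape=(MAX_Q_LEN,), dtype="int32", name="question_tokens")
q_embed = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(question)
q_encoding = LSTM(HIDDEN_DIM)(q_embed)

# Image branch: project precomputed CNN features to the same dimensionality.
image_feats = Input(shape=(IMG_FEAT_DIM,), name="image_features")
img_proj = Dense(HIDDEN_DIM, activation="relu")(image_feats)

# Merge both modalities and predict over the single-word answer vocabulary.
merged = Concatenate()([q_encoding, img_proj])
answer = Dense(ANSWER_VOCAB, activation="softmax", name="answer")(merged)

model = Model(inputs=[question, image_feats], outputs=answer)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```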

Open-ended Visual Question-Answering

  1. 1. Open-ended Visual Question-Answering [thesis][web][code] Issey Masuda Mora, Santiago Pascual de la Puente, Xavier Giró i Nieto
  2. 2. Roadmap Introduction Related Work Methodology Results Conclusions Future work 2
  3. 3. Introduction Related Work Methodology Results Conclusions Future Work Introduction 3
  4. 4. Visual Question-Answering Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433). 4
  5. 5. Predict the answer of a given question related to an image 5
  6. 6. Visual Question-Answering: Types 6 Real images Abstract scenes Multi-Choice Open-ended Q: Does it appear to be rainy? A: no Q: What is just under the tree? A: a ball Q: How many slices of pizza are there? A: 1, 2, 3, 4 Q: What is for dessert? A: cake, ice cream, cheesecake, pie
  7. 7. Example 7 Question: What is bobbing in the water other than the boats? Answer: buoys
  8. 8. Motivation 8 New visual Turing test
  9. 9. Motivation: AI research ● Multidisciplinary tasks ● Models able to perform more complex activities ● Different sub-problems tackled at once 9 Computer Vision Knowledge Representation and Reasoning Natural Language Processing
  10. 10. Introduction Related Work Methodology Results Conclusions Future Work Related Work 10
  11. 11. Deep Learning 11 Credit: Google
  12. 12. VQA: Common approach 12 A CNN produces the visual representation and a word/sentence embedding + LSTM produces the textual representation; the two are merged to predict the answer. Question: What object is flying? Answer: Kite
  13. 13. Tools: Convolutional Neural Networks (CNN) 13 Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105). AlexNet
  14. 14. Tools: Word and Sentence embeddings 14 Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119). Experiments from: Socher et al. (2013b) and Collobert et al. (2011). King - Man + Woman = Queen (see the embedding-analogy sketch after the slide transcript)
  15. 15. Tools: Long Short-Term Memory networks (LSTM) 15 Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
  16. 16. Introduction Related Work Methodology Results Conclusions Future Work Methodology 16
  17. 17. First steps: Text-based QA 17
  18. 18. Extending text-based QA for VQA 18 Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  19. 19. Substitute VGG-16 with KCNN 19 Liu, Z. (2015). Kernelized Deep Convolutional Neural Network for Describing Complex Images. arXiv preprint arXiv:1509.04581.
  20. 20. Sentence embedding and image projection 20 (diagram: Image and Question inputs, Answer output)
  21. 21. Introduction Related Work Methodology Results Conclusions Future Work Results 21
  22. 22. VQA Dataset: Real Images, Open-ended questions 22 Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. ICCV 2015. 1 (image) x 3 (questions) x 10 (answers)
  23. 23. Evaluation 23 Metric: an answer scores accuracy = min(# human annotators who gave that answer / 3, 1). Answer-normalization script: ● Characters to lowercase ● Remove periods (unless decimal periods) ● Number words to digits ● Remove articles ● Add apostrophe to contractions ● Replace punctuation with space (see the normalization sketch after the slide transcript)
  24. 24. VQA Challenge 24
  25. 25. 53.62% accuracy, CVPR 2016 VQA Challenge, Real Images Open-ended, test-standard dataset partition 25
  26. 26. Results in detail 26
          Model     Validation set                     Test set
                    Yes/No  Number  Other  Overall     Yes/No  Number  Other  Overall
          Model 1   71.82   23.79   27.99  43.87       71.62   28.76   29.32  46.70
          Model 3   75.02   28.60   29.30  46.32       -       -       -      -
          Model 2   75.62   31.81   28.11  46.36       -       -       -      -
          Model 5   78.15   32.79   33.91  50.32       78.15   36.20   35.26  53.03
          Model 4   78.73   32.82   35.50  51.34       78.02   35.68   36.54  53.62
  27. 27. Results in context 27 Humans 83.30%; UC Berkeley & Sony 66.47%; Baseline LSTM&CNN 54.06%; Baseline Nearest neighbor 42.85%; Baseline Prior per question type 37.47%; Baseline All yes 29.88%; Ours 53.62%
  28. 28. Comparison with the baseline 28 Our model: ● Single-word answers ● Generates answers Baseline: ● Multi-word answers (hardcoded) ● Classifies over the 1000 most common answers
  29. 29. Qualitative results: I 29
  30. 30. Qualitative results: II 30
  31. 31. Deep Python Project 31 https://github.com/imatge-upc/vqa-2016-cvprw
  32. 32. Research contribution: Extended abstract 32 VQA workshop, CVPR 2016
  33. 33. Research contribution: Extended abstract - Poster 33
  34. 34. … ticket to Las Vegas 34
  35. 35. Presenting our poster and extended abstract at CVPR 2016, Las Vegas, USA 35
  36. 36. VQA Challenge statistics: Answering method 36
  37. 37. Introduction Related Work Methodology Results Conclusions Future Work Conclusions 37
  38. 38. Conclusion 38 Goals accomplished: ✓ Submitted to the VQA Challenge, CVPR 2016 ✓ First GPI project using text-processing techniques ✓ Created a scalable VQA model ✓ Built a modular and reusable software package ✓ Extended abstract accepted to the VQA workshop, CVPR 2016
  39. 39. Conclusion Personal overview ● Submission to VQA Challenge ● VQA, hot topic at CVPR 2016 ● Model designed to generate answers instead of classifying them ● Question-Answer pair generation proposal 39
  40. 40. Introduction Related Work Methodology Results Conclusions Future Work Future Work 40
  41. 41. Future work 41 ● Decoder for multiple word answers ● Character embedding ● Attention mechanisms ● Question-Answer pairs generation Next steps
  42. 42. Automatic Question-Answer Pairs Generation 42
  43. 43. Thank You! 43 Do you have any questions?
  44. 44. Project resource links ● Thesis: https://imatge.upc.edu/web/sites/default/files/pub/xMasuda-Mora_0.pdf ● Web page: http://imatge-upc.github.io/vqa-2016-cvprw/ ● Source code: https://github.com/imatge-upc/vqa-2016-cvprw 44
  45. 45. Motivation: First steps towards QA Generation 45 AI system example: Question: What is the man doing? Answer: Surf
  46. 46. VQA: Counterexample 46 Dynamic Parameter Prediction Network (DPPnet) Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016
  47. 47. Experiments: Batch Normalization 47
  48. 48. Losses I 48
  49. 49. Losses II 49
  50. 50. Losses III 50
  51. 51. VQA Challenge statistics: Image modelling 51
  52. 52. VQA Challenge statistics: Question modelling 52
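
The sketches below illustrate components mentioned in the slides; the data, sizes, file names and helper names in them are assumptions for illustration, not the project's actual code.

Slide 14's analogy (King - Man + Woman = Queen) can be checked with cosine similarity over word vectors. Here tiny hand-made vectors keep the example self-contained; a real setup would use pretrained word2vec or GloVe embeddings of a few hundred dimensions.

```python
# Word-embedding analogy: king - man + woman ~= queen (toy vectors).
import numpy as np

embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),   # royal + male
    "queen": np.array([0.9, 0.1, 0.8]),   # royal + female
    "man":   np.array([0.1, 0.9, 0.1]),   # male
    "woman": np.array([0.1, 0.1, 0.9]),   # female
    "apple": np.array([0.0, 0.5, 0.5]),   # distractor
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vocab):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = vocab[a] - vocab[b] + vocab[c]
    scores = {w: cosine(target, vec) for w, vec in vocab.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

print(analogy("king", "man", "woman", embeddings))  # -> queen
```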
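Slides 18-20 add visual information through CNN features. A reasonable reading of "VGG-16 features", assumed here, is the activations of a fully-connected layer of an ImageNet-pretrained VGG-16, extracted offline for every image (the K-CNN substitution of slide 19 is not shown); the image path below is a placeholder.

```python
# Sketch: extracting 4096-d VGG-16 fc2 features for one image with Keras.
import numpy as np
from tensorflow.keras import Model
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

base = VGG16(weights="imagenet")                         # full VGG-16 with classifier head
extractor = Model(inputs=base.input,
                  outputs=base.get_layer("fc2").output)  # penultimate 4096-d layer

img = image.load_img("example.jpg", target_size=(224, 224))   # placeholder path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = extractor.predict(x)                          # shape (1, 4096)
print(features.shape)
```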
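Slide 23's evaluation uses the official VQA metric, where a predicted answer scores in proportion to how many of the ten human annotators gave it: accuracy = min(# matching human answers / 3, 1), applied after normalizing answers. The sketch below implements a simplified version of that normalization and metric; the official script covers many more punctuation and contraction cases.

```python
# Simplified sketch of VQA answer normalization and the accuracy metric.
import re

ARTICLES = {"a", "an", "the"}
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def normalize(answer):
    answer = answer.lower()
    answer = re.sub(r"(?<!\d)\.(?!\d)", "", answer)       # drop periods, keep decimals
    answer = re.sub(r"[^\w\s'.]", " ", answer)             # punctuation -> space
    tokens = [NUMBER_WORDS.get(t, t) for t in answer.split() if t not in ARTICLES]
    return " ".join(tokens)

def vqa_accuracy(predicted, human_answers):
    """min(# humans that gave the same normalized answer / 3, 1)."""
    pred = normalize(predicted)
    matches = sum(normalize(h) == pred for h in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("Two", ["2", "two", "3", "2", "2", "two dogs", "2", "2", "2", "2"]))  # 1.0
```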
