
Deep Language and Vision - Xavier Giro-i-Nieto - UPC Barcelona 2018


https://telecombcn-dl.github.io/2018-dlcv/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.


  1. Xavier Giro-i-Nieto (xavier.giro@upc.edu), Associate Professor, Universitat Politècnica de Catalunya (Technical University of Catalonia). Language and Vision. Day 4, Lecture 4. #DLUPC http://bit.ly/dlcv2018
  2. Acknowledgments: Antonio Bonafonte, Santiago Pascual
  3. Acknowledgments: Marta R. Costa-jussà
  4. Outline: 1. Motivation, 2. Image Captioning, 3. Visual Question Answering / Reasoning, 4. Joint Embeddings
  5. Outline: 1. Motivation, 2. Image and Video Captioning, 3. Visual Question Answering / Reasoning, 4. Joint Embeddings
  7. [course site]
  8. Neural Machine Translation: Text (English) → Text (French)
  9. Neural Machine Translation: Representation or Embedding. Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
  10. Representation or Embedding
  11. Encoder → Representation or Embedding → Decoder
  12. Encoder-Decoder: representation of the sentence (front and side views). Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015). Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014)
  13. Representation or Embedding: the Encoder
  14. Encoder in three steps: (1) one-hot encoding, (2) word embedding, (3) sequence summarization. Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
  15. (1) One-hot encoding: cat: x^T = [1,0,0,...,0]; dog: x^T = [0,1,0,...,0]; house: x^T = [0,0,0,...,0,1,0,...,0]
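
The idea on this slide translates directly to code. A minimal sketch in PyTorch, with a toy three-word vocabulary (the words and vocabulary size are illustrative; a real one has tens of thousands of entries):

```python
import torch

# Toy vocabulary: word -> index (illustrative, not from the slides).
vocab = {"cat": 0, "dog": 1, "house": 2}
V = len(vocab)

def one_hot(word: str) -> torch.Tensor:
    """Length-V vector with a single 1 at the word's index."""
    x = torch.zeros(V)
    x[vocab[word]] = 1.0
    return x

print(one_hot("cat"))    # tensor([1., 0., 0.])
print(one_hot("house"))  # tensor([0., 0., 1.])
```
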
  16. (2) Word embeddings. Figure: Christopher Olah, "Visualizing Representations"
  17. (3) Sequence summarization. Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
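
Steps (2) and (3) are typically an embedding layer followed by an RNN whose last hidden state serves as the summary h_T. A minimal sketch assuming a GRU, in the spirit of Cho et al.; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 64, hid_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)     # (2) word embedding
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer word indices
        e = self.embed(tokens)        # (batch, seq_len, emb_dim)
        _, h = self.rnn(e)            # (3) sequence summarization
        return h.squeeze(0)           # h_T: (batch, hid_dim)

enc = Encoder(vocab_size=10000)
summary = enc(torch.randint(0, 10000, (2, 7)))  # two sentences of 7 tokens
print(summary.shape)  # torch.Size([2, 128])
```
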
  18. Representation or Embedding: the Decoder
  19. Decoder: the recurrent state z_i of the decoder is determined by 1) the summary vector h_T, 2) the previous output word u_{i-1}, and 3) the previous state z_{i-1}. Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
  20. Decoder: with z_i updated, we can compute a probability p_i for each word i as an output of the RNN. Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
  21. Decoder: words are generated for the decoded sentence until an <EOS> (End Of Sentence) "word" is predicted. Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
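
Putting the three decoder ingredients together, a greedy generation loop might look as follows. This is a simplified sketch: here h_T only initializes the state, whereas Cho et al. also condition every step on it, and greedy argmax stands in for beam search. All sizes are illustrative:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Greedy GRU decoder. Simplified: h_T only initializes the state;
    Cho et al. also feed it at every step. Sizes are illustrative."""
    def __init__(self, vocab_size: int, emb_dim: int = 64, hid_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.GRUCell(emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def generate(self, h_T, bos: int, eos: int, max_len: int = 20):
        z, u, words = h_T, torch.tensor([bos]), []
        for _ in range(max_len):
            z = self.cell(self.embed(u), z)   # z_i from u_{i-1} and z_{i-1}
            u = self.out(z).argmax(dim=-1)    # pick the most probable word
            if u.item() == eos:               # stop once <EOS> is predicted
                break
            words.append(u.item())
        return words

dec = Decoder(vocab_size=10000)
print(dec.generate(h_T=torch.zeros(1, 128), bos=1, eos=2))
```
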
  22. Outline: 1. Motivation, 2. Image and Video Captioning, 3. Visual Question Answering / Reasoning, 4. Joint Embeddings
  23. Encoder → Representation or Embedding → Decoder
  24. Captioning: Show & Tell. Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015 [video]
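
The core trick of Show & Tell is to swap the sentence encoder of NMT for a CNN and feed the image embedding to the language-model LSTM as if it were the first word. A minimal sketch; ResNet-18 stands in for the paper's GoogLeNet, and all sizes are illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """CNN encoder + LSTM decoder in the spirit of Show and Tell."""
    def __init__(self, vocab_size: int, emb_dim: int = 256):
        super().__init__()
        cnn = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.img_proj = nn.Linear(512, emb_dim)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, image, caption):
        v = self.img_proj(self.cnn(image).flatten(1)).unsqueeze(1)  # (B,1,E)
        w = self.embed(caption)                                     # (B,T,E)
        h, _ = self.lstm(torch.cat([v, w], dim=1))  # image as the first "word"
        return self.out(h[:, 1:])                   # word logits per step

m = CaptionModel(vocab_size=10000)
logits = m(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```
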
  25. Captioning: DeepImageSent. Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015 (slides by Marc Bolaños)
  26. Captioning: DeepImageSent. The Multimodal Recurrent Neural Network only takes the image features into account in the first hidden state. Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015 (slides by Marc Bolaños)
  27. Multimodal Machine Translation. Challenge on Multimodal Image Translation: http://www.statmt.org/wmt17/multimodal-task.html#task1
  28. Captioning: Show, Attend & Tell. Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
  29. Captioning: Show, Attend & Tell. Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
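
What "Attend" adds over Show & Tell is a soft attention step: at each decoding step the model re-weights a grid of spatial CNN features using the current decoder state. A sketch of that attention module alone (dimensions are illustrative; 196 locations corresponds to a 14×14 feature map, as in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention over a grid of CNN features, Show-Attend-and-Tell style."""
    def __init__(self, feat_dim: int, hid_dim: int, att_dim: int = 128):
        super().__init__()
        self.wf = nn.Linear(feat_dim, att_dim)
        self.wh = nn.Linear(hid_dim, att_dim)
        self.v = nn.Linear(att_dim, 1)

    def forward(self, feats, h):
        # feats: (B, L, feat_dim) = L spatial locations; h: (B, hid_dim)
        scores = self.v(torch.tanh(self.wf(feats) + self.wh(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)          # where to look: (B, L, 1)
        context = (alpha * feats).sum(dim=1)      # attention-weighted summary
        return context, alpha.squeeze(-1)

att = SoftAttention(feat_dim=512, hid_dim=256)
ctx, alpha = att(torch.randn(2, 196, 512), torch.randn(2, 256))
print(ctx.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 196])
```
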
  30. Captioning (+ Detection): DenseCap. Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
  31. Captioning (+ Detection): DenseCap. Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
  32. Captioning (+ Detection): DenseCap. Example captions: XAVI: "man has short hair", "man with short hair"; AMAIA: "a woman wearing a black shirt"; BOTH: "two men wearing black glasses". Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
  33. Captioning: Video. Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code]
  34. Captioning: Video. Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016 (slides by Marc Bolaños). [Figure: a second-layer LSTM reads first-layer hidden states over time t = 1 … T, one per chunk of frames]
  35. Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
  36. Lipreading: Watch, Listen, Attend & Spell. Audio features and image features. Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
  37. Lipreading: Watch, Listen, Attend & Spell. Attention over the output states from audio and video is computed at each timestep. Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
  38. Lip Reading: LipNet. Input (video frames) and output (sentence) sequences are not aligned. Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016)
  39. Lip Reading: LipNet. CTC loss (Connectionist Temporal Classification): avoids the need for an alignment between the input and output sequences by predicting an additional blank token "_"; before computing the loss, repeated labels are merged and blank tokens are removed, so "_ a a _ _ b b" collapses to "a b", while "a _ a b _" collapses to "a a b" (a blank between repeats keeps them distinct). Graves et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006
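
Both the collapse rule and the loss are easy to see in code. A sketch using PyTorch's nn.CTCLoss; the shapes and the blank index are illustrative:

```python
import torch
import torch.nn as nn

# Collapse rule: merge repeated labels, then drop the blank "_".
def collapse(path):
    out, prev = [], None
    for s in path:
        if s != prev and s != "_":
            out.append(s)
        prev = s
    return out

print(collapse("a_ab_"))    # ['a', 'a', 'b']  (blank keeps the two a's apart)
print(collapse("_aa__bb"))  # ['a', 'b']

# The loss itself, on unaligned toy sequences (blank index 0 by convention).
T, B, C = 50, 4, 28                      # input timesteps, batch, classes
log_probs = torch.randn(T, B, C).log_softmax(2)
targets = torch.randint(1, C, (B, 10))   # target labels (no blanks)
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((B,), T, dtype=torch.long),
    target_lengths=torch.full((B,), 10, dtype=torch.long))
print(loss)
```
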
  40. Lip Reading: LipNet. Assael et al. "LipNet: End-to-End Sentence-level Lipreading." arXiv, Nov 2016
  41. Outline: 1. Neural Machine Translation (no vision here!), 2. Image and Video Captioning, 3. Visual Question Answering / Reasoning, 4. Joint Embeddings
  42. Visual Question Answering (VQA): encode the image as [z1, z2, …, zN] and the question as [y1, y2, …, yM], then decode the answer. Example: "Is economic growth decreasing?" → "Yes"
  43. Visual Question Answering (VQA). Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." CVPR 2015
  44. Visual Question Answering (VQA) pipeline: extract visual features, embed the question, merge, and predict the answer. Example: Question: "What object is flying?" → Answer: "Kite" (slide credit: Issey Masuda)
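
The generic pipeline on this slide fits in a few lines. A sketch in which an LSTM summarizes the question, pooled CNN features stand in for the image, and an element-wise product is one common choice of merge; the merge and all sizes are illustrative, not a specific paper's model:

```python
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    """Encode image + question, merge, classify the answer."""
    def __init__(self, vocab_size: int, n_answers: int, dim: int = 512):
        super().__init__()
        self.q_embed = nn.Embedding(vocab_size, 128)
        self.q_rnn = nn.LSTM(128, dim, batch_first=True)
        self.img_proj = nn.Linear(2048, dim)   # e.g. pooled CNN features
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, img_feat, question):
        _, (h, _) = self.q_rnn(self.q_embed(question))
        q = h.squeeze(0)                 # question summary
        v = self.img_proj(img_feat)      # image projection
        merged = torch.tanh(q * v)       # element-wise merge
        return self.classifier(merged)   # answer logits

m = VQABaseline(vocab_size=10000, n_answers=1000)
print(m(torch.randn(2, 2048), torch.randint(0, 10000, (2, 9))).shape)
# torch.Size([2, 1000])
```
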
  45. Visual Question Answering (VQA): image + question → answer. Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." ETSETB UPC TelecomBCN (2016)
  46. Visual Question Answering (VQA). Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual Question-Answering 2.0." ETSETB UPC TelecomBCN (2017)
  47. Visual Question Answering (VQA): Dynamic Parameter Prediction Network (DPPnet). Noh, H., Seo, P. H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016
  48. Visual Question Answering: Dynamic. Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016 (slides and slidecast by Santi Pascual)
  49. Visual Question Answering: Grounded. Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016 (slides and screencast by Issey Masuda)
  50. Visual Dialog (Image Guessing Game). Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017
  51. Visual Dialog (Image Guessing Game). Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017
  52. Visual Reasoning. Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
  53. Visual Reasoning: Program Generator and Execution Engine. Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry Zitnick, and Ross Girshick. "Inferring and Executing Programs for Visual Reasoning." ICCV 2017 (slides by Fran Roldán)
  54. Outline: 1. Neural Machine Translation (no vision here!), 2. Image and Video Captioning, 3. Visual Question Answering / Reasoning, 4. Joint Embeddings
  55. Joint Neural Embeddings: two encoders, one per modality, map into a shared Representation or Embedding
  56. Joint Neural Embeddings. Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013
  57. Joint Neural Embeddings. Zero-shot learning: a class not present in the image training set can still be predicted (e.g. no images of "cat" in the training set). Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. "Zero-shot learning through cross-modal transfer." NIPS 2013 [slides] [code]
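
A DeViSE-style model makes this concrete: project image features into a pretrained word-embedding space and train with a margin ranking loss so an image lands nearer its label's vector than a wrong label's. A sketch with random stand-in word vectors; the margin, sizes, and vectors are all illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemantic(nn.Module):
    """DeViSE-style projection of image features into a text embedding space."""
    def __init__(self, img_dim: int = 2048, txt_dim: int = 300):
        super().__init__()
        self.proj = nn.Linear(img_dim, txt_dim)

    def forward(self, img_feat):
        return F.normalize(self.proj(img_feat), dim=-1)

def ranking_loss(v, t_pos, t_neg, margin=0.1):
    # Push the true label's vector closer than a wrong label's, by a margin.
    pos = (v * t_pos).sum(-1)
    neg = (v * t_neg).sum(-1)
    return F.relu(margin - pos + neg).mean()

model = VisualSemantic()
word_vecs = F.normalize(torch.randn(10, 300), dim=-1)  # stand-in embeddings
v = model(torch.randn(4, 2048))
loss = ranking_loss(v, word_vecs[[0, 1, 2, 3]], word_vecs[[4, 5, 6, 7]])

# Zero-shot prediction: nearest word vector, even for classes never seen
# in the image training set (e.g. "cat" with no cat images).
pred = (v @ word_vecs.T).argmax(dim=-1)
print(loss.item(), pred)
```
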
  58. Joint Neural Embeddings: "Foggy Day". Alejandro Woodward, Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (submitted)
  60. Joint Neural Embeddings: image and text retrieval with joint embeddings. Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017
  61. Joint Neural Embeddings. Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017
  62. Joint Neural Embeddings. [Figure: an LSTM and a bidirectional LSTM encode the recipe into the joint embedding] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017
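
Once recipes and food images share an embedding space, retrieval reduces to nearest neighbors under cosine similarity. A short usage sketch with hypothetical precomputed, L2-normalized embeddings:

```python
import torch
import torch.nn.functional as F

# Hypothetical precomputed embeddings in a shared space (illustrative).
recipe_emb = F.normalize(torch.randn(1000, 512), dim=-1)  # 1000 recipes
image_emb = F.normalize(torch.randn(1, 512), dim=-1)      # one food photo

# Image-to-recipe retrieval: rank recipes by cosine similarity.
scores = image_emb @ recipe_emb.T          # (1, 1000)
top5 = scores.topk(5, dim=-1).indices
print(top5)  # indices of the 5 closest recipes
```
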
  63. Image to image and text. Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016
  64. Image to image and text. Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016
  65. Multilingual & Multimodal Embeddings. Gella, Spandana, Rico Sennrich, Frank Keller, and Mirella Lapata. "Image Pivoting for Learning Multilingual Multimodal Representations." arXiv preprint arXiv:1707.07601 (2017). Janarthanan Rajendran, Mitesh M. Khapra, Sarath Chandar, and Balaraman Ravindran. "Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning." NAACL 2016
  66. Outline: 1. Motivation, 2. Image Captioning, 3. Visual Question Answering / Reasoning, 4. Joint Embeddings
  67. Questions?
