
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)


https://telecombcn-dl.github.io/2017-dlcv/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.

Published in: Data & Analytics

  1. 1. [course site] Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya / Technical University of Catalonia Language and Vision Day 3 Lecture 5 #DLUPC
  2. 2. 2 Acknowledgments Antonio Bonafonte Santiago Pascual
  3. 3. 3 Acknowledgments Marta R. Costa-jussà
  4. 4. 4 Outline 1. Neural Machine Translation (no vision here!) 2. Image Captioning 3. Visual Question Answering / Reasoning 4. Joint Embeddings
  5. 5. 5 Outline 1. Neural Machine Translation (no vision here!) 2. Image and Video Captioning 3. Visual Question Answering / Reasoning 4. Joint Embeddings
  6. 6. 6
  7. 7. 7[course site]
  8. 8. 8 Neural Machine Translation (NMT) Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Representation or Embedding
  9. 9. 9 Encoder-decoder Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014). Language IN Language OUT RNNs
  10. 10. 10 Encoder-Decoder Front View Side View Representation of the sentence Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
  11. 11. Encoder
  12. 12. 12 Encoder in three steps Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) (2) (3) (3) Sequence summarization (2) Continuous-space Word Representation (word embedding) (1) One hot encoding Representation of the sentence
  13. 13. 13 Encoder: (1) One hot encoding Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) (2) (3) (3) Sequence summarization (2) Continuous-space Word Representation (word embedding) (1) One hot encoding Representation of the sentence
  14. 14. 14Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Example: words. cat: xT = [1,0,0, ..., 0] dog: xT = [0,1,0, ..., 0] . . house: xT = [0,0,0, …,0,1,0,...,0] . . . Number of words, |V| ? B2: 5K C2: 18K LVSR: 50-100K Wikipedia (1.6B): 400K Crawl data (42B): 2M Encoder: (1) One hot encoding
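The one-hot step above can be sketched in a few lines of NumPy; the three-word vocabulary is a toy stand-in for the 5K–2M word vocabularies listed on the slide.

```python
import numpy as np

def one_hot(word, vocab):
    """Map a word to a |V|-dimensional vector with a single 1."""
    x = np.zeros(len(vocab))
    x[vocab.index(word)] = 1.0
    return x

vocab = ["cat", "dog", "house"]
print(one_hot("cat", vocab))    # [1. 0. 0.]
print(one_hot("house", vocab))  # [0. 0. 1.]
```

Note how the vector length grows with |V|, which is why real systems cap the vocabulary size.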
  15. 15. 15 Encoder: (2) Word embedding Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) (2) (3) (3) Sequence summarization (2) Continuous-space Word Representation (word embedding) (1) One hot encoding
  16. 16. 16 Figure: Christopher Olah, Visualizing Representations Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." NIPS 2013 Video: Antonio Bonafonte @ DLSL 2017 Encoder: (2) Word embedding
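A word embedding is just a learned matrix E of shape |V| × d; multiplying a one-hot vector by E selects one row. A minimal sketch with random (untrained) weights and toy sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 3, 4                      # toy vocabulary size and embedding dimension
E = rng.standard_normal((V, d))  # embedding matrix, one row per word

def embed(one_hot_vec):
    # Multiplying a one-hot vector by E just selects the matching row of E.
    return one_hot_vec @ E

x = np.zeros(V); x[1] = 1.0      # one-hot vector for word index 1
assert np.allclose(embed(x), E[1])
```

In practice frameworks implement this as a table lookup rather than a matrix product, but the two are equivalent.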
  17. 17. 17Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) hT Encoder: (3) Recurrence
  18. 18. 18 Sequence Figure: Christopher Olah, “Understanding LSTM Networks” (2015) Activation function could be LSTM, GRU, QRNN, pLSTM... Encoder: (3) Recurrence
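The recurrence can be sketched with a plain tanh RNN cell standing in for the LSTM/GRU units named on the slide; weights here are random placeholders, and the point is only that a variable-length sentence collapses into one fixed-size vector hT.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                      # toy embedding and hidden sizes
Wx = rng.standard_normal((d_h, d_in))
Wh = rng.standard_normal((d_h, d_h))
b = np.zeros(d_h)

def rnn_encode(xs, Wx, Wh, b):
    """Run a simple tanh RNN over the embedded words xs; return final state hT.
    An LSTM, GRU, QRNN, etc. would slot in here as a different cell."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

sentence = [rng.standard_normal(d_in) for _ in range(5)]
hT = rnn_encode(sentence, Wx, Wh, b)  # fixed-size summary of the sentence
```

Whatever the sentence length, hT always has d_h components.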
  19. 19. Decoder
  20. 20. 20 Decoder: (1) Recurrence Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) (1) The Recurrent State (zi ) of the decoder is determined by: ● summary vector hT ● previous output word ui-1 ● previous state zi-1 hT
  21. 21. 21 Decoder: (2) Word Probabilities Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) (2) With zi ready, the output of the RNN estimates a probability pi for each word in the vocabulary: hT
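The probability step is a linear projection of the decoder state onto vocabulary scores followed by a softmax; a minimal sketch with toy, random weights:

```python
import numpy as np

def word_probabilities(z, Wout, bout):
    """Project decoder state z to vocabulary scores, normalize with softmax."""
    scores = Wout @ z + bout
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
V, d_h = 5, 3
p = word_probabilities(rng.standard_normal(d_h),
                       rng.standard_normal((V, d_h)),
                       np.zeros(V))
# p has one entry per vocabulary word and sums to 1
```

The max-subtraction trick changes nothing mathematically but avoids overflow for large scores.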
  22. 22. 22Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) (3) The word with the highest probability will be predicted as word sample ui Decoder: (3) Word Sample
  23. 23. 23 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) (3) An output sequence of words can be generated until an <EOS> (End Of Sentence) “word” is predicted. EOS Decoder: (3) Word Sample
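The generate-until-EOS loop above amounts to greedy decoding: feed the previous word back in, take the argmax, stop at the EOS id. The `toy_step` function below is an invented deterministic stand-in for a trained decoder, used only to exercise the loop.

```python
import numpy as np

def greedy_decode(step, z0, bos_id, eos_id, max_len=20):
    """Repeatedly call step(z, prev_word) -> (new_z, probs), take the argmax
    word each time, and stop when <EOS> is predicted."""
    z, word, out = z0, bos_id, []
    for _ in range(max_len):
        z, probs = step(z, word)
        word = int(np.argmax(probs))
        if word == eos_id:
            break
        out.append(word)
    return out

# Toy step: always predicts word (prev + 1), wrapping to 0 (our <EOS> id).
def toy_step(z, prev):
    probs = np.zeros(4)
    probs[(prev + 1) % 4] = 1.0
    return z, probs

print(greedy_decode(toy_step, None, bos_id=1, eos_id=0))  # [2, 3]
```

Real systems often replace the argmax with beam search, but the stopping condition is the same.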
  24. 24. 24 Encoder-Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Representation or Embedding
  25. 25. 25 Representation or Embedding Encoder Decoder
  26. 26. 26 Outline 1. Neural Machine Translation (no vision here!) 2. Image and Video Captioning 3. Visual Question Answering / Reasoning 4. Joint Embeddings
  27. 27. 27 (Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015 Image Captioning
  28. 28. 28 Representation or Embedding Encoder Decoder
  29. 29. 29 Captioning: DeepImageSent (Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015 only takes into account image features in the first hidden state Multimodal Recurrent Neural Network
  30. 30. 30 Captioning: Show & Tell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
  31. 31. 31 Captioning: Show & Tell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
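The Show & Tell idea of injecting the image only once can be sketched as: a CNN feature vector is projected to the decoder's initial hidden state h0, after which decoding proceeds on words alone. All sizes and weights below are toy placeholders, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_h, V = 8, 6, 5                    # toy sizes (not the paper's)
W_img = rng.standard_normal((d_h, d_img))  # projects the CNN feature to h0
Wx = rng.standard_normal((d_h, V))
Wh = rng.standard_normal((d_h, d_h))
Wout = rng.standard_normal((V, d_h))

def caption_step(h, word_id):
    """One decoder step: consume the previous word, emit word probabilities."""
    x = np.zeros(V); x[word_id] = 1.0
    h = np.tanh(Wx @ x + Wh @ h)
    scores = Wout @ h
    e = np.exp(scores - scores.max())
    return h, e / e.sum()

cnn_feature = rng.standard_normal(d_img)   # stand-in for a real CNN output
h0 = np.tanh(W_img @ cnn_feature)          # image enters only through h0
h1, probs = caption_step(h0, word_id=0)    # 0 plays the role of <BOS> here
```

This contrasts with attention-based captioners, which re-read image features at every step.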
  32. 32. 32 Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Captioning (+ Detection): DenseCap
  33. 33. 33 Captioning (+ Detection): DenseCap Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016
  34. 34. 34 Captioning (+ Detection): DenseCap Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 XAVI: “man has short hair”, “man with short hair” AMAIA: “a woman wearing a black shirt” BOTH: “two men wearing black glasses”
  35. 35. 35 Captioning: Video Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
  36. 36. 36 (Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang, "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning", CVPR 2016. [figure: second-layer LSTM units summarizing chunks of frame features from t = 1 to t = T] Captioning: Video
  37. 37. 37 Outline 1. Neural Machine Translation (no vision here!) 2. Image and Video Captioning 3. Visual Question Answering / Reasoning 4. Joint Embeddings
  38. 38. 38 Visual Question Answering (VQA) [z1 , z2 , … zN ] [y1 , y2 , … yM ] “Is economic growth decreasing ?” “Yes” Encode Encode Decode
  39. 39. 39 Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." CVPR 2015. Visual Question Answering (VQA)
  40. 40. 40 Extract visual features / Embedding / Merge / Predict answer. Question: What object is flying? Answer: Kite. Slide credit: Issey Masuda. Visual Question Answering (VQA)
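The extract/embed/merge/predict pipeline on the slide can be sketched as a classifier over a fixed answer set; elementwise-product fusion is one common merge choice (real VQA models also use concatenation or bilinear pooling), and the weights here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_answers = 6, 4                         # toy feature size and answer set
W_ans = rng.standard_normal((n_answers, d))

def vqa_predict(img_feat, q_feat):
    """Fuse image and question features, then score a fixed answer set."""
    merged = img_feat * q_feat  # elementwise-product fusion of both modalities
    scores = W_ans @ merged     # linear classifier over candidate answers
    return int(np.argmax(scores))

answer = vqa_predict(rng.standard_normal(d), rng.standard_normal(d))
```

Treating VQA as classification over frequent answers ("Kite", "Yes", ...) is the standard baseline formulation; open-ended generation uses a decoder instead.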
  41. 41. 41 Visual Question Answering (VQA) Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." ETSETB UPC TelecomBCN (2016). Image Question Answer
  42. 42. 42 Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016 Dynamic Parameter Prediction Network (DPPnet) Visual Question Answering (VQA)
  43. 43. 43 Visual Question Answering: Dynamic (Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
  44. 44. 44 Visual Question Answering: Grounded (Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei."Visual7W: Grounded Question Answering in Images." CVPR 2016.
  45. 45. 45 Visual Dialog (Image Guessing Game) Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017
  46. 46. 46 Visual Dialog (Image Guessing Game) Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017
  47. 47. 47 Visual Reasoning Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
  48. 48. 48 Visual Reasoning (Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry Zitnick, Ross Girshick , “Inferring and Executing Programs for Visual Reasoning”. arXiv 2017. Program Generator Execution Engine
  49. 49. 49 Outline 1. Neural Machine Translation (no vision here!) 2. Image and Video Captioning 3. Visual Question Answering / Reasoning 4. Joint Embeddings
  50. 50. 50 Representation or Embedding Encoder Encoder Joint Neural Embeddings
  51. 51. 51 Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep visual-semantic embedding model." NIPS 2013 Joint Neural Embeddings
  52. 52. 52 Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code] Zero-shot learning: a class not present in the training set of images can be predicted (eg. no images from “cat” in the training set) Joint Neural Embeddings
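Zero-shot prediction works because classification happens in the shared space: the image is projected next to word embeddings, and the nearest class embedding wins even if that class had no training images. A minimal sketch with hand-picked 2-D embeddings:

```python
import numpy as np

def zero_shot_classify(img_emb, class_embs, class_names):
    """Pick the class whose word embedding is closest (by cosine similarity)
    to the image's projection into the shared space."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(img_emb, e) for e in class_embs]
    return class_names[int(np.argmax(sims))]

# "cat" had no training images, yet it can still be predicted because its
# word embedding exists in the shared space.
names = ["dog", "cat"]
embs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(zero_shot_classify(np.array([0.1, 0.9]), embs, names))  # cat
```

The hard part in practice is learning the image-to-embedding projection; the classification rule itself is just nearest neighbor.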
  53. 53. 53 Alejandro Woodward, Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou and Shih-Fu Chang (work under progress) Joint Neural Embeddings Foggy Day
  54. 54. 54 Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 Image and text retrieval with joint embeddings. Joint Neural Embeddings
  55. 55. 55 Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 Joint Neural Embeddings
  56. 56. 56 Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 Joint Neural Embeddings joint embedding LSTM Bidirectional LSTM
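Once recipes and food images live in one joint space, cross-modal retrieval reduces to ranking by cosine similarity; a minimal sketch with made-up 2-D embeddings standing in for the LSTM/CNN outputs:

```python
import numpy as np

def retrieve(query_emb, gallery_embs, k=3):
    """Rank gallery items by cosine similarity to the query in the joint space."""
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return [int(i) for i in np.argsort(-(g @ q))[:k]]

gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
ranking = retrieve(np.array([1.0, 0.0]), gallery)  # best match first
print(ranking)  # [0, 2, 1]
```

The same function serves both directions (text query against image gallery, or image query against recipe gallery), which is the appeal of a shared embedding.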
  57. 57. 57 Outline 1. Neural Machine Translation (no vision here!) 2. Image and Video Captioning 3. Visual Question Answering / Reasoning 4. Joint Embeddings
  58. 58. 58 Thanks ! Q&A ? Follow me at https://imatge.upc.edu/web/people/xavier-giro @DocXavi /ProfessorXavi
