
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)


https://telecombcn-dl.github.io/2017-dlsl/

Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.

The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.



  1. Day 4 Lecture 4: Multimodal Deep Learning. Xavier Giró-i-Nieto [course site]
  2. Multimedia: Text, Audio, Vision, and ratings, geolocation, time stamps...
  3. Multimedia: Text, Audio, Vision, and ratings, geolocation, time stamps...
  4. (image-only slide)
  5. Language & Vision: Encoder-Decoder. Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015). Representation or embedding; see lectures D3L4 & D4L2 by Marta Ruiz on Neural Machine Translation.
  6. Language & Vision: Encoder-Decoder
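Slides 5-6 frame everything that follows: an encoder maps the input (a sentence, an image) to a representation, and a decoder generates the output sequence from it. As a point of reference, here is a minimal Keras sketch of that pattern; the vocabulary sizes and dimensions are illustrative assumptions, not the NMT configuration from the lecture.

```python
# Minimal encoder-decoder sketch (illustrative sizes, teacher-forced training).
import tensorflow as tf
from tensorflow.keras import layers, Model

SRC_VOCAB, TGT_VOCAB, DIM = 10000, 10000, 512  # assumed sizes

# Encoder: compress the source sequence into a fixed-size representation.
src = layers.Input(shape=(None,), dtype="int32")
e = layers.Embedding(SRC_VOCAB, DIM)(src)
_, state_h, state_c = layers.LSTM(DIM, return_state=True)(e)

# Decoder: generate the target sequence conditioned on that representation.
tgt = layers.Input(shape=(None,), dtype="int32")
d = layers.Embedding(TGT_VOCAB, DIM)(tgt)
d = layers.LSTM(DIM, return_sequences=True)(d, initial_state=[state_h, state_c])
out = layers.Dense(TGT_VOCAB, activation="softmax")(d)

model = Model([src, tgt], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```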
  7. Captioning: DeepImageSent (slides by Marc Bolaños). Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
  8. Captioning: DeepImageSent (slides by Marc Bolaños). Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015. The Multimodal Recurrent Neural Network only takes image features into account in the first hidden state.
  9. Captioning: Show & Tell. Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
  10. Captioning: Show & Tell. Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
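The captioning models on slides 7-10 reuse the encoder-decoder recipe with a CNN as the encoder. A minimal sketch, assuming a precomputed 2048-d image feature and illustrative sizes; here the image conditions the LSTM only through its initial state, mirroring the DeepImageSent remark on slide 8 (Show & Tell instead feeds the image embedding to the decoder as its first input):

```python
# Minimal CNN-to-LSTM captioner sketch (sizes are illustrative assumptions).
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, DIM = 10000, 512

img_feat = layers.Input(shape=(2048,))              # precomputed CNN feature
init = layers.Dense(DIM, activation="relu")(img_feat)

words = layers.Input(shape=(None,), dtype="int32")  # caption prefix
w = layers.Embedding(VOCAB, DIM)(words)
# The image enters only through the initial hidden state of the decoder.
h = layers.LSTM(DIM, return_sequences=True)(w, initial_state=[init, init])
next_word = layers.Dense(VOCAB, activation="softmax")(h)

model = Model([img_feat, words], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```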
  11. Captioning: Show, Attend & Tell. Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
  12. Captioning: Show, Attend & Tell. Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
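Instead of one image vector, Show, Attend & Tell lets the decoder look at a grid of local CNN features and re-weight them at every step. A sketch of the soft-attention step under assumed dimensions (a 14×14 grid of 512-d features, a 256-d scoring space):

```python
# Soft-attention sketch over local CNN features (grid and dimensions assumed).
import tensorflow as tf
from tensorflow.keras import layers

W_f = layers.Dense(256)  # projects each local feature
W_h = layers.Dense(256)  # projects the current decoder state
v_a = layers.Dense(1)    # scalar relevance score per location

def soft_attention(features, hidden):
    """features: (batch, 196, 512) from a 14x14 grid; hidden: (batch, dim)."""
    scores = v_a(tf.nn.tanh(W_f(features) + W_h(hidden)[:, tf.newaxis, :]))
    alpha = tf.nn.softmax(scores, axis=1)              # weights sum to 1
    context = tf.reduce_sum(alpha * features, axis=1)  # (batch, 512) expectation
    return context, alpha                              # alpha can be visualised
```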
  13. Captioning: LSTM with image & video. Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
  14. Captioning (+ Detection): DenseCap. Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
  15. Captioning (+ Detection): DenseCap. Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
  16. Captioning (+ Detection): DenseCap. Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016. XAVI: "man has short hair", "man with short hair". AMAIA: "a woman wearing a black shirt", … BOTH: "two men wearing black glasses".
  17. Captioning (+ Retrieval): DenseCap. Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
  18. Captioning: HRNE (slides by Marc Bolaños). Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016. (Figure: a second-layer LSTM unit reads, over time from t = 1 to t = T, the hidden state at t = T that summarizes the first chunk of data.)
  19. Visual Question Answering. The image is encoded as [z1, z2, … zN] and the question "Is economic growth decreasing?" as [y1, y2, … yM]; a decoder then produces the answer, "Yes" (Encode, Encode, Decode).
  20. Visual Question Answering (slide credit: Issey Masuda). Pipeline: extract visual features, embed the question, merge the two, and predict the answer. Question: "What object is flying?" → Answer: "Kite".
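A minimal sketch of the merge pipeline on slide 20, with assumed sizes: encode the question with an LSTM, project the image feature, merge the two element-wise, and classify over a fixed set of candidate answers.

```python
# Merge-style VQA baseline sketch (vocabulary, answer set and dimensions are
# illustrative assumptions).
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, N_ANSWERS, DIM = 10000, 1000, 512

question = layers.Input(shape=(None,), dtype="int32")
q = layers.Embedding(VOCAB, DIM)(question)
q = layers.LSTM(DIM)(q)                           # question embedding

img_feat = layers.Input(shape=(2048,))            # precomputed CNN feature
img_emb = layers.Dense(DIM, activation="tanh")(img_feat)

merged = layers.Multiply()([q, img_emb])          # element-wise merge
answer = layers.Dense(N_ANSWERS, activation="softmax")(merged)

model = Model([question, img_feat], answer)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```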
  21. Visual Question Answering. Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016. Dynamic Parameter Prediction Network (DPPnet).
  22. Visual Question Answering: Dynamic (slides and slidecast by Santi Pascual). Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.
  23. Visual Question Answering: Dynamic (slides and slidecast by Santi Pascual). Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016. Main idea: split the image into local regions and consider each region equivalent to a sentence. Local region feature extraction with a CNN (VGG-19): (1) rescale the input to 448×448; (2) take the output of the last pooling layer → 512×14×14 → 196 local region vectors of 512 dimensions. Visual feature embedding: a matrix W projects the image features into the textual "q" space.
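The region extraction described on slide 23 can be sketched directly in Keras; the projection width of W is an assumption (it must match the question-embedding size):

```python
# Region-extraction sketch for the DMN input module described above.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG19

cnn = VGG19(include_top=False, weights="imagenet",
            input_shape=(448, 448, 3))            # last pooling: 14x14x512
W = layers.Dense(512)                             # projection into q-space (size assumed)

def region_features(images):                      # images: (batch, 448, 448, 3)
    fmap = cnn(images)                            # (batch, 14, 14, 512)
    regions = tf.reshape(fmap, (-1, 196, 512))    # 196 "sentence-like" regions
    return W(regions)                             # (batch, 196, 512) in q-space
```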
  24. Visual Question Answering: Grounded (slides and screencast by Issey Masuda). Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
  25. Challenges: Visual Question Answering
  26. Challenges: Visual Question Answering. Accuracies (0-100% scale): Humans 83.30%; UC Berkeley & Sony 66.47%; Baseline LSTM&CNN 54.06%; Baseline Nearest neighbor 42.85%; Baseline Prior per question type 37.47%; Baseline All yes 29.88%; I. Masuda-Mora, "Open-Ended Visual Question-Answering", BSc thesis, ETSETB 2016 [Keras]: 53.62%.
  27. Vision and language: Embeddings. Perronnin, F., CVPR'14 Tutorial on LSVR: output embedding for LSVR.
  28. Vision and language: Embeddings. A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
  29. Vision and language: Embeddings. Lecture D2L4 by Toni Bonafonte on Word Embeddings; Christopher Olah, "Visualizing Representations".
  30. Vision and language: DeViSE. One-hot encoding vs. embedding space.
  31. Vision and language: DeViSE. A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
  32. Vision and language: DeViSE. Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013.
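The core of DeViSE is a linear map from CNN features into a pretrained word-embedding space, trained with a hinge ranking loss so the true label's word vector scores higher than other labels' vectors. A sketch, assuming 300-d word vectors and an illustrative margin value:

```python
# DeViSE-style projection and hinge ranking loss sketch (300-d word vectors
# and the margin value are assumptions).
import tensorflow as tf
from tensorflow.keras import layers

proj = layers.Dense(300)  # maps CNN features into the word-embedding space

def devise_loss(img_feat, label_vec, other_vecs, margin=0.1):
    """img_feat: (batch, d) CNN features; label_vec: (batch, 300) word vector
    of the true label; other_vecs: (n, 300) word vectors of other labels."""
    v = proj(img_feat)                                 # (batch, 300)
    pos = tf.reduce_sum(v * label_vec, axis=1)         # similarity to true label
    neg = tf.matmul(v, other_vecs, transpose_b=True)   # similarities to others
    # Penalise every wrong label that comes within `margin` of the true one.
    return tf.reduce_sum(tf.nn.relu(margin - pos[:, None] + neg), axis=1)
```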
  33. Vision and language: Embedding. Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou and Shih-Fu Chang (work in progress).
  34. Learn more: Julia Hockenmaier (UIUC), "Vision to Language" (@ Microsoft Research).
  35. Multimedia: Text, Audio, Vision, and ratings, geolocation, time stamps...
  36. Audio and Video: SoundNet. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Object & scene recognition in videos by analysing the audio track only.
  37. Audio and Video: SoundNet. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Videos for training are unlabeled; the model relies on CNNs trained on labeled images.
  38. Audio and Video: SoundNet. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Videos for training are unlabeled; the model relies on CNNs trained on labeled images.
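That teacher-student setup needs no audio labels: a 1-D convolutional network over the raw waveform is trained so its output distribution matches what pretrained image CNNs predict on the accompanying video frames. A much-shallower-than-published sketch, with assumed layer sizes and class counts:

```python
# SoundNet-style student sketch: a 1-D CNN on raw audio learns to mimic the
# class posteriors that pretrained image CNNs emit on the video frames.
# Layer sizes and the number of scene classes are assumptions; the published
# network is considerably deeper (conv1-conv8) with two teacher heads.
import tensorflow as tf
from tensorflow.keras import layers, Model

N_SCENES = 365  # assumed size of the scene-teacher output

waveform = layers.Input(shape=(None, 1))          # raw audio, any length
h = layers.Conv1D(16, 64, strides=2, activation="relu")(waveform)
h = layers.MaxPooling1D(8)(h)
h = layers.Conv1D(32, 32, strides=2, activation="relu")(h)
h = layers.GlobalAveragePooling1D()(h)
scene_logits = layers.Dense(N_SCENES)(h)

student = Model(waveform, scene_logits)
kl = tf.keras.losses.KLDivergence()
# Training step uses no audio labels, only the teacher's frame predictions:
# loss = kl(teacher_probs, tf.nn.softmax(student(audio_batch)))
```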
  39. Audio and Video: SoundNet. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  40. Audio and Video: SoundNet. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
  41. Audio and Video: SoundNet. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1.
  42. Audio and Video: SoundNet. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1.
  43. Audio and Video: SoundNet. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1.
  44. Audio and Video: SoundNet. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).
  45. Audio and Video: Sonorization. Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Learns to synthesize sounds from videos of people hitting objects with a drumstick.
  46. Audio and Video: Visual Sounds. Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Not end-to-end.
  47. Audio and Video: Visual Sounds. Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
  48. Speech and Video: Vid2Speech. Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017. A CNN (VGG) maps a frame from a silent video to an audio feature.
  49. Speech and Video: Vid2Speech. Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017. Not end-to-end.
  50. Speech and Video: Vid2Speech. Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017.
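Since Vid2Speech is not end-to-end, the learning problem reduces to a regression from video frames to intermediate acoustic features, from which a vocoder later reconstructs the waveform. A minimal sketch with assumed frame and feature sizes (the paper's CNN and feature definitions differ):

```python
# Frame-to-acoustic-feature regression sketch (all sizes are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, Model

AUDIO_DIM = 18  # assumed size of the acoustic feature vector per frame

frame = layers.Input(shape=(128, 128, 3))         # cropped face region
h = layers.Conv2D(32, 3, activation="relu")(frame)
h = layers.MaxPooling2D()(h)
h = layers.Conv2D(64, 3, activation="relu")(h)
h = layers.MaxPooling2D()(h)
h = layers.Flatten()(h)
audio_feat = layers.Dense(AUDIO_DIM)(h)           # acoustic features to vocode

model = Model(frame, audio_feat)
model.compile(optimizer="adam", loss="mse")       # plain regression
```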
  51. Speech and Video: LipNet. Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).
  52. Speech and Video: LipNet. Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).
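LipNet, by contrast, is end-to-end: spatiotemporal convolutions over the mouth region, recurrent layers, and per-frame character posteriors trained with CTC. A sketch with assumed shapes; the published model stacks several STCNN and Bi-GRU layers (the extra output unit is the CTC blank):

```python
# LipNet-flavoured sketch (shapes assumed: 75 frames of a 50x100 mouth crop).
import tensorflow as tf
from tensorflow.keras import layers, Model

video = layers.Input(shape=(75, 50, 100, 3))      # T x H x W x C mouth video
h = layers.Conv3D(32, (3, 5, 5), padding="same", activation="relu")(video)
h = layers.MaxPooling3D((1, 2, 2))(h)             # pool space, keep time
h = layers.Reshape((75, -1))(h)                   # one feature vector per frame
h = layers.Bidirectional(layers.GRU(256, return_sequences=True))(h)
chars = layers.Dense(28, activation="softmax")(h) # 26 letters + space + blank

model = Model(video, chars)
# Training minimises CTC (e.g. tf.nn.ctc_loss) between these per-frame
# posteriors and the unaligned target character sequence.
```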
  53. Speech and Video: Watch, Listen, Attend & Spell. Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).
  54. Speech and Video: Watch, Listen, Attend & Spell. Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).
  55. Conclusions
  56. Conclusions
  57. Conclusions [course site]
  58. Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop).
  59. Thanks! Q&A? Follow me at https://imatge.upc.edu/web/people/xavier-giro @DocXavi /ProfessorXavi
