
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)


Deep neural networks have driven the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading and video sonorization are among the first applications of a new and exciting field of research that exploits the generalisation properties of deep learning. This talk presents the latest results on how convolutional and recurrent neural networks are combined to find hidden patterns in multimedia.

https://re-work.co/events/deep-learning-summit-london-2017/

Published in: Data & Analytics


  1. DEEP LEARNING SUMMIT, London, 21 September 2017. Xavier Giro-i-Nieto (xavier.giro@upc.edu), Associate Professor, Universitat Politecnica de Catalunya (Technical University of Catalonia). "One Perceptron to Rule Them All: Deep Learning for Multimedia". #REWORKDL @DocXavi [Slides on GDrive]
  2. Text / Visual / Speech / Audio
  3. Text / Visual / Speech / Audio
  4. Text / Visual / Speech / Audio
  8. Text / Visual / Speech / Audio
  9. Text
  10. Neural Machine Translation: Text (English) → Text (French)
  11. Neural Machine Translation. Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015). Representation or Embedding.
  12. Representation or Embedding
  13. Representation or Embedding: Encoder → Decoder
  14. Embedding Space. Figure: Christopher Olah, "Visualizing Representations". Word embeddings.
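The embedding-space idea above can be illustrated with toy vectors. The values below are hypothetical 3-d vectors, not trained embeddings; the point is that, in a learned space, related words end up close together under cosine similarity.

```python
import numpy as np

# Toy word embeddings (hypothetical 3-d vectors, not trained values).
embeddings = {
    "cat":   np.array([0.9, 0.1, 0.0]),
    "dog":   np.array([0.8, 0.2, 0.1]),
    "paris": np.array([0.0, 0.9, 0.8]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word):
    # Return the most similar other word under cosine similarity.
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))
```

With these toy values, `nearest("cat")` returns `"dog"`, mirroring how real word embeddings cluster semantically related words.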
  15. Text / Visual / Speech / Audio
  16. Visual
  17. Colorization: BW → Visual
  18. Image to image. Encoder → Representation or Embedding → Decoder.
  19. Image to image: #pix2pix. Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." CVPR 2017.
  20. Image to image: #pix2pix (continued). Isola et al., CVPR 2017.
  21. Image to Gaze Saliency Map. Junting Pan, Cristian Canton, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i-Nieto. "SalGAN: Visual Saliency Prediction with Generative Adversarial Networks." CVPRW 2017.
  22. Generative Adversarial Training. Figure by Christopher Hesse on the Affinelayer blog.
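The adversarial training illustrated above pits two losses against each other. This is a minimal sketch of the standard GAN objective, not the exact losses used in pix2pix or SalGAN: the discriminator scores real samples near 1 and generated ones near 0, while the generator tries to push its samples' scores toward 1.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # The discriminator wants real samples scored near 1 and
    # generated (fake) samples scored near 0.
    return float(-np.mean(np.log(d_real) + np.log(1.0 - d_fake)))

def generator_loss(d_fake):
    # The generator wants the discriminator to score its samples as real.
    return float(-np.mean(np.log(d_fake)))
```

Training alternates the two updates: one step lowering the discriminator loss, one step lowering the generator loss, until the generated samples become hard to tell apart from real ones.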
  23. Text / Visual / Speech / Audio
  24. Text / Visual
  25. Image to Word. A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012. ("Orange")
  26. Image to Word. One-hot representation: [1,0,0], [0,1,0], [0,0,1]. Perronnin, F., "Output embedding for LSVR", CVPR'14 Tutorial on LSVR.
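The one-hot output representation on this slide is straightforward to sketch:

```python
import numpy as np

def one_hot(index, num_classes):
    # Output encoding for classification: all zeros except a single 1
    # at the position of the target class.
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v
```

For a 3-class problem, `one_hot(0, 3)` gives `[1, 0, 0]`; the network's softmax output is trained to match this target vector.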
  27. Image to Caption. Encoder → Representation or Embedding → Decoder.
  28. Image to Caption: Show & Tell. Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
  29. Image to Caption: Show, Attend & Tell. Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
  30. Image to Caption: DenseCap. Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
  31. Visual Question Answering (VQA). Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." ICCV 2015.
  32. Visual Question Answering (VQA) pipeline: visual representation (CNN) + textual representation (word/sentence embedding + LSTM) → merge → predict answer. Question: "What object is flying?" Answer: "Kite". Slide credit: Issey Masuda.
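The VQA pipeline above (CNN image feature + LSTM question feature → merge → predict answer) can be sketched with random stand-in features. All sizes, weights, and the answer vocabulary below are hypothetical; the real branches would be a trained CNN and a trained LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two branch outputs (hypothetical sizes; real systems
# use e.g. a 2048-d CNN feature and a 512-d question feature).
visual = rng.standard_normal(8)    # CNN image representation
textual = rng.standard_normal(4)   # word/sentence embedding + LSTM output

# "Merge" step: the simplest fusion is concatenation, followed by a
# linear classifier over a fixed vocabulary of candidate answers.
answers = ["kite", "dog", "ball"]
W = rng.standard_normal((len(answers), 8 + 4))

merged = np.concatenate([visual, textual])
scores = W @ merged
prediction = answers[int(np.argmax(scores))]
```

Concatenation is only the simplest merge; element-wise product or attention-based fusion are common alternatives in real VQA systems.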
  33. Video to Caption. Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
  34. Lipreading: Watch, Listen, Attend & Spell. Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017. Audio features + image features.
  35. Lipreading: Watch, Listen, Attend & Spell. Attention over the output states from audio and video is computed at each timestep. Chung et al., CVPR 2017.
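The per-timestep attention described above can be sketched as dot-product attention over a stack of encoder states. This is a generic sketch, not the exact Watch, Listen, Attend & Spell formulation: score each state against the decoder query, normalise with a softmax, and take the weighted sum.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, states):
    # Dot-product attention: score each encoder output state against the
    # decoder query, normalise the scores, and return the weighted sum.
    scores = states @ query
    weights = softmax(scores)
    return weights, weights @ states
```

In the lipreading model, this is computed at every output timestep, once over the audio states and once over the video states, letting the decoder pick which modality and which moment to attend to.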
  37. Text / Visual / Cross-Modal
  38. Joint embeddings. Encoder → Representation or Embedding → Decoder.
  39. Joint embeddings: DeViSE. Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013.
  40. Joint embeddings: DeViSE. Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. "Zero-shot learning through cross-modal transfer." NIPS 2013. [slides] [code] Zero-shot learning: a class not present in the training set of images can still be predicted, because it was present in the training set of words.
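The zero-shot mechanism described above can be sketched as nearest-neighbour classification in a word-embedding space. The class vectors below are toy values: "zebra" has a word vector even though, in a zero-shot scenario, no zebra image was seen during visual training.

```python
import numpy as np

# Toy word-embedding vectors for class labels (hypothetical values).
word_vecs = {
    "cat":   np.array([1.0, 0.0]),
    "dog":   np.array([0.8, 0.2]),
    "zebra": np.array([0.0, 1.0]),   # unseen as an image class
}

def classify(image_embedding):
    # Joint-embedding idea: project the image into the word space and
    # pick the nearest class vector, which may be an unseen class.
    return min(word_vecs,
               key=lambda w: float(np.linalg.norm(word_vecs[w] - image_embedding)))
```

An image whose projection lands near the "zebra" word vector is labelled "zebra" even though the visual model never saw one; that is the essence of zero-shot transfer through the shared space.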
  41. Joint embeddings: Pic2Recipe. Recipe retrieval from images (and vice versa). Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017.
  42. Joint embeddings: Pic2Recipe (continued). Salvador et al., CVPR 2017.
  43. Cross-modal Networks. Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." 2016.
  44. Cross-modal Networks (continued). Aytar et al., 2016.
  45. Text / Visual / Audio / Cross-Modal
  46. Cross-modal Networks: See, Hear & Read. Yusuf Aytar, Carl Vondrick, and Antonio Torralba. "See, Hear, and Read: Deep Aligned Representations." 2017.
  47. Cross-modal Networks: See, Hear & Read (continued). Aytar et al., 2017.
  48. Text / Visual / Speech / Audio
  49. Visual / Audio
  50. Image Sonorization. L. Chen, S. Srivastava, Z. Duan, and C. Xu. "Deep Cross-Modal Audio-Visual Generation." ACM International Conference on Multimedia Thematic Workshops, 2017. Discriminator loss: does the image-audio pair match?
  51. Image Sonorization (and vice versa). Chen et al., 2017. Discriminator loss: does the image-audio pair match?
  52. Image Sonorization. Chen et al., 2017.
  53. Video Sonorization. Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Learns to synthesize sounds from videos of people hitting objects with a drumstick.
  54. Video Sonorization (not end-to-end). Owens et al., CVPR 2016.
  56. Video Sonorization. Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto (work in progress).
  57. Video Sonorization: the visual feature is matched against audio features to find the best match. Surís et al. (work in progress).
  58. Video Sonorization: best match between visual and audio features. Surís et al. (work in progress).
  59. Video Sonorization: best match between visual and audio features. Surís et al. (work in progress).
  60. Video Sonorization (and vice versa): best match between visual and audio features. Surís et al. (work in progress).
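The "best match" step in these slides can be sketched as a cosine-similarity search in a shared embedding space. This is a generic retrieval sketch, since the work is described as in progress:

```python
import numpy as np

def best_match(visual_feature, audio_bank):
    # Cross-modal retrieval: in a shared embedding space, the "best match"
    # is the audio feature with highest cosine similarity to the visual one.
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return int(np.argmax([cos(visual_feature, a) for a in audio_bank]))
```

Running the search in the other direction (query with an audio feature, rank visual features) gives the "vice versa" case on slide 60.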
  61. Learning audio representations from video. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Teacher networks for visual object and scene detection generate weak labels.
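SoundNet's teacher-student setup trains the audio network to match the output distributions of visual teacher networks, so no manual labels are needed. A minimal sketch of such a distillation loss, assuming softmax outputs and plain KL divergence (the specific loss details are an assumption here):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits):
    # KL divergence between the teacher's output distribution and the
    # student's: minimising it makes the student mimic the teacher,
    # which is how the visual networks provide weak labels for audio.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.sum(p * np.log(p / q)))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge.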
  63. Learning audio representations from video. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
  64. Text / Visual / Speech / Audio
  65. Visual / Speech
  66. Video to Speech (Speechreading). Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICCVW 2017. Trainable modules.
  68. Speech (+ Identity) to Video Synthesis. Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?" BMVC 2017.
  70. Speech to Video Synthesis (mouth). Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from audio." SIGGRAPH 2017.
  72. Speech to Video Synthesis (pose & emotion). Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end learning of pose and emotion." SIGGRAPH 2017.
  76. Deep Learning online courses by UPC: http://dlai.deeplearning.barcelona/ (autumn semester) · https://telecombcn-dl.github.io/2017-dlcv/ · http://imatge-upc.github.io/telecombcn-2016-dlcv/ · https://telecombcn-dl.github.io/2017-dlsl/. Winter School: 24-30 January 2018; Summer School: early July 2018.
  77. #REWORKDL @DocXavi [Slides on GDrive with links]. Xavier Giro-i-Nieto, xavier.giro@upc.edu.
