
One Perceptron to Rule Them All: Deep Learning for Multimedia

Talk at the Insight Machine Learning Workshop 2017. #InsightDL2017

  1. One Perceptron to Rule Them All: Deep Learning for Multimedia. Xavier Giro-i-Nieto (xavier.giro@upc.edu), Associate Professor, Universitat Politecnica de Catalunya (Technical University of Catalonia). Deep Learning Workshop, Dublin City University, 27-28 April 2017. Spotlight #InsightDL2017 @DocXavi
  2-4. Modality diagram: Text, Audio, Image, EEG, Eye gaze, Sentiment (repeated across three slides).
  5-7. (image-only slides, no extractable text)
  8. Modality diagram: Text, Audio, Visual, EEG, Eye gaze, Sentiment.
  9. Image to Tag (e.g. an image classified as "Orange"). Krizhevsky, A., Sutskever, I., and Hinton, G. E. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
  10. Image to Tag: classes as one-hot representations, e.g. [1,0,0], [0,1,0], [0,0,1] (sketched below). Perronin, F. "Output embedding for LSVR." CVPR'14 Tutorial on LSVR.
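
A minimal sketch of the one-hot label representation on slide 10, in plain NumPy (the three-class vocabulary is a made-up example):

```python
import numpy as np

# Hypothetical three-class tag vocabulary for illustration.
classes = ["dog", "cat", "orange"]

def one_hot(label, classes):
    """Return the one-hot vector for `label`, e.g. [0, 0, 1] for 'orange'."""
    vec = np.zeros(len(classes), dtype=np.float32)
    vec[classes.index(label)] = 1.0
    return vec

print(one_hot("orange", classes))  # -> [0. 0. 1.]
```
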
  11. Image to Tags: Adjective Noun Pairs, e.g. CRYING BABY vs SMILING BABY, STORMY LANDSCAPE vs SUNNY LANDSCAPE. Borth, Damian, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. "Large-scale visual sentiment ontology and detectors using adjective noun pairs." ACMMM 2013.
  12. Image to Tags: Adjective Noun Pairs, with the percentage contribution of the adjective and of the noun to each prediction. Fernandez, Delia, Alejandro Woodward, Victor Campos, Xavier Giro-i-Nieto, Brendan Jou, and Shih-Fu Chang. "More cat than cute? Interpretable Prediction of Adjective-Noun Pairs."
  13. Image to Tags: Adjective Noun Pairs. Interpretation: noun-oriented vs adjective-oriented predictions, e.g. "Cute Cat" (noun-oriented) vs "Foggy Day" (adjective-oriented). Fernandez et al., "More cat than cute? Interpretable Prediction of Adjective-Noun Pairs."
  14. Image to Tags: Adjective Noun Pairs. Interpretation: contributing adjectives and nouns (a toy sketch of the idea follows). Fernandez et al., "More cat than cute? Interpretable Prediction of Adjective-Noun Pairs."
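
The paper defines its own contribution measure; purely as an illustration of the idea, here is a hypothetical two-head PyTorch model whose ANP score is an explicit sum of an adjective term and a noun term, so each term's share can be reported as a contribution percentage (the layer sizes and the additive decomposition are assumptions, not the authors' architecture):

```python
import torch
import torch.nn as nn

class ANPSketch(nn.Module):
    """Toy ANP scorer: score(adj, noun) = adjective term + noun term."""
    def __init__(self, feat_dim=512, n_adj=10, n_noun=10):
        super().__init__()
        self.adj_head = nn.Linear(feat_dim, n_adj)    # adjective logits
        self.noun_head = nn.Linear(feat_dim, n_noun)  # noun logits

    def forward(self, feats, adj_idx, noun_idx):
        a = self.adj_head(feats)[:, adj_idx]          # adjective term
        n = self.noun_head(feats)[:, noun_idx]        # noun term
        adj_pct = a.abs() / (a.abs() + n.abs())       # share of the score
        return a + n, adj_pct, 1.0 - adj_pct

model = ANPSketch()
feats = torch.randn(1, 512)                      # stand-in CNN features
score, adj_pct, noun_pct = model(feats, 3, 7)    # hypothetical "cute" + "cat"
```
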
  15. Modality diagram: Text (English) to Text (French), alongside Audio, EEG, Eye gaze, Sentiment.
  16. Neural Machine Translation: the representation or embedding. Cho, Kyunghyun. "Introduction to Neural Machine Translation with GPUs" (2015).
  17. Representation or embedding.
  18. Representation or embedding: encoder and decoder (sketched below).
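
The encoder-decoder pattern of slides 16-18, sketched with GRUs in PyTorch. This is the generic sequence-to-sequence recipe rather than Cho et al.'s exact model; vocabulary sizes and dimensions are arbitrary:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder compresses the source sentence into one vector (the
    'representation or embedding'); the decoder unrolls it into the target."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        _, h = self.encoder(self.src_emb(src))       # h: sentence embedding
        dec_out, _ = self.decoder(self.tgt_emb(tgt), h)
        return self.out(dec_out)                     # logits per position

model = Seq2Seq()
logits = model(torch.randint(0, 1000, (2, 7)),       # source token ids
               torch.randint(0, 1000, (2, 9)))       # shifted target ids
```
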
  19. Word embeddings (sketched below). Olah, Christopher. "Visualizing Representations."
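
Word embeddings (slide 19) are rows of a learned lookup table; a minimal sketch, with vocabulary size and dimension chosen arbitrarily:

```python
import torch
import torch.nn as nn

# Map a 10,000-word vocabulary to 300-d vectors.
emb = nn.Embedding(num_embeddings=10000, embedding_dim=300)
word_ids = torch.tensor([4, 250, 17])   # indices from some tokenizer
vectors = emb(word_ids)                 # shape: (3, 300)

# After training, nearby vectors correspond to related words;
# cosine similarity is the usual probe.
sim = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
```
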
  20. Modality diagram: Visual to Text, alongside Audio, EEG, Eye gaze, Sentiment.
  21. Representation or embedding: encoder and decoder.
  22. Captioning: Show & Tell (sketched below). Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
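
Show & Tell (slide 22) is the encoder-decoder above with the source-text encoder replaced by a CNN. A minimal sketch with a torchvision backbone; greedy decoding and the training loop are omitted, and all sizes are illustrative:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionNet(nn.Module):
    """A CNN encodes the image; an LSTM decodes it into a word sequence."""
    def __init__(self, vocab=5000, dim=512):
        super().__init__()
        cnn = models.resnet18()                       # stand-in backbone
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.img_proj = nn.Linear(512, dim)           # image feature -> state
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, images, captions):
        feats = self.cnn(images).flatten(1)           # (B, 512)
        h0 = self.img_proj(feats).unsqueeze(0)        # init hidden with image
        dec, _ = self.lstm(self.emb(captions), (h0, torch.zeros_like(h0)))
        return self.out(dec)                          # next-word logits

model = CaptionNet()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
```
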
  23-24. Captioning: DeepImageSent (slides by Marc Bolaños). Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
  25-26. Captioning: Show, Attend & Tell (the attention step is sketched below). Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
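
The attention step of Show, Attend & Tell (slides 25-26), sketched: at each decoding step the decoder scores the CNN's spatial feature grid and reads a weighted sum instead of one global vector. This is generic soft attention with illustrative dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Score each spatial location against the decoder state and return the
    attention-weighted context vector (the 'where to look' step)."""
    def __init__(self, feat_dim=512, hid_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim + hid_dim, 1)

    def forward(self, feats, h):
        # feats: (B, L, feat_dim) CNN feature grid, e.g. L = 14 * 14
        # h:     (B, hid_dim) current decoder state
        h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        e = self.score(torch.cat([feats, h_exp], dim=-1)).squeeze(-1)
        alpha = F.softmax(e, dim=1)                        # attention map
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)
        return context, alpha

attn = SoftAttention()
ctx, alpha = attn(torch.randn(2, 196, 512), torch.randn(2, 512))
```
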
  27. Captioning (+ detection): DenseCap. Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
  28. Captioning for video. Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
  29. Visual Question Answering (VQA). Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." ICCV 2015.
  30. VQA pipeline (slide credit: Issey Masuda): a CNN gives the visual representation; a word/sentence embedding plus LSTM gives the textual representation; the two are merged to predict the answer. Example: question "What object is flying?", answer "Kite". (Sketched below.)
  31. VQA: image + question to answer. Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." arXiv:1610.02692 (2016).
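
The VQA pipeline of slide 30 as a sketch: CNN features for the image, an LSTM over word embeddings for the question, an element-wise merge, and a classifier over a fixed answer vocabulary. The merge choice and all sizes are illustrative, not a specific paper's settings:

```python
import torch
import torch.nn as nn

class VQASketch(nn.Module):
    def __init__(self, vocab=8000, n_answers=1000, dim=512):
        super().__init__()
        self.img_proj = nn.Linear(2048, dim)   # e.g. pooled CNN features
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, img_feats, question):
        v = torch.tanh(self.img_proj(img_feats))   # visual representation
        _, (h, _) = self.lstm(self.emb(question))
        q = torch.tanh(h[-1])                      # textual representation
        return self.classifier(v * q)              # merge, then answer logits

model = VQASketch()
logits = model(torch.randn(2, 2048), torch.randint(0, 8000, (2, 10)))
```
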
  32. Video to Text: LipNet. Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv:1611.01599 (2016).
  33. Video to Text: Watch, Listen, Attend & Spell. Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017.
  34. Representation or embedding: two encoders mapping into a joint embedding.
  35. Joint embeddings. Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013.
  36. Joint embeddings enable zero-shot learning: a class not present in the training set of images can still be predicted. Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. "Zero-shot learning through cross-modal transfer." NIPS 2013. [slides] [code]
  37. Joint embeddings. Campos, Víctor, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (work in progress).
  38. Joint embeddings for image and text retrieval (a ranking-loss sketch follows). Salvador, Amaia, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017.
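
Joint embeddings (slides 34-38) map both modalities into one space where matching pairs score higher than mismatched ones. A minimal sketch with a hinge ranking loss (DeViSE uses a similar margin objective; the linear encoders here are stand-ins for the real networks, and the margin is arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

img_enc = nn.Linear(2048, 300)   # stand-in for CNN + projection
txt_enc = nn.Linear(300, 300)    # stand-in for a word/sentence encoder

def embed(x, enc):
    return F.normalize(enc(x), dim=-1)   # unit norm: dot product = cosine

img = embed(torch.randn(8, 2048), img_enc)   # batch of image features
txt = embed(torch.randn(8, 300), txt_enc)    # matching text features

# Matching pairs sit on the diagonal of the similarity matrix;
# push them above every mismatch by a margin.
sim = img @ txt.t()                          # (8, 8)
pos = sim.diag().unsqueeze(1)
cost = F.relu(0.2 + sim - pos)               # hinge on each mismatch
mask = 1.0 - torch.eye(sim.size(0))
loss = (cost * mask).mean()
```

Because unseen class names still have word vectors, a nearest-neighbour lookup in this shared space is what yields the zero-shot predictions of slide 36.
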
  39. Modality diagram: Visual (in) to Visual (out), alongside Audio, Text, Eye gaze, Sentiment.
  40. Representation or embedding: encoder and decoder, image to image.
  41-42. Image to image: colorization. Garcia, Victor, and Xavier Giro-i-Nieto. "Image Colorization with Conditional Generative Adversarial Networks" (work in progress).
  43. Image to image: #pix2pix (the conditional-GAN objective is sketched below). Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." arXiv:1611.07004 (2016).
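
The conditional-GAN objective behind the colorization and pix2pix slides (41-43), sketched at the loss level: the generator maps the input image to an output image, and the discriminator judges (input, output) pairs. Both networks below are trivial stand-ins for the real U-Net generator and PatchGAN discriminator, and the L1 weight is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Conv2d(1, 3, 3, padding=1)   # stand-in generator: gray -> color
D = nn.Conv2d(4, 1, 3, padding=1)   # stand-in discriminator on (in, out) pairs

gray = torch.randn(2, 1, 64, 64)    # conditioning input
real = torch.randn(2, 3, 64, 64)    # ground-truth color image
fake = G(gray)

def d_score(inp, out):
    """Judge the (input, output) pair, one logit per spatial location."""
    return D(torch.cat([inp, out], dim=1))

bce = F.binary_cross_entropy_with_logits
ones, zeros = torch.ones(2, 1, 64, 64), torch.zeros(2, 1, 64, 64)

# Discriminator: real pairs -> 1, generated pairs -> 0.
d_loss = bce(d_score(gray, real), ones) + bce(d_score(gray, fake.detach()), zeros)

# Generator: fool D, plus an L1 term keeping the output near the ground truth.
g_loss = bce(d_score(gray, fake), ones) + 100.0 * F.l1_loss(fake, real)
```
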
  44. Modality diagram: Visual (in) to Visual (out), alongside Audio, Text, Eye gaze, Sentiment.
  45-47. Image to image and text. Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." arXiv:1610.09003 (2016).
  48. One encoder feeding two decoders.
  49. Image to image, tags & caption (multitask). Salvador, Amaia, Manel Baradad, Xavier Giro-i-Nieto, and Ferran Marques (work in progress).
  50. Modality diagram: Text, Audio, Visual, EEG, Eye gaze, Sentiment.
  51. Image to Sentiment with a CNN. Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto, and Brendan Jou. "Diving deep into sentiment: Understanding fine-tuned CNNs for visual sentiment prediction." Workshop on Affect & Sentiment in Multimedia, ACM 2015, pp. 57-62.
  52. Image to Sentiment (a fine-tuning sketch follows). Campos, Victor, Brendan Jou, and Xavier Giro-i-Nieto. "From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction." Image and Vision Computing (2017).
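
Fine-tuning a CNN for sentiment as in slides 51-52 amounts to swapping the classifier for a two-way (positive/negative) head and training at a low learning rate so the pretrained features shift only slightly. A minimal sketch; the papers fine-tuned CaffeNet-style models, so the resnet18 backbone and the hyperparameters here are stand-ins:

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()                      # pretrained weights in practice
model.fc = nn.Linear(model.fc.in_features, 2)  # positive vs negative head

# Small learning rate: nudge the pretrained features, don't retrain from scratch.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)           # stand-in batch
labels = torch.randint(0, 2, (4,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```
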
  53. Modality diagram: Text, Audio, Visual, EEG, Eye gaze, Sentiment.
  54. Image to eye-gaze (saliency) prediction. Pan, Junting, Elisa Sayrol, Xavier Giro-i-Nieto, Kevin McGuinness, and Noel E. O'Connor. "Shallow and deep convolutional networks for saliency prediction." CVPR 2016.
  55. Image to eye-gaze prediction: SalGAN. Pan, Junting, Cristian Canton, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i-Nieto. "SalGAN: Visual Saliency Prediction with Generative Adversarial Networks." arXiv, 2017.
  56. Image to eye-gaze prediction. Reyes, Cristian, Eva Mohedano, Kevin McGuinness, Noel E. O'Connor, and Xavier Giro-i-Nieto. "Where is my Phone? Personal Object Retrieval from Egocentric Images." ACMMM Workshop 2016.
  57. Modality diagram: Text, Audio, Vision, EEG, Eye gaze, Sentiment.
  58. Video to audio representations: learning to synthesize the sounds of videos of people hitting objects with a drumstick. Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
  59. Video to audio representations (not trained end-to-end). Owens et al., "Visually indicated sounds." CVPR 2016.
  60. Video to audio representations. Owens et al., "Visually indicated sounds." CVPR 2016.
  61. Video to audio representations (not trained end-to-end): Vid2speech. Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017.
  62. Video to audio representations: Vid2speech. Ephrat, Ariel, and Shmuel Peleg. ICASSP 2017.
  63. Learning audio representations from video: object and scene recognition in videos by analysing only the audio track. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  64. Learning audio representations from video (the transfer step is sketched below). Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
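
SoundNet (slides 63-64) trains its audio network without audio labels: pretrained vision networks act as teachers, and the audio student is trained to match their class posteriors on the video's frames. A toy sketch of that transfer step with a KL loss; the student architecture and all sizes are stand-ins for the paper's much deeper 1-D ConvNet:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in student: raw waveform -> class logits.
audio_net = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 1000))

waveform = torch.randn(2, 1, 22050)     # ~1 s of audio from the video (toy)
# Posteriors a pretrained vision network produced on the same video's frames.
teacher_probs = F.softmax(torch.randn(2, 1000), dim=1)

# KL divergence pulls the audio network's distribution toward the teacher's.
student_logp = F.log_softmax(audio_net(waveform), dim=1)
loss = F.kl_div(student_logp, teacher_probs, reduction='batchmean')
```
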
  65. Modality diagram: Text, Audio, Vision, EEG, Eye gaze, Sentiment.
  66. Spampinato, Concetto, Simone Palazzo, Isaak Kavasidis, Daniela Giordano, Mubarak Shah, and Nasim Souly. "Deep Learning Human Mind for Automated Visual Classification." CVPR 2017.
  67-68. (image-only slides, no extractable text)
  69. Slides & videos online: http://imatge-upc.github.io/telecombcn-2016-dlcv/. Next edition: Barcelona (June 2017).
  70. Slides & videos online: https://telecombcn-dl.github.io/2017-dlsl/. Next edition: Barcelona (January 2018).
  71. Thanks! Q&A? https://imatge.upc.edu/web/people/xavier-giro @DocXavi /ProfessorXavi #InsightDL2017
