
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018

Deep neural networks have accelerated the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading and video sonorization are among the first applications of a new and exciting field of research that exploits the generalization properties of deep learning. In this talk we review the latest results on how convolutional and recurrent neural networks are combined to uncover hidden patterns in multimedia.



  1. DEEP LEARNING MEETS COMPUTER VISION, Barcelona, Catalonia, 22 November 2018. One Perceptron to Rule them All: Deep Learning for Multimedia (Talk III, #A2IC2018). Xavier Giro-i-Nieto, xavier.giro@upc.edu, Associate Professor, Universitat Politecnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC).
  2. Acknowledgments
  3. Densely linked slides
  4.–6. Text / Audio / Speech / Vision
  7.–9. (image-only slides)
  10. Encoder → Representation
  11. One-hot Representation vs. Embeddings: [1,0,0], [0,1,0], [0,0,1]. Slide concept: Perronin, F., tutorial on LSVR @ CVPR'14, output embedding for LSVR.
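The jump from one-hot codes to learned embeddings is easy to see in code. A minimal PyTorch sketch (sizes and names are illustrative, not from the slides):

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 3, 5  # illustrative sizes

    # One-hot: every symbol is an orthogonal vector, all pairs equally distant.
    one_hot = torch.eye(vocab_size)  # rows [1,0,0], [0,1,0], [0,0,1]

    # Embedding: a learned dense vector per symbol; distances become meaningful.
    embedding = nn.Embedding(vocab_size, embed_dim)
    ids = torch.tensor([0, 1, 2])
    dense = embedding(ids)  # shape (3, 5), trained jointly with the network
    print(one_hot.shape, dense.shape)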
  12. Encoder → [0,1,0] → "Cat"
  13. Encoder → "Cat"
  14. Image Encoding with a CNN: image → CNN → "Cat". Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." NIPS 2012.
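A common way to obtain such an image representation is to take an ImageNet-pretrained CNN and drop its classification head. A sketch using torchvision (the ResNet-18 backbone and the recent weights API are assumptions made here for brevity; the slide's reference uses AlexNet):

    import torch
    import torchvision.models as models

    # Pretrained ImageNet classifier; keep everything up to the final FC layer.
    cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    encoder = torch.nn.Sequential(*list(cnn.children())[:-1])  # strip the classifier
    encoder.eval()

    image = torch.randn(1, 3, 224, 224)  # dummy normalized RGB image
    with torch.no_grad():
        representation = encoder(image).flatten(1)  # (1, 512) feature vector
    print(representation.shape)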
  15.–16. Encoder → Representation
  17. Text Encoding with an RNN: (1) one-hot encoding, (2) word embedding, (3) sentence representation. Fig: Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015). Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
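The three numbered steps map directly onto an embedding layer followed by a recurrent layer whose final hidden state serves as the sentence representation. A minimal sketch (the GRU choice and all sizes are illustrative):

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 1000, 64, 128
    embed = nn.Embedding(vocab_size, embed_dim)           # (2) word embedding
    rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    tokens = torch.randint(0, vocab_size, (1, 7))  # (1) ids standing in for one-hot codes
    _, h_n = rnn(embed(tokens))
    sentence_repr = h_n[-1]                        # (3) sentence representation
    print(sentence_repr.shape)                     # (1, 128)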
  18. Text Encoding with a CNN. Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. "Convolutional sequence to sequence learning." ICML 2017.
  19. Encoder → Representation
  20. Speech Encoder inputs: raw waveform, MFCC or mel spectrum → Representation
  21. Speech Encoding with RNNs and CNNs. Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." ICASSP 2016. Pascual, Santiago, Antonio Bonafonte, and Joan Serra. "SEGAN: Speech enhancement generative adversarial network." Interspeech 2017.
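The front-ends named on slide 20 (raw waveform, MFCC, mel spectrum) are one-liners in torchaudio; a sketch, assuming torchaudio is installed (sample rate and feature sizes are illustrative):

    import torch
    import torchaudio.transforms as T

    waveform = torch.randn(1, 16000)  # 1 s of dummy audio at 16 kHz

    mel = T.MelSpectrogram(sample_rate=16000, n_mels=40)(waveform)  # mel spectrum
    mfcc = T.MFCC(sample_rate=16000, n_mfcc=13)(waveform)           # MFCC
    print(mel.shape, mfcc.shape)  # (1, 40, frames), (1, 13, frames)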
  22. Encoder → Representation
  23.–24. Representation → Decoder
  25. Text Decoding with an RNN: Representation → RNN → text. Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015).
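Decoding runs the recurrence in the opposite direction: the representation seeds the hidden state and tokens are emitted one at a time. A greedy-decoding sketch (the <start> id, loop length and sizes are illustrative):

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 1000, 64, 128
    embed = nn.Embedding(vocab_size, embed_dim)
    cell = nn.GRUCell(embed_dim, hidden_dim)
    to_vocab = nn.Linear(hidden_dim, vocab_size)

    h = torch.randn(1, hidden_dim)  # the encoder representation seeds the state
    token = torch.tensor([0])       # assumed <start> token id
    generated = []
    for _ in range(10):             # greedy decoding up to a fixed length
        h = cell(embed(token), h)
        token = to_vocab(h).argmax(dim=-1)
        generated.append(token.item())
    print(generated)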
  26. Text Decoding with a CNN. Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. "Convolutional sequence to sequence learning." ICML 2017.
  27. Representation → Decoder
  28. Image Decoding with a CNN. Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." ICLR 2016. #DCGAN
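In DCGAN the image decoder is a stack of strided transposed convolutions that upsample a latent vector into an image. A scaled-down sketch of that generator pattern (layer sizes are illustrative, not the paper's exact architecture):

    import torch
    import torch.nn as nn

    # Latent z (100-d) -> 32x32 RGB image via transposed convolutions.
    decoder = nn.Sequential(
        nn.ConvTranspose2d(100, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),  # 1x1 -> 4x4
        nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 4x4 -> 8x8
        nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # 8x8 -> 16x16
        nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                          # 16x16 -> 32x32
    )
    z = torch.randn(1, 100, 1, 1)
    image = decoder(z)
    print(image.shape)  # (1, 3, 32, 32)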
  29. Representation → Decoder
  30. Audio Decoding with RNNs and CNNs. Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. "SampleRNN: An unconditional end-to-end neural audio generation model." ICLR 2017. Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
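WaveNet's decoder stacks causal dilated 1-D convolutions so that each output sample depends only on past samples. A sketch of one such layer (the left-padding trick is the standard one; channel counts are illustrative):

    import torch
    import torch.nn as nn

    class CausalConv1d(nn.Module):
        """Dilated 1-D convolution that never looks at future samples."""
        def __init__(self, channels, kernel_size=2, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

        def forward(self, x):
            x = nn.functional.pad(x, (self.pad, 0))  # left-pad only: causality
            return self.conv(x)

    x = torch.randn(1, 16, 100)          # (batch, channels, samples)
    y = CausalConv1d(16, dilation=4)(x)
    print(y.shape)                       # (1, 16, 100)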
  31.–32. Encoder → Representation → Decoder
  33. Neural Machine Translation (NMT). Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015).
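Wiring the two halves together gives the classic sequence-to-sequence model: the encoder's final hidden state initializes the decoder. A compact sketch (vocabulary sizes and dimensions are illustrative):

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Minimal encoder-decoder: source token ids in, target logits out."""
        def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=128):
            super().__init__()
            self.src_embed = nn.Embedding(src_vocab, dim)
            self.tgt_embed = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, tgt_vocab)

        def forward(self, src, tgt):
            _, h = self.encoder(self.src_embed(src))           # the representation
            dec_out, _ = self.decoder(self.tgt_embed(tgt), h)  # teacher forcing
            return self.out(dec_out)

    model = Seq2Seq()
    logits = model(torch.randint(0, 1000, (1, 6)), torch.randint(0, 1000, (1, 8)))
    print(logits.shape)  # (1, 8, 1000)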
  34. Encoder → Representation → Decoder
  35. Image-to-Image Translation. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation." MICCAI 2015. Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." CVPR 2017.
  36. Encoder → Representation → Decoder
  37. Speech Enhancement. Pascual, Santiago, Antonio Bonafonte, and Joan Serra. "SEGAN: Speech enhancement generative adversarial network." Interspeech 2017.
  38. Encoder → Representation → Decoder
  39. Automatic Speech Recognition (ASR): Listener (encoder) → Speller (decoder). Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." ICASSP 2016. #LAS
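What distinguishes LAS from the plain encoder-decoder above is attention: at each output step the Speller scores all Listener states and consumes their weighted sum. A minimal dot-product attention step (not the paper's exact parameterization; all sizes illustrative):

    import torch
    import torch.nn.functional as F

    listener_states = torch.randn(1, 50, 256)  # encoder outputs over 50 time steps
    speller_state = torch.randn(1, 256)        # current decoder hidden state

    # Score every encoder state against the decoder state, then average.
    scores = torch.bmm(listener_states, speller_state.unsqueeze(2)).squeeze(2)  # (1, 50)
    weights = F.softmax(scores, dim=-1)
    context = torch.bmm(weights.unsqueeze(1), listener_states).squeeze(1)       # (1, 256)
    print(context.shape)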
  40. Encoder → Representation → Decoder
  41. Speech Synthesis. Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. "WaveGlow: A flow-based generative network for speech synthesis." arXiv preprint arXiv:1811.00002 (2018).
  42. Encoder → Representation → Decoder
  43. Image Captioning. Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
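Show and Tell is the same encoder-decoder pattern with a CNN on one side and a language-model RNN on the other. A sketch of the wiring (the ResNet backbone and the projection into the RNN state are illustrative choices, not the paper's exact setup):

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class Captioner(nn.Module):
        """CNN image encoder feeding an RNN caption decoder."""
        def __init__(self, vocab=1000, dim=256):
            super().__init__()
            cnn = models.resnet18(weights=None)  # untrained, for a self-contained demo
            self.encoder = nn.Sequential(*list(cnn.children())[:-1])
            self.project = nn.Linear(512, dim)   # image feature -> initial RNN state
            self.embed = nn.Embedding(vocab, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab)

        def forward(self, image, caption):
            feat = self.encoder(image).flatten(1)        # (B, 512)
            h0 = self.project(feat).unsqueeze(0)         # (1, B, dim)
            dec, _ = self.rnn(self.embed(caption), h0)   # teacher forcing
            return self.out(dec)

    model = Captioner()
    logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
    print(logits.shape)  # (1, 12, 1000)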
  44.–45. Lip Reading. Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-end sentence-level lipreading." (2016).
  46. Encoder → Representation → Decoder
  47.–48. Text-to-Image. Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016.
  49. Encoder + Encoder → Representations → Decoder
  50. Visual Question Answering. Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." CVPR 2015.
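VQA fuses the two representations before decoding an answer; the original baseline projects both modalities and combines them with an element-wise product. A sketch (the projection sizes and the answer-classification head are illustrative):

    import torch
    import torch.nn as nn

    img_feat = torch.randn(1, 512)   # stand-in for a CNN image representation
    q_feat = torch.randn(1, 512)     # stand-in for an RNN question representation

    proj_img = nn.Sequential(nn.Linear(512, 512), nn.Tanh())
    proj_q = nn.Sequential(nn.Linear(512, 512), nn.Tanh())
    classifier = nn.Linear(512, 1000)  # classify over frequent answers

    joint = proj_img(img_feat) * proj_q(q_feat)  # element-wise product fusion
    answer_logits = classifier(joint)
    print(answer_logits.shape)  # (1, 1000)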
  51. Encoder + Encoder → shared Representation
  52. Joint Representations (Embeddings). Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013.
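Joint embeddings are trained so that matched image and text vectors land close together. A minimal sketch of a DeViSE-style pairwise ranking (hinge) loss, with the margin and the in-batch negatives as illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def ranking_loss(img_emb, txt_emb, margin=0.1):
        """Hinge loss pushing each matched pair closer than mismatched ones."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        scores = img_emb @ txt_emb.t()        # cosine similarities, (B, B)
        pos = scores.diag().unsqueeze(1)      # matched pairs on the diagonal
        loss = (margin + scores - pos).clamp(min=0)
        loss.fill_diagonal_(0)                # do not penalize the positives
        return loss.mean()

    print(ranking_loss(torch.randn(4, 128), torch.randn(4, 128)).item())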
  53. Zero-shot Learning. Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. "Zero-shot learning through cross-modal transfer." NIPS 2013. [slides] [code] No images of "cat" appear in the training set, yet cats can still be recognised thanks to the representations learned from text.
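Zero-shot recognition then reduces to nearest-neighbour search in the shared space: embed the image and compare against the text embeddings of all class names, including classes never seen as images. A sketch with dummy vectors:

    import torch
    import torch.nn.functional as F

    class_names = ["dog", "truck", "cat"]                 # "cat" has no training images
    class_emb = F.normalize(torch.randn(3, 128), dim=-1)  # stand-in word embeddings
    img_emb = F.normalize(torch.randn(1, 128), dim=-1)    # embedded test image

    pred = (img_emb @ class_emb.t()).argmax(dim=-1)
    print(class_names[pred.item()])  # nearest class name in the joint space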
  54. Visual Search of a Recipe. Salvador, Amaia, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning cross-modal embeddings for cooking recipes and food images." CVPR 2017. #pic2recipe
  55. Encoder + Encoder → shared Representation
  56.–57. Video Sonorization. Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Sound generation: a two-stream CNN (AlexNet FC7 for image feature extraction) receives, at each time step, an RGB frame on one stream and a spatiotemporal image built from the current, previous and following frames on the other; a 2-layer LSTM regresses sound features, and the final audio is produced by sound retrieval.
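The sound-generation pipeline described above (per-frame CNN features feeding a 2-layer LSTM that regresses sound features over time) can be sketched as follows; the feature dimensions and the regression head are illustrative stand-ins:

    import torch
    import torch.nn as nn

    frames_feat = torch.randn(1, 30, 4096)  # stand-in for per-frame AlexNet FC7 features
    lstm = nn.LSTM(4096, 256, num_layers=2, batch_first=True)  # 2-layer LSTM
    to_sound = nn.Linear(256, 42)           # regress one sound-feature vector per step

    hidden, _ = lstm(frames_feat)
    sound_features = to_sound(hidden)       # (1, 30, 42); audio then found by retrieval
    print(sound_features.shape)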
  58. Encoder + Encoder → shared Representation
  59. Audio search of a matching video: audio feature → best match. Duarte, Amanda, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal embeddings for video and audio retrieval." ECCV Women in Computer Vision Workshop 2018.
  60. ...or video search of a matching audio: visual feature → best matching audio feature. Duarte, Amanda, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal embeddings for video and audio retrieval." ECCV Women in Computer Vision Workshop 2018.
  61.–62. Audio vs. Pixels Alignment. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Objects and scenes are recognised in videos by analysing the audio track only.
  63.–64. Audio vs. Pixels Alignment. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
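Look, Listen and Learn turns alignment into a self-supervised task: classify whether a frame and an audio clip come from the same video. A sketch of the correspondence head, with the two encoders replaced by dummy features (head sizes are illustrative):

    import torch
    import torch.nn as nn

    vision_feat = torch.randn(8, 512)   # stand-in for the vision subnetwork output
    audio_feat = torch.randn(8, 512)    # stand-in for the audio subnetwork output
    labels = torch.randint(0, 2, (8,))  # 1 = same video, 0 = mismatched pair

    head = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 2))
    logits = head(torch.cat([vision_feat, audio_feat], dim=-1))
    loss = nn.CrossEntropyLoss()(logits, labels)
    print(loss.item())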
  65. Encoder + Encoder → shared Representation
  66.–67. Speech vs. Pixels Alignment. Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual context." NIPS 2016. [talk] The visual and speech networks are trained with pairs of corresponding and non-corresponding images and spoken captions; a similarity curve shows which regions of the spectrogram are relevant for the image. Important: no text transcriptions are used during training.
  68.–69. Speech vs. Pixels Alignment. Harwath, David, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass. "Jointly discovering visual objects and spoken words from raw sensory input." ECCV 2018. Image regions matching the spoken word "WOMAN" are highlighted.
  70. Encoder → Representation → Decoder
  71. Speech to Pixels. Duarte, Amanda, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, et al. "Wav2Pix: Speech-conditioned face generation using generative adversarial networks." (work in progress)
  72. Encoder + Encoder → Representations → Decoder
  73.–74. Speech & Pixels to Pixels. Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?" BMVC 2017.
  75. Speech to Lip Keypoints. Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: Learning lip sync from audio." SIGGRAPH 2017.
  76.–78. Speech to 3D Models. Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end learning of pose and emotion." SIGGRAPH 2017.
  79. Encoder → Representation → Decoder
  80. Speech Reconstruction. Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech reconstruction from silent video." ICASSP 2017. A CNN (VGG) maps each frame of a silent video to an audio feature; the speech waveform is synthesized post hoc.
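The core of Vid2speech is a CNN regressing an audio feature vector from each silent frame, with synthesis happening afterwards. A sketch with a small stand-in network (the VGG backbone and the true feature sizes are replaced by illustrative ones):

    import torch
    import torch.nn as nn

    cnn = nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 8),  # illustrative audio-feature size
    )
    frame = torch.randn(1, 3, 128, 128)  # one frame from a silent video
    audio_feature = cnn(frame)           # speech is synthesized from these post hoc
    print(audio_feature.shape)           # (1, 8)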
  81. Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." ICCV Workshop on Computer Vision for Audio-Visual Media, 2017.
  82. Encoder + Encoder → Representations → Decoder
  83.–84. Speech Separation with Vision (lips). Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep audio-visual speech enhancement." Interspeech 2018.
  85. Encoder → Representation → Decoder
  86. Text / Audio / Speech / Vision
  87. @DocXavi Xavier Giro-i-Nieto, xavier.giro@upc.edu. Slides available at: http://bit.ly/a2ic2018 #A2IC2018
