Deep neural networks have accelerated the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading and video sonorization are among the first applications of a new and exciting field of research that exploits the generalisation properties of deep learning. Get the latest results on how convolutional and recurrent neural networks are combined to uncover hidden patterns in multimedia.
https://re-work.co/events/deep-learning-summit-london-2017/
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
Slide 1
DEEP LEARNING SUMMIT, London, 21 September 2017
Xavier Giró-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
(Technical University of Catalonia)
One Perceptron
to Rule Them All:
Deep Learning for Multimedia
#REWORKDL
@DocXavi [Slides on GDrive]
Slide 28
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Image to Caption: Show & Tell
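The Show & Tell recipe (a CNN encodes the image into a feature vector that conditions an RNN language model, decoded greedily word by word) can be sketched with toy numpy code. All weights and the tiny vocabulary below are made-up placeholders; a real model learns them end to end on paired image-caption data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; in the real model this has thousands of words.
vocab = ["<start>", "<end>", "a", "dog", "runs"]
V, H, F = len(vocab), 8, 16  # vocab size, hidden size, CNN feature size

# Hypothetical weights; a trained model would learn all of these.
W_img = rng.standard_normal((H, F)) * 0.1   # projects the CNN feature
W_emb = rng.standard_normal((H, V)) * 0.1   # word embeddings (one column per word)
W_hh  = rng.standard_normal((H, H)) * 0.1   # recurrent weights
W_out = rng.standard_normal((V, H)) * 0.1   # hidden state -> word scores

def caption(cnn_feature, max_len=5):
    """Greedy decoding: condition the RNN on the image, then emit words."""
    h = np.tanh(W_img @ cnn_feature)            # image feature sets the initial state
    word = vocab.index("<start>")
    out = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_emb[:, word])  # simple Elman-style update
        word = int(np.argmax(W_out @ h))        # most likely next word
        if vocab[word] == "<end>":
            break
        out.append(vocab[word])
    return out

words = caption(rng.standard_normal(F))
print(words)  # some sequence of toy vocabulary words
```

The real model replaces the toy RNN with an LSTM and trains by maximum likelihood on reference captions; beam search usually replaces the greedy argmax at test time.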
Slide 29
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S.
Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention." ICML 2015.
Image to Caption: Show, Attend & Tell
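Soft attention is the key addition: at each decoding step the model re-weights a grid of convolutional features by their relevance to the current hidden state. A minimal numpy sketch of one additive-attention step; the weight matrices W_f, W_h and vector v are hypothetical stand-ins for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# A 7x7 grid of CNN features, flattened to 49 locations.
L, D, H = 49, 32, 16
features = rng.standard_normal((L, D))  # one feature vector per image location
h = rng.standard_normal(H)              # current decoder hidden state

# Hypothetical attention parameters.
W_f = rng.standard_normal((H, D)) * 0.1
W_h = rng.standard_normal((H, H)) * 0.1
v   = rng.standard_normal(H) * 0.1

# Additive ("soft") attention: score each location against the hidden state.
scores = np.tanh(features @ W_f.T + h @ W_h.T) @ v   # shape (L,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                                  # attention weights sum to 1
context = alpha @ features                            # weighted sum of location features

print(alpha.sum(), context.shape)
```

The context vector is fed to the decoder together with the previous word, so each emitted word can "look at" a different part of the image.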
Slide 30
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016.
Image to Caption: DenseCap
Slide 31
Visual Question Answering (VQA)
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and
Devi Parikh. "VQA: Visual question answering." CVPR 2015.
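A common VQA baseline (not necessarily the exact model in the paper) projects an image feature and a question feature into a shared space, fuses them elementwise, and classifies over a fixed set of candidate answers. A sketch with made-up weights and feature sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy answer set; real systems classify over the most frequent answers.
answers = ["yes", "no", "two", "red"]
D = 16

img_feat = rng.standard_normal(512)    # e.g. a CNN feature
q_feat   = rng.standard_normal(300)    # e.g. an averaged word embedding

# Hypothetical projection and classifier weights.
W_img = rng.standard_normal((D, 512)) * 0.05
W_q   = rng.standard_normal((D, 300)) * 0.05
W_cls = rng.standard_normal((len(answers), D)) * 0.1

fused = np.tanh(W_img @ img_feat) * np.tanh(W_q @ q_feat)  # elementwise fusion
scores = W_cls @ fused
answer = answers[int(np.argmax(scores))]
print(answer)
```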
Slide 33
Video to Caption
Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
Slide 34
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the
wild." CVPR 2017.
[Figure: audio features and image features are extracted from the two input streams]
Slide 35
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the
wild." CVPR 2017.
Attention over output states from audio and video is computed at each timestep.
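The per-timestep attention over both encoders can be sketched as below; the dot-product scoring and the projection matrices W_a, W_v are simplifying assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(3)

def soft_attention(states, h, W):
    """Dot-product attention of decoder state h over a sequence of encoder states."""
    scores = states @ (W @ h)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ states  # context vector

T_a, T_v, D, H = 20, 12, 16, 16
audio_states = rng.standard_normal((T_a, D))  # audio encoder outputs
video_states = rng.standard_normal((T_v, D))  # video (lip) encoder outputs
h = rng.standard_normal(H)                    # decoder state at this timestep

# Hypothetical projection matrices, one per modality.
W_a = rng.standard_normal((D, H)) * 0.1
W_v = rng.standard_normal((D, H)) * 0.1

# One context vector per modality, recomputed at every output step.
ctx_audio = soft_attention(audio_states, h, W_a)
ctx_video = soft_attention(video_states, h, W_v)
ctx = np.concatenate([ctx_audio, ctx_video])  # fed to the character decoder
print(ctx.shape)
```

Keeping two separate attention mechanisms lets the decoder lean on audio when it is clean and on the lips when it is noisy.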
Slide 39
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013.
Joint embeddings: DeViSE
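DeViSE maps images into a pretrained word-vector space and trains with a hinge rank loss: the image embedding should score higher with the correct label's word vector than with other labels' vectors, by a margin. A toy sketch (the vectors and margin below are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

def hinge_rank_loss(img_emb, true_word, wrong_words, margin=0.1):
    """DeViSE-style rank loss: penalise wrong labels that score within the
    margin of (or above) the correct label's similarity."""
    s_true = img_emb @ true_word
    return sum(max(0.0, margin - s_true + img_emb @ w) for w in wrong_words)

D = 8
img_emb = rng.standard_normal(D)                      # image mapped into word space
true_word = img_emb + 0.01 * rng.standard_normal(D)   # toy: nearly aligned label
wrong = [rng.standard_normal(D) for _ in range(3)]

loss = hinge_rank_loss(img_emb, true_word, wrong)
print(loss)  # non-negative; zero once all wrong labels are outranked by the margin
```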
Slide 40
Joint embeddings: DeViSE
Socher, Richard, Milind Ganjoo, Christopher D. Manning, and Andrew Y. Ng. "Zero-Shot Learning Through Cross-Modal Transfer." NIPS 2013. [slides] [code]
Zero-shot learning: a class not present in the training set of images can be predicted, because it was present in the training set of words.
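A toy illustration of the zero-shot mechanism: the word vectors below are made up, and classification is just nearest-neighbour search by cosine similarity in the word space.

```python
import numpy as np

# Toy word-vector space: "cat" was never in the image training set, but its
# word vector is known, so an image mapped near it can still be labelled "cat".
word_vecs = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "truck": np.array([0.0, 1.0, 0.0]),
    "cat":   np.array([0.9, 0.1, 0.4]),  # unseen visual class
}

def classify(img_emb):
    """Return the label whose word vector is closest by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(word_vecs, key=lambda w: cos(img_emb, word_vecs[w]))

# An image embedding that lands close to "cat" in the word space.
label = classify(np.array([0.85, 0.15, 0.45]))
print(label)  # "cat"
```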
Slide 41
Salvador, Amaia, Nicholas Hynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017.
Recipe retrieval from images (and vice versa).
Joint embeddings: Pic2Recipe
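Once both modalities live in a joint embedding space, retrieval reduces to nearest-neighbour search. A toy sketch with random unit vectors standing in for the learned recipe and image embeddings:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-ins for learned embeddings: N recipes in a D-dimensional joint space.
D, N = 16, 100
recipe_embs = rng.standard_normal((N, D))
recipe_embs /= np.linalg.norm(recipe_embs, axis=1, keepdims=True)

# A food image whose embedding lands near its paired recipe (index 42).
query_idx = 42
img_emb = recipe_embs[query_idx] + 0.05 * rng.standard_normal(D)
img_emb /= np.linalg.norm(img_emb)

sims = recipe_embs @ img_emb          # cosine similarity to every recipe
ranking = np.argsort(-sims)           # best match first
print(ranking[0])
```

Swapping the roles of the two embedding banks gives the reverse direction (image retrieval from a recipe), which is why the same model serves both tasks.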
Slide 42
Salvador, Amaia, Nicholas Hynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017.
Joint embeddings: Pic2Recipe
Slide 43
Cross-modal Networks
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal
Scene Networks." CVPR 2016.
Slide 44
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal
Scene Networks." CVPR 2016.
Cross-modal Networks
Slide 50
Image Sonorization
L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM
International Conference on Multimedia Thematic Workshops, 2017.
Discriminator loss: does the image-audio pair match?
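The pair-matching discriminator idea can be sketched as a binary classifier over a concatenated (image, audio) feature pair; the linear scoring function below is a minimal stand-in for the paper's network:

```python
import numpy as np

rng = np.random.default_rng(6)

def discriminator(img_feat, audio_feat, W, b):
    """Score whether an (image, audio) pair matches: sigmoid of a linear map
    over the concatenated features (a toy stand-in for a deep discriminator)."""
    x = np.concatenate([img_feat, audio_feat])
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def bce(p, label):
    """Binary cross-entropy: label 1 = matching pair, 0 = mismatched pair."""
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

D = 8
W = rng.standard_normal(2 * D) * 0.1  # hypothetical discriminator weights
b = 0.0
img, aud = rng.standard_normal(D), rng.standard_normal(D)

p = discriminator(img, aud, W, b)
print(p, bce(p, 1), bce(p, 0))
```

In the full GAN, the generator producing the audio (or the image) is trained to fool exactly this matching decision.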
Slide 51
Image Sonorization (& vice versa)
L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM
International Conference on Multimedia Thematic Workshops, 2017.
Discriminator loss: does the image-audio pair match?
Slide 52
Image Sonorization
L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM
International Conference on Multimedia Thematic Workshops, 2017.
Slide 53
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Learns to synthesize sounds from videos of people hitting objects with a drumstick.
Video Sonorization
Slide 54
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Video Sonorization
Not end-to-end
Slide 57
Video Sonorization
Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto (work in progress).
[Figure: the visual feature is matched against audio features to find the best match]
Slide 58
Video Sonorization
Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto (work in progress).
[Figure: the visual feature is matched against audio features to find the best match]
Slide 59
Video Sonorization
Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto (work in progress).
[Figure: the visual feature is matched against audio features to find the best match]
Slide 60
Video Sonorization (& vice versa)
Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto (work in progress).
[Figure: the visual feature is matched against audio features to find the best match]
Slide 61
Learn audio representations from video
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Teacher networks for visual object and scene detection generate weak labels.
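The teacher-student transfer can be sketched as distribution matching: the vision teacher's softmax over object/scene classes becomes the target for the audio student via a KL-divergence loss. The toy logits below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(7)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence between teacher and student class distributions."""
    return float(np.sum(p * np.log(p / q)))

K = 10  # number of object/scene classes
# Teacher: a pretrained vision network labels the video frames (weak labels).
p_teacher = softmax(rng.standard_normal(K))

# Student: an audio network sees only the soundtrack of the same video.
p_student = softmax(rng.standard_normal(K))

# SoundNet-style objective: make the audio network's class distribution
# match the vision teacher's distribution for the same video.
loss = kl(p_teacher, p_student)
print(loss)  # non-negative; zero only when the two distributions agree
```

Because the labels come from the teacher rather than humans, the student can be trained on very large amounts of unlabeled video.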
Slide 66
Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICCVW 2017.
Video to Speech (Speechreading)
Trainable modules
Slide 70
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama:
learning lip sync from audio." SIGGRAPH 2017.
Speech to Video Synthesis (mouth)
Slide 72
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017.
Speech to Video Synthesis (pose & emotion)
Slide 76
Deep Learning online courses by UPC:
http://dlai.deeplearning.barcelona/
https://telecombcn-dl.github.io/2017-dlcv/
http://imatge-upc.github.io/telecombcn-2016-dlcv/
https://telecombcn-dl.github.io/2017-dlsl/
Autumn semester / Winter School (24-30 January 2018) / Summer School (early July 2018)