One Perceptron to Rule Them All: Deep Learning for Multimedia
Re-Work Deep Learning Summit, London, 21 September 2017
Xavier Giro-i-Nieto (xavier.giro@upc.edu)
Associate Professor, Universitat Politecnica de Catalunya (Technical University of Catalonia)
#REWORKDL | @DocXavi | [Slides on GDrive]
[Diagram: the modalities covered in this talk: Text, Visual, Audio, Speech]
Text
Text (English) → Text (French)
Neural Machine Translation
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
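The encoder-decoder idea behind neural machine translation can be sketched with a toy example: the encoder compresses a variable-length sentence into one fixed-size vector, which a decoder would then expand into the target language. This is a minimal, hypothetical sketch; the 2-D embeddings and mean-pooling encoder are illustrative stand-ins for a learned recurrent encoder.

```python
import math

# Hypothetical 2-D word embeddings (real systems learn
# hundreds of dimensions from data).
EMB = {
    "the": [0.1, 0.2],
    "cat": [0.9, 0.1],
    "sat": [0.3, 0.8],
}

def encode(sentence):
    """Encoder: compress a variable-length sentence into one
    fixed-size vector (here simply the mean of its word vectors;
    an NMT encoder would use a recurrent network instead)."""
    vecs = [EMB[w] for w in sentence.split()]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two representations."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

rep = encode("the cat sat")
```

Whatever the encoder's internals, its output plays the same role: a point in a representation space that the decoder conditions on.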
Encoder → Representation or Embedding → Decoder
Embedding Space: word embeddings
[Figure: Christopher Olah, "Visualizing Representations"]
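In a learned embedding space, semantically related words end up near each other, so similarity reduces to geometric distance. A minimal sketch with hypothetical 3-D vectors (real word embeddings have hundreds of dimensions and are learned, not hand-set):

```python
# Hypothetical word vectors for illustration only.
VECS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def nearest(word):
    """Return the other vocabulary word whose vector is closest
    (squared Euclidean distance) to the query word's vector."""
    target = VECS[word]
    best, best_d = None, float("inf")
    for w, v in VECS.items():
        if w == word:
            continue
        d = sum((a - b) ** 2 for a, b in zip(target, v))
        if d < best_d:
            best, best_d = w, d
    return best
```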
Visual
Visual (B&W) → Visual (Color): Colorization
Encoder → Representation or Embedding → Decoder
Image to Image
Image to Image: #pix2pix
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-Image Translation with Conditional Adversarial Networks." CVPR 2017.
Image to Gaze Saliency Map: SalGAN
Junting Pan, Cristian Canton, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i-Nieto. "SalGAN: Visual Saliency Prediction with Generative Adversarial Networks." CVPRW 2017.
Generative Adversarial Training
[Figure by Christopher Hesse on the Affinelayer blog.]
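Adversarial training pits a generator against a discriminator: the discriminator learns to score real samples high and generated ones low, while the generator learns to fool it. A rough plain-Python sketch of the two objectives (binary cross-entropy on hypothetical discriminator scores, not any particular paper's implementation):

```python
import math

def bce(pred, label):
    """Binary cross-entropy for one sample; pred must lie in (0, 1)."""
    return -(label * math.log(pred) + (1 - label) * math.log(1 - pred))

def discriminator_loss(real_preds, fake_preds):
    """D wants real samples scored near 1 and fakes near 0."""
    losses = [bce(p, 1.0) for p in real_preds] + \
             [bce(p, 0.0) for p in fake_preds]
    return sum(losses) / len(losses)

def generator_loss(fake_preds):
    """G wants D to score its fakes as real (label 1)."""
    return sum(bce(p, 1.0) for p in fake_preds) / len(fake_preds)
```

A confident, correct discriminator (real near 1, fake near 0) yields a low D loss, which in turn makes the generator's loss large, and vice versa: the two objectives pull against each other.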
Visual → Text
Image to Word
A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012.
One-hot representation of the predicted word (e.g. "orange"): [1,0,0], [0,1,0], [0,0,1]
Perronnin, F. "Output Embedding for LSVR." CVPR'14 Tutorial on LSVR.
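A one-hot output vector has a single 1 at the index of the predicted class and 0 everywhere else. A minimal sketch over a hypothetical three-word vocabulary:

```python
# Hypothetical vocabulary; ImageNet classifiers use 1000 classes.
VOCAB = ["orange", "apple", "banana"]

def one_hot(word):
    """Map a class label to its one-hot vector over the vocabulary."""
    vec = [0] * len(VOCAB)
    vec[VOCAB.index(word)] = 1
    return vec
```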
Encoder → Representation or Embedding → Decoder
Image to Caption
Image to Caption: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and Tell: A Neural Image Caption Generator." CVPR 2015.
Image to Caption: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
Image to Caption: DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully Convolutional Localization Networks for Dense Captioning." CVPR 2016.
Visual Question Answering (VQA)
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual Question Answering." CVPR 2015.
Visual Question Answering (VQA)
[Diagram, slide credit Issey Masuda: a CNN produces the visual representation and a word/sentence embedding + LSTM the textual one; the two are merged to predict the answer. Question: "What object is flying?" Answer: "Kite".]
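The merge step in such a VQA pipeline can be sketched as concatenating the two modality vectors and scoring candidate answers with a linear layer plus softmax. All vectors, weights, and answer candidates below are hypothetical stand-ins for the learned CNN and LSTM features:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def merge_and_predict(visual, textual, weights, answers):
    """Concatenate the two modality vectors, apply a linear layer
    (one weight row per candidate answer), and softmax the scores."""
    joint = visual + textual  # concatenation of the two representations
    scores = [sum(w * x for w, x in zip(row, joint)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs

# Hypothetical 2-D features and hand-set weights for illustration.
answer, probs = merge_and_predict(
    [1.0, 0.0],                    # "visual" feature
    [0.0, 1.0],                    # "textual" feature
    [[2.0, 0.0, 0.0, 2.0],         # weight row for "kite"
     [0.0, 0.0, 0.0, 0.0]],        # weight row for "dog"
    ["kite", "dog"],
)
```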
Video to Caption
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." CVPR 2017.
[Diagram: audio features and image features feed the model.]
At each timestep, attention is computed over the output states from both the audio and the video streams.
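That attention step can be sketched as scoring every encoder state against the current decoder query, normalising the scores with a softmax, and taking the weighted sum as the context vector. A minimal dot-product attention sketch over hypothetical 2-D states (the papers use learned projections rather than raw dot products):

```python
import math

def attend(query, states):
    """Dot-product attention: score each state against the query,
    softmax the scores, and return the weighted sum of states
    (the context vector) together with the attention weights."""
    scores = [sum(q * s for q, s in zip(query, st)) for st in states]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(states[0])
    context = [sum(w * st[i] for w, st in zip(weights, states))
               for i in range(dim)]
    return context, weights

# The query is most similar to the first state, so attention
# concentrates there.
ctx, w = attend([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]])
```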
Cross-Modal: Text and Visual
Joint Embeddings
Joint Embeddings: DeViSE
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A Deep Visual-Semantic Embedding Model." NIPS 2013.
Joint Embeddings: DeViSE
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. "Zero-Shot Learning Through Cross-Modal Transfer." NIPS 2013. [slides] [code]
Zero-shot learning: a class not present in the training set of images can still be predicted, because it was present in the training set of words.
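Zero-shot prediction can be sketched as projecting the image into the word-embedding space and picking the nearest word vector, including words never seen as image labels. The 2-D vectors below are hypothetical; real systems learn the image-to-word projection from paired data:

```python
# Hypothetical word vectors. "truck" was never an image label
# during training, but it has a word vector, so it can still win.
WORD_VECS = {
    "cat":   [0.9, 0.1],
    "dog":   [0.8, 0.2],
    "truck": [0.1, 0.9],
}

def zero_shot_predict(image_vec, labels=WORD_VECS):
    """Predict the label whose word vector is closest to the
    image's projection into the word-embedding space."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labels, key=lambda w: dist(image_vec, labels[w]))
```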
Joint Embeddings: Pic2Recipe
Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017.
Recipe retrieval from images (and vice versa).
Cross-Modal Networks
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
Cross-Modal: Text, Visual, and Audio
Cross-Modal Networks: See, Hear & Read
Yusuf Aytar, Carl Vondrick, and Antonio Torralba. "See, Hear, and Read: Deep Aligned Representations." 2017.
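Aligned cross-modal representations are typically trained so that matching pairs land close together in the shared space while mismatched pairs stay apart. A minimal hinge-style ranking loss sketch (the margin value and squared-distance choice are illustrative, not the paper's exact objective):

```python
def contrastive_loss(anchor, positive, negative, margin=0.5):
    """Hinge ranking loss: the anchor (e.g. an image embedding)
    should be closer to its aligned pair (e.g. the matching text)
    than to a mismatched one, by at least `margin`."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, margin + d(anchor, positive) - d(anchor, negative))
```

When the positive is already much closer than the negative, the loss is zero and no gradient flows; only violating triplets contribute.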
Visual and Audio
Image Sonorization (and vice versa)
L. Chen, S. Srivastava, Z. Duan, and C. Xu. "Deep Cross-Modal Audio-Visual Generation." ACM International Conference on Multimedia Thematic Workshops, 2017.
Discriminator loss: does the image-audio pair match?
53
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Learn synthesized sounds from videos of people hitting objects with a drumstick.
Video Sonorization
54
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Video Sonorization
Not end-to-end
55
Video Sonorization (and vice versa)
Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto (work in progress).
[Diagram: a visual feature is compared against candidate audio features and the best match is retrieved, and vice versa.]
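The best-match step can be sketched as a nearest-neighbour search over embeddings: score the visual feature against every candidate audio feature and keep the highest-scoring one. A minimal sketch using cosine similarity over hypothetical 2-D features (the real models learn high-dimensional embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_match(visual_feat, audio_bank):
    """Return the index of the audio feature that best matches
    the visual feature (highest cosine similarity)."""
    return max(range(len(audio_bank)),
               key=lambda i: cosine(visual_feat, audio_bank[i]))
```

The same function works in the other direction (audio query against a bank of visual features), which is the "vice versa" on the slide.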
Learning Audio Representations from Video: SoundNet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.
Teacher networks for visual object and scene detection generate weak labels.
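The teacher-student setup can be sketched with the usual distillation objective: the student (audio) network is trained so that its class distribution matches the teacher's (visual) distribution on the same unlabeled video. A minimal KL-divergence sketch on hypothetical distributions:

```python
import math

def kl_div(teacher, student):
    """KL(teacher || student) between two class distributions.
    Minimising this over the student's parameters pushes the
    audio network's predictions toward the visual teacher's
    soft labels."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher, student) if t > 0)
```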
Learning Audio Representations from Video: Look, Listen and Learn
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Visual and Speech
Video to Speech (Speechreading): Vid2Speech
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICCVW 2017.
[Diagram highlights the trainable modules.]
Speech (+ Identity) to Video Synthesis: You Said That?
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You Said That?" BMVC 2017.
Speech to Video Synthesis (mouth): Synthesizing Obama
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: Learning Lip Sync from Audio." SIGGRAPH 2017.
Speech to Video Synthesis (pose & emotion)
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven Facial Animation by Joint End-to-end Learning of Pose and Emotion." SIGGRAPH 2017.
Deep Learning online courses by UPC:
http://dlai.deeplearning.barcelona/ (Autumn semester)
https://telecombcn-dl.github.io/2017-dlcv/ and http://imatge-upc.github.io/telecombcn-2016-dlcv/ (Summer School, early July 2018)
https://telecombcn-dl.github.io/2017-dlsl/ (Winter School, 24-30 January 2018)
#REWORKDL | @DocXavi [Slides on GDrive with links]
Xavier Giro-i-Nieto (xavier.giro@upc.edu)