Deep neural networks have accelerated the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading and video sonorization are among the first applications of a new and exciting field of research that exploits the generalisation properties of deep learning. Get the latest results on how convolutional and recurrent neural networks are combined to uncover hidden patterns in multimedia.
https://re-work.co/events/deep-learning-summit-london-2017/
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
Slide 1
DEEP LEARNING SUMMIT, London, 21 September 2017
Xavier Giró-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
(Technical University of Catalonia)
One Perceptron
to Rule Them All:
Deep Learning for Multimedia
#REWORKDL
@DocXavi [Slides on GDrive]
Slide 28
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Image to Caption: Show & Tell
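The Show & Tell recipe (a CNN encodes the image into a feature vector that conditions an RNN language model, decoded greedily word by word) can be sketched with toy numpy code. All weights and the tiny vocabulary below are made-up placeholders; a real model learns them end to end on paired image-caption data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; in the real model this has thousands of words.
vocab = ["<start>", "<end>", "a", "dog", "runs"]
V, H, F = len(vocab), 8, 16  # vocab size, hidden size, CNN feature size

# Hypothetical weights; a trained model would learn all of these.
W_img = rng.standard_normal((H, F)) * 0.1   # projects the CNN feature
W_emb = rng.standard_normal((H, V)) * 0.1   # word embeddings (one column per word)
W_hh  = rng.standard_normal((H, H)) * 0.1   # recurrent weights
W_out = rng.standard_normal((V, H)) * 0.1   # hidden state -> word scores

def caption(cnn_feature, max_len=5):
    """Greedy decoding: condition the RNN on the image, then emit words."""
    h = np.tanh(W_img @ cnn_feature)            # image feature sets the initial state
    word = vocab.index("<start>")
    out = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_emb[:, word])  # simple Elman-style update
        word = int(np.argmax(W_out @ h))        # most likely next word
        if vocab[word] == "<end>":
            break
        out.append(vocab[word])
    return out

words = caption(rng.standard_normal(F))
print(words)  # some sequence of toy vocabulary words
```

The real model replaces the toy RNN with an LSTM and trains by maximum likelihood on reference captions; beam search usually replaces the greedy argmax at test time.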
Slide 29
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S.
Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention." ICML 2015.
Image to Caption: Show, Attend & Tell
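Soft attention is the key addition: at each decoding step the model re-weights a grid of convolutional features by their relevance to the current hidden state. A minimal numpy sketch of one additive-attention step; the weight matrices W_f, W_h and vector v are hypothetical stand-ins for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# A 7x7 grid of CNN features, flattened to 49 locations.
L, D, H = 49, 32, 16
features = rng.standard_normal((L, D))  # one feature vector per image location
h = rng.standard_normal(H)              # current decoder hidden state

# Hypothetical attention parameters.
W_f = rng.standard_normal((H, D)) * 0.1
W_h = rng.standard_normal((H, H)) * 0.1
v   = rng.standard_normal(H) * 0.1

# Additive ("soft") attention: score each location against the hidden state.
scores = np.tanh(features @ W_f.T + h @ W_h.T) @ v   # shape (L,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                                  # attention weights sum to 1
context = alpha @ features                            # weighted sum of location features

print(alpha.sum(), context.shape)
```

The context vector is fed to the decoder together with the previous word, so each emitted word can "look at" a different part of the image.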
Slide 30
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016.
Image to Caption: DenseCap
Slide 31
Visual Question Answering (VQA)
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and
Devi Parikh. "VQA: Visual question answering." CVPR 2015.
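A common VQA baseline (not necessarily the exact model in the paper) projects an image feature and a question feature into a shared space, fuses them elementwise, and classifies over a fixed set of candidate answers. A sketch with made-up weights and feature sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy answer set; real systems classify over the most frequent answers.
answers = ["yes", "no", "two", "red"]
D = 16

img_feat = rng.standard_normal(512)    # e.g. a CNN feature
q_feat   = rng.standard_normal(300)    # e.g. an averaged word embedding

# Hypothetical projection and classifier weights.
W_img = rng.standard_normal((D, 512)) * 0.05
W_q   = rng.standard_normal((D, 300)) * 0.05
W_cls = rng.standard_normal((len(answers), D)) * 0.1

fused = np.tanh(W_img @ img_feat) * np.tanh(W_q @ q_feat)  # elementwise fusion
scores = W_cls @ fused
answer = answers[int(np.argmax(scores))]
print(answer)
```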
Slide 33
Video to Caption
Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
Slide 34
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the
wild." CVPR 2017.
[Figure: audio features and image features are extracted from the two input streams]
Slide 35
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the
wild." CVPR 2017.
Attention over output states from audio and video is computed at each timestep.
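The per-timestep attention over both encoders can be sketched as below; the dot-product scoring and the projection matrices W_a, W_v are simplifying assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(3)

def soft_attention(states, h, W):
    """Dot-product attention of decoder state h over a sequence of encoder states."""
    scores = states @ (W @ h)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ states  # context vector

T_a, T_v, D, H = 20, 12, 16, 16
audio_states = rng.standard_normal((T_a, D))  # audio encoder outputs
video_states = rng.standard_normal((T_v, D))  # video (lip) encoder outputs
h = rng.standard_normal(H)                    # decoder state at this timestep

# Hypothetical projection matrices, one per modality.
W_a = rng.standard_normal((D, H)) * 0.1
W_v = rng.standard_normal((D, H)) * 0.1

# One context vector per modality, recomputed at every output step.
ctx_audio = soft_attention(audio_states, h, W_a)
ctx_video = soft_attention(video_states, h, W_v)
ctx = np.concatenate([ctx_audio, ctx_video])  # fed to the character decoder
print(ctx.shape)
```

Keeping two separate attention mechanisms lets the decoder lean on audio when it is clean and on the lips when it is noisy.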
Slide 39
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013.
Joint embeddings: DeViSE
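DeViSE maps images into a pretrained word-vector space and trains with a hinge rank loss: the image embedding should score higher with the correct label's word vector than with other labels' vectors, by a margin. A toy sketch (the vectors and margin below are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

def hinge_rank_loss(img_emb, true_word, wrong_words, margin=0.1):
    """DeViSE-style rank loss: penalise wrong labels that score within the
    margin of (or above) the correct label's similarity."""
    s_true = img_emb @ true_word
    return sum(max(0.0, margin - s_true + img_emb @ w) for w in wrong_words)

D = 8
img_emb = rng.standard_normal(D)                      # image mapped into word space
true_word = img_emb + 0.01 * rng.standard_normal(D)   # toy: nearly aligned label
wrong = [rng.standard_normal(D) for _ in range(3)]

loss = hinge_rank_loss(img_emb, true_word, wrong)
print(loss)  # non-negative; zero once all wrong labels are outranked by the margin
```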
Slide 40
Joint embeddings: DeViSE
Socher, Richard, Milind Ganjoo, Christopher D. Manning, and Andrew Y. Ng. "Zero-Shot Learning Through Cross-Modal Transfer." NIPS 2013. [slides] [code]
Zero-shot learning: a class not present in the training set of images can be predicted, because it was present in the training set of words.
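A toy illustration of the zero-shot mechanism: the word vectors below are made up, and classification is just nearest-neighbour search by cosine similarity in the word space.

```python
import numpy as np

# Toy word-vector space: "cat" was never in the image training set, but its
# word vector is known, so an image mapped near it can still be labelled "cat".
word_vecs = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "truck": np.array([0.0, 1.0, 0.0]),
    "cat":   np.array([0.9, 0.1, 0.4]),  # unseen visual class
}

def classify(img_emb):
    """Return the label whose word vector is closest by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(word_vecs, key=lambda w: cos(img_emb, word_vecs[w]))

# An image embedding that lands close to "cat" in the word space.
label = classify(np.array([0.85, 0.15, 0.45]))
print(label)  # "cat"
```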
Slide 41
Salvador, Amaia, Nicholas Hynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017.
Recipe retrieval from images (and vice versa).
Joint embeddings: Pic2Recipe
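Once both modalities live in a joint embedding space, retrieval reduces to nearest-neighbour search. A toy sketch with random unit vectors standing in for the learned recipe and image embeddings:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-ins for learned embeddings: N recipes in a D-dimensional joint space.
D, N = 16, 100
recipe_embs = rng.standard_normal((N, D))
recipe_embs /= np.linalg.norm(recipe_embs, axis=1, keepdims=True)

# A food image whose embedding lands near its paired recipe (index 42).
query_idx = 42
img_emb = recipe_embs[query_idx] + 0.05 * rng.standard_normal(D)
img_emb /= np.linalg.norm(img_emb)

sims = recipe_embs @ img_emb          # cosine similarity to every recipe
ranking = np.argsort(-sims)           # best match first
print(ranking[0])
```

Swapping the roles of the two embedding banks gives the reverse direction (image retrieval from a recipe), which is why the same model serves both tasks.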
Slide 42
Salvador, Amaia, Nicholas Hynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, and Antonio Torralba. "Learning Cross-modal Embeddings for Cooking Recipes and Food Images." CVPR 2017.
Joint embeddings: Pic2Recipe
Slide 43
Cross-modal Networks
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal
Scene Networks." CVPR 2016.
Slide 44
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal
Scene Networks." CVPR 2016.
Cross-modal Networks
Slide 50
Image Sonorization
L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM
International Conference on Multimedia Thematic Workshops, 2017.
Discriminator loss: does the image-audio pair match?
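The pair-matching discriminator idea can be sketched as a binary classifier over a concatenated (image, audio) feature pair; the linear scoring function below is a minimal stand-in for the paper's network:

```python
import numpy as np

rng = np.random.default_rng(6)

def discriminator(img_feat, audio_feat, W, b):
    """Score whether an (image, audio) pair matches: sigmoid of a linear map
    over the concatenated features (a toy stand-in for a deep discriminator)."""
    x = np.concatenate([img_feat, audio_feat])
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def bce(p, label):
    """Binary cross-entropy: label 1 = matching pair, 0 = mismatched pair."""
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

D = 8
W = rng.standard_normal(2 * D) * 0.1  # hypothetical discriminator weights
b = 0.0
img, aud = rng.standard_normal(D), rng.standard_normal(D)

p = discriminator(img, aud, W, b)
print(p, bce(p, 1), bce(p, 0))
```

In the full GAN, the generator producing the audio (or the image) is trained to fool exactly this matching decision.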
Slide 51
Image Sonorization (& vice versa)
L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM
International Conference on Multimedia Thematic Workshops, 2017.
Discriminator loss: does the image-audio pair match?
Slide 52
Image Sonorization
L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM
International Conference on Multimedia Thematic Workshops, 2017.
Slide 53
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Learns to synthesize sounds from videos of people hitting objects with a drumstick.
Video Sonorization
Slide 54
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Video Sonorization
Not end-to-end
Slide 57
Video Sonorization
Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto (work in progress).
[Figure: the visual feature is matched against audio features to find the best match]
Slide 58
Video Sonorization
Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto (work in progress).
[Figure: the visual feature is matched against audio features to find the best match]
Slide 59
Video Sonorization
Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto (work in progress).
[Figure: the visual feature is matched against audio features to find the best match]
Slide 60
Video Sonorization (& vice versa)
Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto (work in progress).
[Figure: the visual feature is matched against audio features to find the best match]
Slide 61
Learn audio representations from video
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Teacher networks for visual object and scene detection generate weak labels.
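The teacher-student transfer can be sketched as distribution matching: the vision teacher's softmax over object/scene classes becomes the target for the audio student via a KL-divergence loss. The toy logits below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(7)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence between teacher and student class distributions."""
    return float(np.sum(p * np.log(p / q)))

K = 10  # number of object/scene classes
# Teacher: a pretrained vision network labels the video frames (weak labels).
p_teacher = softmax(rng.standard_normal(K))

# Student: an audio network sees only the soundtrack of the same video.
p_student = softmax(rng.standard_normal(K))

# SoundNet-style objective: make the audio network's class distribution
# match the vision teacher's distribution for the same video.
loss = kl(p_teacher, p_student)
print(loss)  # non-negative; zero only when the two distributions agree
```

Because the labels come from the teacher rather than humans, the student can be trained on very large amounts of unlabeled video.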
Slide 66
Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICCVW 2017.
Video to Speech (Speechreading)
Trainable modules
Slide 70
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama:
learning lip sync from audio." SIGGRAPH 2017.
Speech to Video Synthesis (mouth)
Slide 72
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017.
Speech to Video Synthesis (pose & emotion)
Slide 76
Deep Learning online courses by UPC:
http://dlai.deeplearning.barcelona/
https://telecombcn-dl.github.io/2017-dlcv/
http://imatge-upc.github.io/telecombcn-2016-dlcv/
https://telecombcn-dl.github.io/2017-dlsl/
Autumn semester / Winter School (24-30 January 2018) / Summer School (early July 2018)