DEEP LEARNING
MEETS COMPUTER
VISION
Barcelona, Catalonia
22 November 2018
One Perceptron
to Rule them All:
Deep Learning for Multimedia
Talk III
#A2IC2018
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politecnica de Catalunya (UPC)
Barcelona Supercomputing Center (BSC)
Acknowledgments
Densely linked slides
Text
Audio
Speech
Vision
Encoder
Representation
Slide concept: Perronnin, F., Tutorial on LSVR @ CVPR’14, Output embedding for LSVR
One-hot Representation (Embeddings)
[1,0,0]
[0,1,0]
[0,0,1]
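A minimal sketch of the one-hot representation listed above, in plain Python. The class names and their order are assumptions made for illustration; the slide only shows the three vectors and, on the next slide, the vector [0,1,0] for "Cat".

```python
def one_hot(index, num_classes):
    """Return a list with a single 1 at position `index` and 0 elsewhere."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(num_classes)]

# Hypothetical three-class vocabulary; the ordering is an assumption.
classes = ["dog", "cat", "bird"]
cat_vector = one_hot(classes.index("cat"), len(classes))
print(cat_vector)  # [0, 1, 0]
```

Each class gets its own axis, so all classes are equally distant from each other; the vector carries identity but no notion of similarity.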
Encoder
0
1
0
Cat
Encoder
Cat
Image Encoding
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks”
NIPS 2012
Cat
CNN
Encoder
Representation
Fig: Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation."
EMNLP 2014.
(1) One hot encoding
(2) Word embedding
(3) Sentence representation
Text Encoding
RNN
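The three stages in the legend above, (1) one-hot encoding, (2) word embedding and (3) sentence representation, can be sketched as follows. This is only an illustration under stated assumptions: the vocabulary, the matrices `E`, `W_x`, `W_h` and the plain tanh recurrence are hand-picked toy choices, not the gated units of the cited paper, and a real model learns these weights from data.

```python
import math

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Toy parameters chosen by hand (assumptions, not learned values).
vocab = {"the": 0, "cat": 1, "sleeps": 2}
E = [[0.1, 0.4, -0.3],            # embedding matrix: one column per word
     [0.2, -0.5, 0.6]]
W_x = [[1.0, 0.0], [0.0, 1.0]]    # input-to-hidden weights
W_h = [[0.5, 0.0], [0.0, 0.5]]    # hidden-to-hidden weights

def encode(sentence):
    """Return the final hidden state of a plain tanh RNN over the words."""
    h = [0.0, 0.0]
    for word in sentence.split():
        x = one_hot(vocab[word], len(vocab))   # (1) one-hot encoding
        e = matvec(E, x)                       # (2) word embedding (column lookup)
        pre = [a + b for a, b in zip(matvec(W_x, e), matvec(W_h, h))]
        h = [math.tanh(p) for p in pre]        # recurrent state update
    return h                                   # (3) sentence representation

print(encode("the cat sleeps"))
```

Multiplying the embedding matrix by a one-hot vector simply selects one column, which is why embeddings are usually implemented as a table lookup. Because the state is updated word by word, the final vector depends on word order.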
Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. "Convolutional sequence to sequence
learning." ICML 2017.
Text Encoding
CNN
Representation
Encoder
Representation
Raw
MFCC
Mel spectrum
Encoder
Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition." ICASSP 2016.
Speech Encoding
Pascual, Santiago, Antonio Bonafonte, and Joan Serra. "SEGAN: Speech enhancement generative adversarial network."
Interspeech 2017.
RNN
CNN
Encoder
Representation
Decoder
Representation
Text Decoding
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
RNN
Representation
Text Decoding
CNN
Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. "Convolutional sequence to sequence
learning." ICML 2017.
Decoder
Representation
Image Decoding
CNN
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative
adversarial networks." ICLR 2016. #DCGAN
Decoder
Representation
Audio Decoding
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua
Bengio. "SampleRNN: An unconditional end-to-end neural audio generation model." ICLR 2017.
RNN
CNN
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
Encoder Decoder
Representation
Neural Machine Translation (NMT)
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Encoder Decoder
Representation
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation."
MICCAI 2015.
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial
networks." CVPR 2017.
Encoder Decoder
Representation
Speech Enhancement
Pascual, Santiago, Antonio Bonafonte, and Joan Serra. "SEGAN: Speech enhancement generative adversarial network."
Interspeech 2017.
Encoder Decoder
Representation
Automatic Speech Recognition (ASR)
Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition." ICASSP 2016. #LAS
Listener (encoder)
Speller (decoder)
Encoder Decoder
Representation
Speech Synthesis
Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. "WaveGlow: A Flow-based Generative Network for Speech Synthesis." arXiv
preprint arXiv:1811.00002 (2018).
Encoder Decoder
Representation
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator."
CVPR 2015.
Image Captioning
Lip Reading
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level
Lipreading." (2016).
Encoder Decoder
Representation
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Text-to-Image
Encoder
Decoder
Representation
Encoder
Representation
Visual Question Answering
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA:
Visual question answering." ICCV 2015.
Encoder Encoder
Representation
Joint Representations (Embeddings)
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep
visual-semantic embedding model." NIPS 2013
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013
No images from “cat” in the training set...
...but they can still be recognised as “cats” thanks to the representations learned from text.
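A minimal sketch of this idea, assuming the visual encoder maps an image into the same space as word vectors learned from text. All vectors below are made-up toy values, and nearest-neighbour search by cosine similarity is one simple way to pick a label; it is not claimed to be the exact procedure of the cited paper.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical word vectors learned from text only.
word_vectors = {
    "dog":   [0.9, 0.1, 0.0],
    "truck": [0.0, 0.1, 0.9],
    "cat":   [0.8, 0.3, 0.1],   # class with no training images
}

def classify(image_embedding):
    """Return the word whose vector is closest to the image embedding."""
    return max(word_vectors, key=lambda w: cosine(image_embedding, word_vectors[w]))

# Stand-in for the visual encoder's output on a cat photo (assumed values).
image_embedding = [0.75, 0.35, 0.05]
print(classify(image_embedding))  # cat
```

The label "cat" is reachable even though no cat image was seen in training, because its word vector already sits near related classes in the shared space.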
Visual Search of a Recipe
Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber,
Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food
Images”. CVPR 2017 #pic2recipe
Encoder Encoder
Representation
Video Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Sound Generation
2-layer LSTM
For sound feature regression
AlexNet FC7
For image feature extraction
Sound retrieval
Two stream CNN
At each time step, one stream takes the RGB frame and the other a
3-frame spatiotemporal image (current, previous and following frames)
Encoder Encoder
Representation
Audio search of a matching video...
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for
Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
Best
match
Audio feature
...or video search of a matching audio
Best
match
Visual feature Audio feature
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for
Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
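Retrieval in both directions can be sketched as a nearest-neighbour search in the shared embedding space. The embeddings and item names below are invented toy values, and cosine similarity is an assumed choice of similarity measure.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def best_match(query, candidates):
    """Return the key of the candidate embedding most similar to the query."""
    return max(candidates, key=lambda k: cosine(query, candidates[k]))

# Hypothetical embeddings produced by the visual and audio encoders.
video_embeddings = {"video_a": [0.9, 0.1], "video_b": [0.1, 0.9]}
audio_embeddings = {"audio_a": [0.8, 0.2], "audio_b": [0.2, 0.7]}

# Audio search of a matching video...
print(best_match(audio_embeddings["audio_b"], video_embeddings))  # video_b
# ...or video search of a matching audio.
print(best_match(video_embeddings["video_a"], audio_embeddings))  # audio_a
```

The same function serves both directions because both modalities live in one space; only the roles of query and candidates are swapped.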
Audio vs Pixels Alignment
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Object and scene recognition in videos by analysing only the audio track.
Audio vs Pixels Alignment
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Encoder Encoder
Representation
Speech vs Pixels Alignment
Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual context." NIPS 2016.
Train visual and speech networks with pairs of (non-)corresponding images and speech.
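One common way to train with matching and non-matching pairs is a margin ranking (hinge) loss on the similarity of the two embeddings. The sketch below assumes dot-product similarity and a margin of 1; the slide does not state the exact objective, so treat both as assumptions. The embeddings are toy values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def margin_ranking_loss(image, speech, impostor_speech, margin=1.0):
    """Hinge loss: zero once the matching pair outscores the impostor by `margin`."""
    return max(0.0, margin - dot(image, speech) + dot(image, impostor_speech))

# Hypothetical embeddings from the visual and speech networks.
image = [1.0, 0.0]
matching_speech = [0.9, 0.1]      # caption spoken about this image
impostor_speech = [0.1, 0.9]      # caption spoken about another image

print(margin_ranking_loss(image, matching_speech, impostor_speech))
```

Minimising this loss pulls corresponding image and speech embeddings together and pushes non-corresponding ones apart, without needing any label beyond the pairing itself.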
Speech vs Pixels Alignment
Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual context." NIPS 2016.
The similarity curve shows which regions of the spectrogram are relevant for the image.
Important: no text transcriptions are used during training!
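A similarity curve of this kind can be sketched by scoring the image embedding against the embedding of each short segment of the spectrogram. The per-segment embeddings and the dot-product score below are assumptions for illustration, with toy values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity_curve(image_embedding, segment_embeddings):
    """Score the image against every speech segment, in temporal order."""
    return [dot(image_embedding, seg) for seg in segment_embeddings]

# Hypothetical embeddings: one image and three consecutive speech segments.
image_embedding = [1.0, 0.0]
segment_embeddings = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.3]]

curve = similarity_curve(image_embedding, segment_embeddings)
peak = max(range(len(curve)), key=curve.__getitem__)
print(curve, peak)
```

The peak of the curve marks the stretch of speech that best matches the image content, which is how word-like units can emerge without transcriptions.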
Speech vs Pixels Alignment
Harwath, David, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass. "Jointly
Discovering Visual Objects and Spoken Words from Raw Sensory Input." ECCV 2018.
Speech vs Pixels Alignment
Harwath, David, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass. "Jointly
Discovering Visual Objects and Spoken Words from Raw Sensory Input." ECCV 2018.
Regions matching the spoken word “WOMAN”:
Encoder Decoder
Representation
Speech to Pixels
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” (work in progress)
Encoder
Decoder
Representation
Encoder Representation
Speech & Pixels to Pixels
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
bit.ly/DLCV2018
#DLUPC
Speech to Lip Keypoints
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
Speech to 3D Models
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end
learning of pose and emotion." SIGGRAPH 2017
Encoder Decoder
Representation
Speech Reconstruction
Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: speech reconstruction from silent video." ICASSP 2017.
Pipeline: frame from a silent video, CNN (VGG), audio feature, post-hoc synthesis
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV Workshop on Computer Vision for Audio-Visual Media. 2017.
Encoder
Decoder
Representation
Encoder Representation
Speech Separation with Vision (lips)
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement."
Interspeech 2018.
Encoder Decoder
Representation
Text
Audio
Speech
Vision
@DocXavi
Xavier Giro-i-Nieto
Slides available at:
http://bit.ly/a2ic2018
xavier.giro@upc.edu
#A2IC2018
