One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018

DEEP LEARNING
MEETS COMPUTER
VISION
Barcelona, Catalonia
22 November 2018
One Perceptron
to Rule them All:
Deep Learning for Multimedia
Talk III
#A2IC2018
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya (UPC)
Barcelona Supercomputing Center (BSC)

11
Slide concept: Perronnin, F., Tutorial on LSVR @ CVPR’14, Output embedding for LSVR
One-hot Representation (Embeddings)
[1,0,0]
[0,1,0]
[0,0,1]
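
A minimal sketch of the one-hot representation above. The three class names are hypothetical and only serve the illustration; the slide shows just the vectors.

```python
# Hypothetical class list; only the three one-hot vectors come from the slide.
classes = ["cat", "dog", "bird"]

def one_hot(label, classes):
    """Return a list with 1 at the index of `label` and 0 everywhere else."""
    vec = [0] * len(classes)
    vec[classes.index(label)] = 1
    return vec

codes = {c: one_hot(c, classes) for c in classes}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Any two distinct one-hot vectors have a dot product of zero, so this encoding carries no notion of similarity between classes.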

14
Image Encoding
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks”
NIPS 2012
Cat
CNN

17
Fig: Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation."
EMNLP 2014.
(1) One-hot encoding
(2) Word embedding
(3) Sentence representation
Text Encoding
RNN
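
The three steps listed on this slide can be sketched as follows. This is a simplified illustration, not the model from the cited paper: it uses a plain tanh recurrence with random, untrained weights, and the toy vocabulary and layer sizes are assumptions.

```python
import math
import random

random.seed(0)

vocab = ["<unk>", "the", "cat", "sat"]  # hypothetical toy vocabulary
embed_dim, hidden_dim = 3, 4            # arbitrary sizes for illustration

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(len(vocab), embed_dim)     # word embedding table, one row per word
W_x = rand_matrix(hidden_dim, embed_dim)   # input-to-hidden weights
W_h = rand_matrix(hidden_dim, hidden_dim)  # hidden-to-hidden weights

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def encode(sentence):
    """Run a tanh RNN over the words; the last state is the sentence vector."""
    h = [0.0] * hidden_dim
    for word in sentence.split():
        idx = vocab.index(word) if word in vocab else 0  # (1) position of the 1 in the one-hot code
        x = E[idx]                                       # (2) lookup, same as one-hot times E
        pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, h))]
        h = [math.tanh(p) for p in pre]                  # recurrent state update
    return h                                             # (3) sentence representation

sentence_vec = encode("the cat sat")
```

The returned vector has a fixed length regardless of how many words the sentence contains, which is what lets a decoder consume it later.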

18
Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. "Convolutional sequence to sequence
learning." ICML 2017.
Text Encoding
CNN

20
Representation
Raw
MFCC
Mel spectrum
Encoder

21
Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition." ICASSP 2016.
Speech Encoding
Pascual, Santiago, Antonio Bonafonte, and Joan Serra. "SEGAN: Speech enhancement generative adversarial network."
Interspeech 2017.
RNN
CNN

25
Text Decoding
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
RNN
Representation

26
Text Decoding
CNN
Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. "Convolutional sequence to sequence
learning." ICML 2017.

28
Image Decoding
CNN
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative
adversarial networks." ICLR 2016. #DCGAN

30
Audio Decoding
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua
Bengio. "SampleRNN: An unconditional end-to-end neural audio generation model." ICLR 2017.
RNN
CNN
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).

31
Encoder Decoder
Representation

32
Encoder Decoder
Representation

33
Neural Machine Translation (NMT)
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

34
Encoder Decoder
Representation

35
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation."
MICCAI 2015.
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial
networks." CVPR 2017.

36
Encoder Decoder
Representation

37
Speech Enhancement
Pascual, Santiago, Antonio Bonafonte, and Joan Serra. "SEGAN: Speech enhancement generative adversarial network."
Interspeech 2017.

38
Encoder Decoder
Representation

39
Automatic Speech Recognition (ASR)
Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition." ICASSP 2016. #LAS
Listener (encoder)
Speller (decoder)

40
Encoder Decoder
Representation

41
Speech Synthesis
Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. "WaveGlow: A Flow-based Generative Network for Speech Synthesis." arXiv
preprint arXiv:1811.00002 (2018).

42
Encoder Decoder
Representation

43
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator."
CVPR 2015.
Image Captioning

44
Lip Reading
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level
Lipreading." (2016).

45
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading."
(2016).

46
Encoder Decoder
Representation

47
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Text-to-Image

48
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Text-to-Image

49
Encoder
Decoder
Representation
Encoder
Representation

50
Visual Question Answering
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA:
Visual question answering." CVPR 2015.

51
Encoder Encoder
Representation

52
Joint Representations (Embeddings)
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep
visual-semantic embedding model." NIPS 2013.
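
A joint embedding of this kind is commonly trained with a hinge ranking loss that pushes the image closer to its own label vector than to any other label vector. The sketch below follows that idea under stated assumptions: the 2-D vectors are invented, the image vector is assumed to be already projected into the label space, and the margin value is illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hinge_rank_loss(image_vec, label, label_vecs, margin=0.1):
    """Sum the margin violations of every wrong label against the true one."""
    true_score = dot(label_vecs[label], image_vec)
    return sum(
        max(0.0, margin - true_score + dot(vec, image_vec))
        for name, vec in label_vecs.items()
        if name != label
    )

# Hypothetical 2-D label embeddings (in practice they would come from a text model).
label_vecs = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [-1.0, 0.0]}
image_vec = [0.9, 0.1]  # assumed output of the visual model after projection

loss_correct = hinge_rank_loss(image_vec, "cat", label_vecs)
loss_wrong = hinge_rank_loss(image_vec, "car", label_vecs)
```

With these toy values the loss is zero when the image is paired with "cat" and large when it is paired with "car", so gradient descent would only move embeddings for the mismatched pair.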

53
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A. "Zero-shot learning through cross-modal transfer." NIPS 2013.
No images of “cat” in the training set...
...but they can still be recognised as “cats” thanks to the representations learned from text.
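
The zero-shot idea can be sketched as a nearest-neighbour lookup in the word space. This is a simplification under stated assumptions: the word vectors and the projected image are invented numbers, and the sketch leaves out any other component the cited method may use.

```python
import math

def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm

# Hypothetical word vectors learned from text only; "cat" has no training images.
word_vecs = {
    "dog":   [0.9, 0.4, 0.0],
    "truck": [0.0, 0.1, 1.0],
    "cat":   [0.8, 0.6, 0.1],
}

def classify(image_in_word_space):
    """Pick the class whose word vector is closest to the projected image."""
    return max(word_vecs, key=lambda c: cosine(word_vecs[c], image_in_word_space))

# Assumed output of an image-to-word-space mapping trained on dog and truck images only.
projected_cat_image = [0.7, 0.7, 0.1]
prediction = classify(projected_cat_image)
```

Because "cat" sits near "dog" in the text-derived space, an image that lands in that neighbourhood can be assigned to "cat" even though no cat image was seen during training.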

54
Visual Search of a Recipe
Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber,
Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food
Images”. CVPR 2017 #pic2recipe

55
Encoder Encoder
Representation

56
Video Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Sound Generation
2-layer LSTM
For sound feature regression
AlexNet FC7
For image feature extraction
Sound retrieval
Two stream CNN
At each time step: an RGB frame (one stream) and
a 3-frame spatiotemporal image (current, previous and following frames)

57
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.

58
Encoder Encoder
Representation

59
Audio search of a matching video...
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for
Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
Best
match
Audio feature

60
...or video search of a matching audio
Best
match
Visual feature Audio feature
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for
Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
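
Once audio and visual features live in a shared space, retrieval in either direction reduces to a similarity ranking. A minimal sketch, assuming cosine similarity and invented 2-D embeddings (the names `video_a`, `audio_a` and so on are hypothetical):

```python
import math

def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm

def best_match(query, candidates):
    """Return the key of the candidate embedding most similar to the query."""
    return max(candidates, key=lambda k: cosine(query, candidates[k]))

# Hypothetical embeddings, assumed to already live in the shared space.
video_embs = {"video_a": [0.9, 0.1], "video_b": [0.1, 0.9]}
audio_embs = {"audio_a": [0.8, 0.3], "audio_b": [0.2, 0.7]}

video_for_audio = best_match(audio_embs["audio_b"], video_embs)  # audio query, video result
audio_for_video = best_match(video_embs["video_a"], audio_embs)  # video query, audio result
```

The same `best_match` function serves both directions because the two encoders write into one space.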

61
Audio vs Pixels Alignment
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Object and scene recognition in videos by analysing only the audio track.

62
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016.

63
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.

64
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.

65
Encoder Encoder
Representation

66
Speech vs Pixels Alignment
Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual context." NIPS 2016.
Train visual and speech networks with pairs of (non-)corresponding images and speech.
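
One common way to train on corresponding and non-corresponding pairs is a two-way margin ranking loss over similarity scores. The sketch below is an assumption about the general recipe, not a reproduction of the cited training code; the vectors, the dot-product similarity and the margin are all illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pair_ranking_loss(image, speech, impostor_image, impostor_speech, margin=1.0):
    """Penalise impostors that score within `margin` of the matching pair."""
    s_match = dot(image, speech)
    loss_speech = max(0.0, margin - s_match + dot(image, impostor_speech))
    loss_image = max(0.0, margin - s_match + dot(impostor_image, speech))
    return loss_speech + loss_image

# Hypothetical embeddings produced by the visual and speech networks.
image, speech = [1.0, 0.0], [2.0, 0.0]
impostor_image, impostor_speech = [0.0, 1.0], [0.0, 2.0]

loss_matched = pair_ranking_loss(image, speech, impostor_image, impostor_speech)
loss_swapped = pair_ranking_loss(image, impostor_speech, impostor_image, speech)
```

The loss is zero when the true pair already beats both impostors by the margin, and grows when a mismatched caption is treated as the match.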

67
Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual context." NIPS 2016.
The similarity curve shows which regions of the spectrogram are relevant to the image.
Important: no text transcriptions are used during training!

68
Harwath, David, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass. "Jointly
Discovering Visual Objects and Spoken Words from Raw Sensory Input." ECCV 2018.

69
Harwath, David, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass. "Jointly
Discovering Visual Objects and Spoken Words from Raw Sensory Input." ECCV 2018.
Regions matching the spoken word “WOMAN”:

70
Encoder Decoder
Representation

71
Speech to Pixels
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” (work in progress)

72
Encoder
Decoder
Representation
Encoder Representation

73
Speech & Pixels to Pixels
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.

bit.ly/DLCV2018
#DLUPC
74
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.

75
Speech to Lip Keypoints
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.

76
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017

77
Speech to 3D Models
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end
learning of pose and emotion." SIGGRAPH 2017

78
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017

79
Encoder Decoder
Representation

80
Speech Reconstruction
Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: speech reconstruction from silent video." ICASSP 2017.
CNN
(VGG)
Frame from a
silent video
Audio feature
Post-hoc
synthesis

81
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV Workshop on Computer Vision for Audio-Visual Media. 2017.

82
Encoder
Decoder
Representation
Encoder Representation

83
Speech Separation with Vision (lips)
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement."
Interspeech 2018.

84
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech
Enhancement." Interspeech 2018.

85
Encoder Decoder
Representation

@DocXavi
Xavier Giro-i-Nieto
Slides available at:
http://bit.ly/a2ic2018
xavier.giro@upc.edu
#A2IC2018
