https://mcv-m6-video.github.io/deepvideo-2020/
Self-supervised audiovisual learning exploits the synchronization between pixels and audio recorded in video files. This lecture reviews the state of the art in deep neural networks trained with this approach, which does not require any manual annotation from humans.
1. Module 6 - Day 10 - Lecture 2
Self-supervised
Audiovisual Learning
26th March 2020
Xavier Giró-i-Nieto
Deep Learning for Video lectures:
[https://mcv-m6-video.github.io/deepvideo-2020/]
@DocXavi
15. 15
[Figure: the prediction ŷ is compared against a pseudo-label y through a loss function]
Representations learned without labels
Self-supervised learning is a form of unsupervised learning where the data
provides the supervision.
● By defining a proxy loss, the NN learns representations that should be
valuable for the actual target task.
Self-supervised Learning
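As a rough sketch of how such a proxy loss is wired up (illustrative PyTorch-style code; the encoder, proxy head and pseudo-label source are hypothetical placeholders, not a model from any of the cited papers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical encoder and proxy head: the encoder is what we keep for the
# downstream task, the head only serves the self-supervised proxy task.
encoder = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3),
                        nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1),
                        nn.Flatten())
proxy_head = nn.Linear(64, 30)   # e.g. 30 pseudo-classes derived from the data itself

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(proxy_head.parameters()))

def training_step(frames, pseudo_labels):
    """frames: (B, 3, H, W) video frames; pseudo_labels: (B,) targets obtained
    from the data itself (e.g. audio clusters), not from human annotation."""
    y_hat = proxy_head(encoder(frames))           # prediction ŷ
    loss = F.cross_entropy(y_hat, pseudo_labels)  # proxy loss against pseudo-label y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```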
21. 21
Audio Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
22. 22
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Use videos to train a CNN that predicts the audio statistics of a frame.
Audio Feature Learning
23. 23
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Task: Use the predicted audio stats to cluster images. Audio clusters are built with
the K-means algorithm over the training set.
Cluster assignments at test time (one row = one cluster)
Audio Feature Learning
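A minimal sketch of this pretext task, assuming the audio statistics have already been extracted into feature arrays (file names and the cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical precomputed data: one audio-statistics vector per training frame.
audio_stats = np.load("train_audio_stats.npy")    # (N, D)

# Cluster the audio statistics of the training set; the cluster index of each
# frame becomes its pseudo-label for training the image CNN.
kmeans = KMeans(n_clusters=30, random_state=0).fit(audio_stats)
pseudo_labels = kmeans.labels_                    # (N,) targets, no human annotation

# At test time, the audio stats predicted from a frame are assigned to the
# nearest centroid, grouping images with similar ambient sound.
test_stats = np.load("test_predicted_audio_stats.npy")
test_clusters = kmeans.predict(test_stats)
```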
24. 24
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Although the CNN was not trained with class labels, local units with semantic
meaning emerge.
Audio Feature Learning
25. 25
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Video sonorization: Retrieve matching sounds for videos of people hitting objects
with a drumstick.
Audio Feature Learning (self-supervised)
26. 26
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
The Greatest Hits Dataset
Audio Feature Learning (self-supervised)
27. 27
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
The predicted sound features are used for audio clip retrieval, so the pipeline is not end-to-end.
Audio Feature Learning (self-supervised)
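A minimal sketch of the retrieval step, assuming precomputed sound features for the training clips (array names and the L2 metric are illustrative, not the paper's exact setup):

```python
import numpy as np

# Hypothetical precomputed features: one sound-feature vector per training clip.
train_feats = np.load("train_sound_features.npy")                    # (N, D)
train_waveforms = np.load("train_waveforms.npy", allow_pickle=True)  # N audio clips

def sonorize(predicted_feat):
    """Return the training-set audio clip whose sound features are closest
    (L2 nearest neighbour) to the features predicted from the silent video."""
    dists = np.linalg.norm(train_feats - predicted_feat, axis=1)
    return train_waveforms[np.argmin(dists)]
```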
28. 28
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
29. 29
Speaker ID Feature Learning (disentangled)
Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled Speech Embeddings using Cross-modal
Self-supervision." ICASSP 2020.
30. 30
Speaker ID Feature Learning (disentangled)
Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled Speech Embeddings using Cross-modal
Self-supervision." ICASSP 2020.
31. 31
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Teacher network: Visual Recognition (objects & scenes)
Audio Feature Learning (label transfer)
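A minimal sketch of the teacher-student label transfer, assuming a frozen visual teacher whose class posteriors the audio network is trained to match (names and tensor shapes are placeholders, not SoundNet's code):

```python
import torch
import torch.nn.functional as F

def label_transfer_loss(student_logits, teacher_probs):
    """KL divergence between the teacher's posterior over visual classes
    (objects/scenes) and the audio network's prediction for the same video."""
    log_q = F.log_softmax(student_logits, dim=1)
    return F.kl_div(log_q, teacher_probs, reduction="batchmean")

# Example usage with random tensors standing in for real network outputs.
teacher_probs = torch.softmax(torch.randn(8, 1000), dim=1)  # frozen visual teacher
student_logits = torch.randn(8, 1000, requires_grad=True)   # audio (student) network
loss = label_transfer_loss(student_logits, teacher_probs)
loss.backward()
```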
32. 32
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS
2016.
33. 33
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
34. 34
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
35. 35
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
36. 36
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
37. 37
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7)
Audio Feature Learning (label transfer)
38. 38
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7)
Audio Feature Learning (label transfer)
39. 39
S Albanie, A Nagrani, A Vedaldi, A Zisserman, “Emotion Recognition in Speech using Cross-Modal Transfer in the Wild”
ACM Multimedia 2018.
Teacher network: Facial Emotion Recognition (visual)
Audio Feature Learning (label transfer)
42. 42
Joint Feature Learning
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video
and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
43. 43
Joint Feature Learning
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video
and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
[Figure: the audio feature retrieves its best match]
44. 44
[Figure: visual and audio features are projected into a joint space, where the best match is retrieved]
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video
and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
Joint Feature Learning
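A minimal sketch of joint-embedding retrieval, assuming precomputed visual and audio features and hypothetical projection heads (all dimensions are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical projection heads mapping precomputed visual and audio features
# into a shared embedding space.
visual_proj = nn.Linear(4096, 256)
audio_proj = nn.Linear(1024, 256)

def embed(visual_feat, audio_feat):
    """L2-normalized embeddings of both modalities in the joint space."""
    v = F.normalize(visual_proj(visual_feat), dim=-1)
    a = F.normalize(audio_proj(audio_feat), dim=-1)
    return v, a

def retrieve_best_match(query_audio, visual_bank):
    """Cross-modal retrieval: rank all visual embeddings by cosine similarity
    to the audio query and return the index of the best match."""
    sims = visual_bank @ query_audio          # cosine similarity (unit vectors)
    return torch.argmax(sims).item()
```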
45. 45
Joint Feature Learning
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
46. 46
Joint Feature Learning
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
47. 47
Joint Feature Learning
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
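A simplified sketch of a synchronization-based contrastive objective of this kind (margin, embedding sizes and labels are illustrative, not the exact AVTS formulation):

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(video_emb, audio_emb, is_synced, margin=0.99):
    """Pull embeddings of in-sync audio/video pairs together and push
    out-of-sync (temporally shifted) pairs apart.
    is_synced is 1.0 for aligned pairs and 0.0 for shifted ones."""
    dist = F.pairwise_distance(video_emb, audio_emb)
    pos = is_synced * dist.pow(2)
    neg = (1 - is_synced) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

# Example with random embeddings standing in for the two network towers.
v = torch.randn(8, 128)
a = torch.randn(8, 128)
labels = torch.randint(0, 2, (8,)).float()
loss = sync_contrastive_loss(v, a, labels)
```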
50. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Most activated unit in pool4 layer of the visual network.
Matching Networks
51. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Visual features used to train a linear classifier on ImageNet.
Matching Networks
52. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Most activated unit in pool4 layer of the audio network
Matching Networks
53. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Audio features achieve state of the art performance.
Matching Networks
54. Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
Matching Networks (sound localization)
56. 56
Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal
biometric matching." CVPR 2018.
Matching Networks
57. 57
Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal
biometric matching." CVPR 2018.
58. 58
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Skeleton animation
c. Face Hallucination
d. Sound separation
e. Visual re-dubbing (“deep fakes”)
60. 60
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
61. 61
Speech Reconstruction
Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: speech reconstruction from silent video." ICASSP 2017.
Pipeline: a frame from a silent video is fed to a CNN (VGG) that predicts an audio feature, which is then converted into a waveform by post-hoc synthesis.
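An illustrative sketch of the frame-to-audio-feature regression, not the authors' code (the backbone, feature dimension and loss are placeholders):

```python
import torch
import torch.nn as nn

class FrameToAudioFeature(nn.Module):
    """Regress an acoustic feature vector from each silent-video frame;
    a separate vocoder then turns the features into audio (post-hoc synthesis)."""
    def __init__(self, feature_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in for a VGG backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.regressor = nn.Linear(64, feature_dim)  # audio feature per frame

    def forward(self, frames):                       # (B, 3, H, W)
        return self.regressor(self.backbone(frames)) # (B, feature_dim)

model = FrameToAudioFeature()
pred = model(torch.randn(4, 3, 112, 112))
loss = nn.functional.mse_loss(pred, torch.randn(4, 32))  # regress to true features
```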
62. 62
Speech Reconstruction
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV 2017
Workshop on Computer Vision for Audio-Visual Media. 2017.
63. 63
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV Workshop on
Computer Vision for Audio-Visual Media. 2017.
65. 65
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
66. 66
Face Hallucination from Low-res
Liu, C., Shum, H. Y., & Zhang, C. S. A two-step approach to hallucinating faces: global parametric model and local nonparametric
model. CVPR 2001.
67. 67
Face Hallucination from Speech
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
68. 68
Generated faces from known identities.
Face Hallucination from Speech
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
69. 69
Faces generated from averaged speech inputs.
Face Hallucination from Speech
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
70. 70
Interpolated faces from interpolated speech inputs.
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
Face Hallucination from Speech
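An illustrative sketch of a speech-conditioned generator and the embedding interpolation shown on the slide (architecture and dimensions are placeholders, not the Wav2Pix implementation):

```python
import torch
import torch.nn as nn

class SpeechConditionedGenerator(nn.Module):
    """Decode a speech embedding plus a noise vector into a (tiny) face image."""
    def __init__(self, speech_dim=128, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speech_dim + noise_dim, 4 * 4 * 256), nn.ReLU(),
            nn.Unflatten(1, (256, 4, 4)),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, speech_emb, noise):
        return self.net(torch.cat([speech_emb, noise], dim=1))

G = SpeechConditionedGenerator()
z = torch.randn(1, 100)
emb_a, emb_b = torch.randn(1, 128), torch.randn(1, 128)
# Interpolated faces from interpolated speech embeddings (as on the slide).
for alpha in torch.linspace(0, 1, 5):
    face = G(alpha * emb_a + (1 - alpha) * emb_b, z)   # (1, 3, 16, 16) image
```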
71. 71
Face Hallucination from Speech
#Speech2Face Oh, Tae-Hyun, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech
Matusik. "Speech2face: Learning the face behind a voice." CVPR 2019.
72. 72
#Speech2Face Oh, Tae-Hyun, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech
Matusik. "Speech2face: Learning the face behind a voice." CVPR 2019.
Face Hallucination from Speech
74. 74
Avatar animation with music (skeletons)
Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (2018). Audio to body dynamics. CVPR 2018.
75. Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. Audio to body dynamics. CVPR 2018.
76. 76
Sound Source Localization
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual
Scenes." CVPR 2018.
77. 77
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual Scenes."
CVPR 2018.
78. 78
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
80. 80
Speech Separation with Vision (lips)
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement."
Interspeech 2018.
82. 82
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
87. 87
Visual Re-dubbing (lip keypoints)
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
88. 88
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
89. 89
Visual Re-dubbing (3D meshes)
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end
learning of pose and emotion." SIGGRAPH 2017
90. 90
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
91. 91
Visual Re-dubbing (3D meshes)
Daniel Cudeiro*, Timo Bolkart*, Cassidy Laidlaw, Anurag Ranjan and Michael J. Black, “Capture, Learning and Synthesis of
3D Speaking Styles” CVPR 2019
95. 95
Chen, Lele, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. "Lip Movements Generation at a Glance." ECCV
2018.
Visual Re-dubbing (pixels)