https://mcv-m6-video.github.io/deepvideo-2020/
Self-supervised audiovisual learning exploits the synchronization between pixels and audio recorded in video files. This lecture reviews the state of the art in deep neural networks trained with this approach, which does not require any manual annotation from humans.
1. Module 6 - Day 10 - Lecture 2
Self-supervised
Audiovisual Learning
26th March 2020
Xavier Giró-i-Nieto
Deep Learning for Video lectures:
[https://mcv-m6-video.github.io/deepvideo-2020/]
@DocXavi
15. 15
[Figure: the prediction ŷ is compared against a pseudo-label y through a loss function]
Representations learned without labels
Self-supervised learning is a form of unsupervised learning where the data
provides the supervision.
● By defining a proxy loss, the NN learns representations that should be
valuable for the actual target task.
Self-supervised Learning
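As a rough sketch of how such a proxy loss is wired up (illustrative PyTorch-style code; the encoder, proxy head and pseudo-label source are hypothetical placeholders, not a model from any of the cited papers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical encoder and proxy head: the encoder is what we keep for the
# downstream task, the head only serves the self-supervised proxy task.
encoder = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3),
                        nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1),
                        nn.Flatten())
proxy_head = nn.Linear(64, 30)   # e.g. 30 pseudo-classes derived from the data itself

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(proxy_head.parameters()))

def training_step(frames, pseudo_labels):
    """frames: (B, 3, H, W) video frames; pseudo_labels: (B,) targets obtained
    from the data itself (e.g. audio clusters), not from human annotation."""
    y_hat = proxy_head(encoder(frames))           # prediction ŷ
    loss = F.cross_entropy(y_hat, pseudo_labels)  # proxy loss against pseudo-label y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```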
21. 21
Audio Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
22. 22
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Use videos to train a CNN that predicts the audio statistics of a frame.
Audio Feature Learning
23. 23
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Task: Use the predicted audio stats to cluster images. Audio clusters are built with
the K-means algorithm over the training set.
Cluster assignments at test time (one row = one cluster)
Audio Feature Learning
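A minimal sketch of this pretext task, assuming the audio statistics have already been extracted into feature arrays (file names and the cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical precomputed data: one audio-statistics vector per training frame.
audio_stats = np.load("train_audio_stats.npy")    # (N, D)

# Cluster the audio statistics of the training set; the cluster index of each
# frame becomes its pseudo-label for training the image CNN.
kmeans = KMeans(n_clusters=30, random_state=0).fit(audio_stats)
pseudo_labels = kmeans.labels_                    # (N,) targets, no human annotation

# At test time, the audio stats predicted from a frame are assigned to the
# nearest centroid, grouping images with similar ambient sound.
test_stats = np.load("test_predicted_audio_stats.npy")
test_clusters = kmeans.predict(test_stats)
```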
24. 24
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Although the CNN was not trained with class labels, local units with semantic
meaning emerge.
Audio Feature Learning
25. 25
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Video sonorization: Retrieve matching sounds for videos of people hitting objects
with a drumstick.
Audio Feature Learning (self-supervised)
26. 26
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
The Greatest Hits Dataset
Audio Feature Learning (self-supervised)
27. 27
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
The predicted sound features are used for audio clip retrieval, so the pipeline is not end-to-end.
Audio Feature Learning (self-supervised)
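A minimal sketch of the retrieval step, assuming precomputed sound features for the training clips (array names and the L2 metric are illustrative, not the paper's exact setup):

```python
import numpy as np

# Hypothetical precomputed features: one sound-feature vector per training clip.
train_feats = np.load("train_sound_features.npy")                    # (N, D)
train_waveforms = np.load("train_waveforms.npy", allow_pickle=True)  # N audio clips

def sonorize(predicted_feat):
    """Return the training-set audio clip whose sound features are closest
    (L2 nearest neighbour) to the features predicted from the silent video."""
    dists = np.linalg.norm(train_feats - predicted_feat, axis=1)
    return train_waveforms[np.argmin(dists)]
```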
28. 28
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
29. 29
Speaker ID Feature Learning (disentangled)
Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled Speech Embeddings using Cross-modal
Self-supervision." ICASSP 2020.
30. 30
Speaker ID Feature Learning (disentangled)
Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled Speech Embeddings using Cross-modal
Self-supervision." ICASSP 2020.
31. 31
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Teacher network: Visual Recognition (objects & scenes)
Audio Feature Learning (label transfer)
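A minimal sketch of the teacher-student label transfer, assuming a frozen visual teacher whose class posteriors the audio network is trained to match (names and tensor shapes are placeholders, not SoundNet's code):

```python
import torch
import torch.nn.functional as F

def label_transfer_loss(student_logits, teacher_probs):
    """KL divergence between the teacher's posterior over visual classes
    (objects/scenes) and the audio network's prediction for the same video."""
    log_q = F.log_softmax(student_logits, dim=1)
    return F.kl_div(log_q, teacher_probs, reduction="batchmean")

# Example usage with random tensors standing in for real network outputs.
teacher_probs = torch.softmax(torch.randn(8, 1000), dim=1)  # frozen visual teacher
student_logits = torch.randn(8, 1000, requires_grad=True)   # audio (student) network
loss = label_transfer_loss(student_logits, teacher_probs)
loss.backward()
```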
32. 32
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS
2016.
33. 33
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
34. 34
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
35. 35
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
36. 36
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
37. 37
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7)
Audio Feature Learning (label transfer)
38. 38
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7)
Audio Feature Learning (label transfer)
39. 39
S Albanie, A Nagrani, A Vedaldi, A Zisserman, “Emotion Recognition in Speech using Cross-Modal Transfer in the Wild”
ACM Multimedia 2018.
Teacher network: Facial Emotion Recognition (visual)
Audio Feature Learning (label transfer)
42. 42
Joint Feature Learning
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video
and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
43. 43
Joint Feature Learning
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video
and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
[Figure: the audio feature retrieves its best match]
44. 44
[Figure: visual and audio features are projected into a joint space, where the best match is retrieved]
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video
and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
Joint Feature Learning
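A minimal sketch of joint-embedding retrieval, assuming precomputed visual and audio features and hypothetical projection heads (all dimensions are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical projection heads mapping precomputed visual and audio features
# into a shared embedding space.
visual_proj = nn.Linear(4096, 256)
audio_proj = nn.Linear(1024, 256)

def embed(visual_feat, audio_feat):
    """L2-normalized embeddings of both modalities in the joint space."""
    v = F.normalize(visual_proj(visual_feat), dim=-1)
    a = F.normalize(audio_proj(audio_feat), dim=-1)
    return v, a

def retrieve_best_match(query_audio, visual_bank):
    """Cross-modal retrieval: rank all visual embeddings by cosine similarity
    to the audio query and return the index of the best match."""
    sims = visual_bank @ query_audio          # cosine similarity (unit vectors)
    return torch.argmax(sims).item()
```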
45. 45
Joint Feature Learning
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
46. 46
Joint Feature Learning
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
47. 47
Joint Feature Learning
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
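A simplified sketch of a synchronization-based contrastive objective of this kind (margin, embedding sizes and labels are illustrative, not the exact AVTS formulation):

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(video_emb, audio_emb, is_synced, margin=0.99):
    """Pull embeddings of in-sync audio/video pairs together and push
    out-of-sync (temporally shifted) pairs apart.
    is_synced is 1.0 for aligned pairs and 0.0 for shifted ones."""
    dist = F.pairwise_distance(video_emb, audio_emb)
    pos = is_synced * dist.pow(2)
    neg = (1 - is_synced) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

# Example with random embeddings standing in for the two network towers.
v = torch.randn(8, 128)
a = torch.randn(8, 128)
labels = torch.randint(0, 2, (8,)).float()
loss = sync_contrastive_loss(v, a, labels)
```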
50. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Most activated unit in pool4 layer of the visual network.
Matching Networks
51. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Visual features used to train a linear classifier on ImageNet.
Matching Networks
52. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Most activated unit in pool4 layer of the audio network
Matching Networks
53. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Audio features achieve state of the art performance.
Matching Networks
54. Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
Matching Networks (sound localization)
56. 56
Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal
biometric matching." CVPR 2018.
Matching Networks
57. 57
Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal
biometric matching." CVPR 2018.
58. 58
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Skeleton animation
c. Face Hallucination
d. Sound separation
e. Visual re-dubbing (“deep fakes”)
60. 60
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
61. 61
Speech Reconstruction
Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: speech reconstruction from silent video." ICASSP 2017.
Pipeline: a frame from a silent video is fed to a CNN (VGG) that predicts an audio feature, which is then converted into a waveform by post-hoc synthesis.
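An illustrative sketch of the frame-to-audio-feature regression, not the authors' code (the backbone, feature dimension and loss are placeholders):

```python
import torch
import torch.nn as nn

class FrameToAudioFeature(nn.Module):
    """Regress an acoustic feature vector from each silent-video frame;
    a separate vocoder then turns the features into audio (post-hoc synthesis)."""
    def __init__(self, feature_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in for a VGG backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.regressor = nn.Linear(64, feature_dim)  # audio feature per frame

    def forward(self, frames):                       # (B, 3, H, W)
        return self.regressor(self.backbone(frames)) # (B, feature_dim)

model = FrameToAudioFeature()
pred = model(torch.randn(4, 3, 112, 112))
loss = nn.functional.mse_loss(pred, torch.randn(4, 32))  # regress to true features
```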
62. 62
Speech Reconstruction
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV 2017
Workshop on Computer Vision for Audio-Visual Media. 2017.
63. 63
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV Workshop on
Computer Vision for Audio-Visual Media. 2017.
65. 65
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
66. 66
Face Hallucination from Low-res
Liu, C., Shum, H. Y., & Zhang, C. S. A two-step approach to hallucinating faces: global parametric model and local nonparametric
model. CVPR 2001.
67. 67
Face Hallucination from Speech
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
68. 68
Generated faces from known identities.
Face Hallucination from Speech
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
69. 69
Faces generated from averaged speech inputs.
Face Hallucination from Speech
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
70. 70
Interpolated faces from interpolated speech inputs.
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
Face Hallucination from Speech
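An illustrative sketch of a speech-conditioned generator and the embedding interpolation shown on the slide (architecture and dimensions are placeholders, not the Wav2Pix implementation):

```python
import torch
import torch.nn as nn

class SpeechConditionedGenerator(nn.Module):
    """Decode a speech embedding plus a noise vector into a (tiny) face image."""
    def __init__(self, speech_dim=128, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speech_dim + noise_dim, 4 * 4 * 256), nn.ReLU(),
            nn.Unflatten(1, (256, 4, 4)),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, speech_emb, noise):
        return self.net(torch.cat([speech_emb, noise], dim=1))

G = SpeechConditionedGenerator()
z = torch.randn(1, 100)
emb_a, emb_b = torch.randn(1, 128), torch.randn(1, 128)
# Interpolated faces from interpolated speech embeddings (as on the slide).
for alpha in torch.linspace(0, 1, 5):
    face = G(alpha * emb_a + (1 - alpha) * emb_b, z)   # (1, 3, 16, 16) image
```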
71. 71
Face Hallucination from Speech
#Speech2Face Oh, Tae-Hyun, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech
Matusik. "Speech2face: Learning the face behind a voice." CVPR 2019.
72. 72
#Speech2Face Oh, Tae-Hyun, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech
Matusik. "Speech2face: Learning the face behind a voice." CVPR 2019.
Face Hallucination from Speech
74. 74
Avatar animation with music (skeletons)
Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (2018). Audio to body dynamics. CVPR 2018.
75. Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. Audio to body dynamics. CVPR 2018.
76. 76
Sound Source Localization
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual
Scenes." CVPR 2018.
77. 77
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual Scenes."
CVPR 2018.
78. 78
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
80. 80
Speech Separation with Vision (lips)
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement."
Interspeech 2018.
82. 82
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
87. 87
Visual Re-dubbing (lip keypoints)
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
88. 88
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
89. 89
Visual Re-dubbing (3D meshes)
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end
learning of pose and emotion." SIGGRAPH 2017
90. 90
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
91. 91
Visual Re-dubbing (3D meshes)
Daniel Cudeiro*, Timo Bolkart*, Cassidy Laidlaw, Anurag Ranjan and Michael J. Black, “Capture, Learning and Synthesis of
3D Speaking Styles” CVPR 2019
95. 95
Chen, Lele, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. "Lip Movements Generation at a Glance." ECCV
2018.
Visual Re-dubbing (pixels)