SlideShare a Scribd company logo
1 of 100
Download to read offline
Module 6 - Day 10 - Lecture 2
Self-supervised
Audiovisual Learning
26th March 2020
Xavier Giró-i-Nieto
Deep Learning for Video lectures:
[https://mcv-m6-video.github.io/deepvideo-2020/]
@DocXavi
2
Acknowledgements
2
Santiago Pascual de la Puente
santi.pascual@upc.edu
Universitat Politècnica de Catalunya
Technical University of Catalonia
Video lecture
3
Lecture in Master in Computer Vision Barcelona 2019
4
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
Audio
5
Speech
Vision
Audio
6
Speech
Vision
Audio
7
Speech
Vision
8
9
Encoder
0
1
0
Cat
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012
10Slide concept: Perronin, F., Tutorial on LSVR @ CVPR’14, Output embedding for LSVR
One-hot Representation
[1,0,0]
[0,1,0]
[0,0,1]
11
Encoder
Representation
12
Encoder
Representation
13
Decoder
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative
adversarial networks." ICLR 2016. #DCGAN
0
1
0
Cat
Fig: Xudong Mao #DCGAN
14
Encoder Decoder
Representation
15
^
y
loss
Representations learned without labels
Self-supervised learning is a form of unsupervised learning where the data
provides the supervision.
● By defining a proxy loss, the NN learns representations, which should be
valuable for the actually target task.
Self-supervised Learning
16
Vision
Audio
Speech
Video
Synchronization among modalities
captured by video is exploited in a
self-supervised manner.
Self-supervised Learning
17
Encoder Decoder
Representation
Learn
Self-supervised Learning
18
Outline
1. Motivation
2. Feature Learning
a. Audio Features
b. Joint Feature Learning
c. Matching Networks
3. Cross-modal Translation
19
Outline
1. Motivation
2. Feature Learning
a. Audio Features
b. Joint Feature Learning
c. Matching Networks
3. Cross-modal Translation
20
Encoder Encoder
Representation
21
Audio Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
22
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Use videos to train a CNN that predicts the audio statistics of a frame.
Audio Feature Learning
23
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Task: Use the predicted audio stats to clusters images. Audio clusters built with
K-means algorithm over the training set
Cluster assignments at test time (one row=one cluster)
Audio Feature Learning
24
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Although the CNN was not trained with class labels, local units with semantic
meaning emerge.
Audio Feature Learning
25
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Video sonorization: Retrieve matching sounds for videos of people hitting objects
with a drumstick.
Audio Feature Learning (self-supervised)
26
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
The Greatest Hits Dataset
Audio Feature Learning (self-supervised)
27
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Audio Clip
Retrieval
Not end-to-end
Audio Feature Learning (self-supervised)
28
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
29
Speaker ID Feature Learning (disentangled)
Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled Speech Embeddings using Cross-modal
Self-supervision." ICASSP 2020.
30
Speaker ID Feature Learning (disentangled)
Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled Speech Embeddings using Cross-modal
Self-supervision." ICASSP 2020.
31
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Teacher network: Visual Recognition (object & scenes)
Audio Feature Learning (label transfer)
32
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS
2016.
33
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
34
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
35
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
36
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
37
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualize video frames that mostly activate a neuron in a late layer (conv7)
Audio Feature Learning (label transfer)
38
#SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Visualize video frames that mostly activate a neuron in a late layer (conv7)
Audio Feature Learning (label transfer)
39
S Albanie, A Nagrani, A Vedaldi, A Zisserman, “Emotion Recognition in Speech using Cross-Modal Transfer in the Wild”
ACM Multimedia 2018.
Teacher network: Facial Emotion Recognition (visual)
Audio Feature Learning (label transfer)
40
Outline
1. Motivation
2. Feature Learning
a. Audio Features
b. Joint Feature Learning
c. Matching Networks
3. Cross-modal Translation
41
Encoder Encoder
Representation
Joint Feature Learning
42
Joint Feature Learning
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video
and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
43
Joint Feature Learning
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video
and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
Best
match
Audio feature
44
Best
match
Visual feature Audio feature
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video
and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
Joint Feature Learning
45
Joint Feature Learning
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
46
Joint Feature Learning
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
47
Joint Feature Learning
#AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from
Self-Supervised Synchronization." NIPS 2018.
48
Outline
1. Motivation
2. Feature Learning
a. Audio Features
b. Joint Feature Learning
c. Matching Networks
3. Cross-modal Translation
49Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Matching Networks
50Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Most activated unit in pool4 layer of the visual network.
Matching Networks
51Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Visual features used to train a linear classifier on ImageNet.
Matching Networks
52Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Most activated unit in pool4 layer of the audio network
Matching Networks
53Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Audio features achieve state of the art performance.
Matching Networks
54Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
Matching Networks (sound localization)
55Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
56
Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal
biometric matching." CVPR 2018.
Matching Networks
57
Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal
biometric matching." CVPR 2018.
58
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Skeleton animation
c. Face Hallucination
d. Sound separation
e. Visual re-dubbing (“deep fakes”)
59
DecoderEncoder
Representation
60
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
61
Speech Reconstruction
Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: speech reconstruction from silent video." ICASSP 2017.
CNN
(VGG)
Frame from a
silent video
Audio feature
Post-hoc
synthesis
62
Speech Reconstruction
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV 2017
Workshop on Computer Vision for Audio-Visual Media. 2017.
bit.ly/DLCV2018
#DLUPC
63
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV Workshop on
Computer Vision for Audio-Visual Media. 2017.
64
Encoder Decoder
Representation
65
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
66
Face Hallucination from Low-res
Liu, C., Shum, H. Y., & Zhang, C. S. A two-step approach to hallucinating faces: global parametric model and local nonparametric
model. CVPR 2001.
67
Face Hallucination from Speech
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
68
Generated faces from known identities.
Face Hallucination from Speech
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
69
Faces from average speeches.
Face Hallucination from Speech
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
70
Interpolated faces from interpolated speeches.
#Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva
Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
Face Hallucination from Speech
71
Face Hallucination from Speech
#Speech2Face Oh, Tae-Hyun, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech
Matusik. "Speech2face: Learning the face behind a voice." CVPR 2019.
72
#Speech2Face Oh, Tae-Hyun, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech
Matusik. "Speech2face: Learning the face behind a voice." CVPR 2019.
Face Hallucination from Speech
73
Encoder Decoder
Representation
74
Avatar animation with music (skeletons)
Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (2018). Audio to body dynamics. CVPR 2018.
75Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. Audio to body dynamics. CVPR 2018.
76
Sound Source Localization
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual
Scenes." CVPR 2018.
77
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual Scenes."
CVPR 2018.
78
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
79
Encoder
Decoder
Representation
Encoder Representation
80
Speech Separation with Vision (lips)
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement."
Interspeech 2018.
bit.ly/DLCV2018
#DLUPC
81
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech
Enhancement." Interspeech 2018.
82
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
a. Speech Reconstruction
b. Face Hallucination
c. Sound separation
d. Visual re-dubbing (“deep fakes”)
83
Encoder
Decoder
Representation
Encoder Representation
84
Visual re-dubbing (skeletons)
#Speech2Gesture Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik, “Learning Individual
Styles of Conversational Gesture” CVPR 2019.
85
Visual re-dubbing (skeletons)
#Speech2Gesture Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik, “Learning Individual
Styles of Conversational Gesture” CVPR 2019.
bit.ly/DLCV2018
#DLUPC
86
Daniel Cudeiro*, Timo Bolkart*, Cassidy Laidlaw, Anurag Ranjan and Michael J. Black, “Capture, Learning and Synthesis of
3D Speaking Styles” CVPR 2019
87
Visual Re-dubbing (lip keypoints)
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
bit.ly/DLCV2018
#DLUPC
88
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
89
Visual Re-dubbing (3D meshes)
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end
learning of pose and emotion." SIGGRAPH 2017
bit.ly/DLCV2018
#DLUPC
90
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
91
Visual Re-dubbing (3D meshes)
Daniel Cudeiro*, Timo Bolkart*, Cassidy Laidlaw, Anurag Ranjan and Michael J. Black, “Capture, Learning and Synthesis of
3D Speaking Styles” CVPR 2019
bit.ly/DLCV2018
#DLUPC
92
Daniel Cudeiro*, Timo Bolkart*, Cassidy Laidlaw, Anurag Ranjan and Michael J. Black, “Capture, Learning and Synthesis of
3D Speaking Styles” CVPR 2019
93
Visual Re-dubbing (pixels)
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017. #speech2vid
bit.ly/DLCV2018
#DLUPC
94Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017. #speech2vid
95
Chen, Lele, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. "Lip Movements Generation at a Glance." ECCV
2018.
Visual Re-dubbing (pixels)
bit.ly/DLCV2018
#DLUPC
96Chen, Lele, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. "Lip Movements Generation at a Glance." ECCV 2018.
97
Visual Re-dubbing (pixels)
Vougioukas, Konstantinos, Stavros Petridis, and Maja Pantic. "End-to-End Speech-Driven Facial Animation with Temporal
GANs." BMVC 2018.
Adversarial losses at frame (spatial) & sequence (temporal) scales.
bit.ly/DLCV2018
#DLUPC
98
Vougioukas, Konstantinos, Stavros Petridis, and Maja Pantic. "End-to-End Speech-Driven Facial Animation with Temporal
GANs." BMVC 2018
99
Outline
1. Motivation
2. Feature Learning
3. Cross-modal Translation
100
Questions ?

More Related Content

What's hot

Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaUniversitat Politècnica de Catalunya
 
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Universitat Politècnica de Catalunya
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019Universitat Politècnica de Catalunya
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Universitat Politècnica de Catalunya
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Universitat Politècnica de Catalunya
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Universitat Politècnica de Catalunya
 

What's hot (20)

Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
 
Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)
 
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
 
One Perceptron to Rule Them All: Language and Vision
One Perceptron to Rule Them All: Language and VisionOne Perceptron to Rule Them All: Language and Vision
One Perceptron to Rule Them All: Language and Vision
 
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
 
Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
 
Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)
 
Deep Learning for Video: Object Tracking (UPC 2018)
Deep Learning for Video: Object Tracking (UPC 2018)Deep Learning for Video: Object Tracking (UPC 2018)
Deep Learning for Video: Object Tracking (UPC 2018)
 
Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018
Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018
Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018
 
Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
 
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
 
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
 
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
 
Multimodal Deep Learning
Multimodal Deep LearningMultimodal Deep Learning
Multimodal Deep Learning
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
 
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
 

Similar to Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC TelecomBCN Barcelona

Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)Universitat Politècnica de Catalunya
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)Universitat Politècnica de Catalunya
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018Universitat Politècnica de Catalunya
 
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Universitat Politècnica de Catalunya
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
Identification and Authentication Using Clavicles
Identification and Authentication Using ClaviclesIdentification and Authentication Using Clavicles
Identification and Authentication Using Claviclessugiuralab
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoUniversitat Politècnica de Catalunya
 
Modeling perceptual similarity and shift invariance in deep networks
Modeling perceptual similarity and shift invariance in deep networksModeling perceptual similarity and shift invariance in deep networks
Modeling perceptual similarity and shift invariance in deep networksNAVER Engineering
 
International Perspectives: Visualization in Science and Education
International Perspectives: Visualization in Science and EducationInternational Perspectives: Visualization in Science and Education
International Perspectives: Visualization in Science and EducationLiz Dorland
 
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...maranlar
 
Towards User-friendly Audio Creation
Towards User-friendly Audio CreationTowards User-friendly Audio Creation
Towards User-friendly Audio CreationJean Vanderdonckt
 
Content Modelling for Human Action Detection via Multidimensional Approach
Content Modelling for Human Action Detection via Multidimensional ApproachContent Modelling for Human Action Detection via Multidimensional Approach
Content Modelling for Human Action Detection via Multidimensional ApproachCSCJournals
 
Predicting Media Memorability with Audio, Video, and Text representations
Predicting Media Memorability with Audio, Video, and Text representationsPredicting Media Memorability with Audio, Video, and Text representations
Predicting Media Memorability with Audio, Video, and Text representationsAlison Reboud
 

Similar to Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC TelecomBCN Barcelona (20)

Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
 
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
One Perceptron to Rule Them All (Re-Work Deep Learning Summit, London 2017)
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
 
Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)
Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)
Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)
 
Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)
Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)
Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)
 
Once Perceptron to Rule Them all: Deep Learning for Multimedia
Once Perceptron to Rule Them all: Deep Learning for MultimediaOnce Perceptron to Rule Them all: Deep Learning for Multimedia
Once Perceptron to Rule Them all: Deep Learning for Multimedia
 
Deep Language and Vision - Xavier Giro-i-Nieto - UPC Barcelona 2018
Deep Language and Vision - Xavier Giro-i-Nieto - UPC Barcelona 2018Deep Language and Vision - Xavier Giro-i-Nieto - UPC Barcelona 2018
Deep Language and Vision - Xavier Giro-i-Nieto - UPC Barcelona 2018
 
Deep Learning Summit (DLS01-7)
Deep Learning Summit (DLS01-7)Deep Learning Summit (DLS01-7)
Deep Learning Summit (DLS01-7)
 
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
 
Identification and Authentication Using Clavicles
Identification and Authentication Using ClaviclesIdentification and Authentication Using Clavicles
Identification and Authentication Using Clavicles
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
Modeling perceptual similarity and shift invariance in deep networks
Modeling perceptual similarity and shift invariance in deep networksModeling perceptual similarity and shift invariance in deep networks
Modeling perceptual similarity and shift invariance in deep networks
 
Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)
Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)
Deep and Young Vision Learning at UPC BarcelonaTech (NIPS 2016)
 
International Perspectives: Visualization in Science and Education
International Perspectives: Visualization in Science and EducationInternational Perspectives: Visualization in Science and Education
International Perspectives: Visualization in Science and Education
 
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
Multimedia Information Retrieval: Bytes and pixels meet the challenges of hum...
 
Towards User-friendly Audio Creation
Towards User-friendly Audio CreationTowards User-friendly Audio Creation
Towards User-friendly Audio Creation
 
Content Modelling for Human Action Detection via Multidimensional Approach
Content Modelling for Human Action Detection via Multidimensional ApproachContent Modelling for Human Action Detection via Multidimensional Approach
Content Modelling for Human Action Detection via Multidimensional Approach
 
Predicting Media Memorability with Audio, Video, and Text representations
Predicting Media Memorability with Audio, Video, and Text representationsPredicting Media Memorability with Audio, Video, and Text representations
Predicting Media Memorability with Audio, Video, and Text representations
 

More from Universitat Politècnica de Catalunya

The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...Universitat Politècnica de Catalunya
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosUniversitat Politècnica de Catalunya
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Universitat Politècnica de Catalunya
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Universitat Politècnica de Catalunya
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Universitat Politècnica de Catalunya
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Universitat Politècnica de Catalunya
 
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 

More from Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
 
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
 

Recently uploaded

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 

Recently uploaded (20)

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 

Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC TelecomBCN Barcelona

  • 1. Module 6 - Day 10 - Lecture 2 Self-supervised Audiovisual Learning 26th March 2020 Xavier Giró-i-Nieto Deep Learning for Video lectures: [https://mcv-m6-video.github.io/deepvideo-2020/] @DocXavi
  • 2. 2 Acknowledgements 2 Santiago Pascual de la Puente santi.pascual@upc.edu Universitat Politècnica de Catalunya Technical University of Catalonia
  • 3. Video lecture 3 Lecture in Master in Computer Vision Barcelona 2019
  • 4. 4 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation
  • 8. 8
  • 9. 9 Encoder 0 1 0 Cat A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012
  • 10. 10Slide concept: Perronin, F., Tutorial on LSVR @ CVPR’14, Output embedding for LSVR One-hot Representation [1,0,0] [0,1,0] [0,0,1]
  • 13. 13 Decoder Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." ICLR 2016. #DCGAN 0 1 0 Cat Fig: Xudong Mao #DCGAN
  • 15. 15 ^ y loss Representations learned without labels Self-supervised learning is a form of unsupervised learning where the data provides the supervision. ● By defining a proxy loss, the NN learns representations, which should be valuable for the actually target task. Self-supervised Learning
  • 16. 16 Vision Audio Speech Video Synchronization among modalities captured by video is exploited in a self-supervised manner. Self-supervised Learning
  • 18. 18 Outline 1. Motivation 2. Feature Learning a. Audio Features b. Joint Feature Learning c. Matching Networks 3. Cross-modal Translation
  • 19. 19 Outline 1. Motivation 2. Feature Learning a. Audio Features b. Joint Feature Learning c. Matching Networks 3. Cross-modal Translation
  • 21. 21 Audio Feature Learning Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Based on the assumption that ambient sound in video is related to the visual semantics.
  • 22. 22 Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Use videos to train a CNN that predicts the audio statistics of a frame. Audio Feature Learning
  • 23. 23 Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Task: Use the predicted audio stats to clusters images. Audio clusters built with K-means algorithm over the training set Cluster assignments at test time (one row=one cluster) Audio Feature Learning
  • 24. 24 Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Although the CNN was not trained with class labels, local units with semantic meaning emerge. Audio Feature Learning
  • 25. 25 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Video sonorization: Retrieve matching sounds for videos of people hitting objects with a drumstick. Audio Feature Learning (self-supervised)
  • 26. 26 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. The Greatest Hits Dataset Audio Feature Learning (self-supervised)
  • 27. 27 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Audio Clip Retrieval Not end-to-end Audio Feature Learning (self-supervised)
  • 28. 28 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
  • 29. 29 Speaker ID Feature Learning (disentangled) Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled Speech Embeddings using Cross-modal Self-supervision." ICASSP 2020.
  • 30. 30 Speaker ID Feature Learning (disentangled) Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled Speech Embeddings using Cross-modal Self-supervision." ICASSP 2020.
  • 31. 31 #SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Teacher network: Visual Recognition (object & scenes) Audio Feature Learning (label transfer)
  • 32. 32 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016.
  • 33. 33 #SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Learned audio features are good for environmental sound recognition. Audio Feature Learning (label transfer)
  • 34. 34 #SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Learned audio features are good for environmental sound recognition. Audio Feature Learning (label transfer)
  • 35. 35 #SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio Feature Learning (label transfer)
  • 36. 36 #SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio Feature Learning (label transfer)
  • 37. 37 #SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualize video frames that mostly activate a neuron in a late layer (conv7) Audio Feature Learning (label transfer)
  • 38. 38 #SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualize video frames that mostly activate a neuron in a late layer (conv7) Audio Feature Learning (label transfer)
  • 39. 39 S Albanie, A Nagrani, A Vedaldi, A Zisserman, “Emotion Recognition in Speech using Cross-Modal Transfer in the Wild” ACM Multimedia 2018. Teacher network: Facial Emotion Recognition (visual) Audio Feature Learning (label transfer)
  • 40. 40 Outline 1. Motivation 2. Feature Learning a. Audio Features b. Joint Feature Learning c. Matching Networks 3. Cross-modal Translation
  • 42. 42 Joint Feature Learning Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
  • 43. 43 Joint Feature Learning Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018. Best match Audio feature
  • 44. 44 Best match Visual feature Audio feature Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018. Joint Feature Learning
  • 45. 45 Joint Feature Learning #AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization." NIPS 2018.
  • 46. 46 Joint Feature Learning #AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization." NIPS 2018.
  • 47. 47 Joint Feature Learning #AVTS Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization." NIPS 2018.
  • 48. 48 Outline 1. Motivation 2. Feature Learning a. Audio Features b. Joint Feature Learning c. Matching Networks 3. Cross-modal Translation
  • 49. 49Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. Matching Networks
  • 50. 50Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. Most activated unit in pool4 layer of the visual network. Matching Networks
  • 51. 51Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. Visual features used to train a linear classifier on ImageNet. Matching Networks
  • 52. 52Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. Most activated unit in pool4 layer of the audio network Matching Networks
  • 53. 53Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. Audio features achieve state of the art performance. Matching Networks
  • 54. 54Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018. Matching Networks (sound localization)
  • 55. 55Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
  • 56. 56 Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal biometric matching." CVPR 2018. Matching Networks
  • 57. 57 Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal biometric matching." CVPR 2018.
  • 58. 58 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Speech Reconstruction b. Skeleton animation c. Face Hallucination d. Sound separation e. Visual re-dubbing (“deep fakes”)
  • 60. 60 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Speech Reconstruction b. Face Hallucination c. Sound separation d. Visual re-dubbing (“deep fakes”)
  • 61. 61 Speech Reconstruction Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: speech reconstruction from silent video." ICASSP 2017. CNN (VGG) Frame from a silent video Audio feature Post-hoc synthesis
  • 62. 62 Speech Reconstruction Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
  • 63. bit.ly/DLCV2018 #DLUPC 63 Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV Workshop on Computer Vision for Audio-Visual Media. 2017.
  • 65. 65 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Speech Reconstruction b. Face Hallucination c. Sound separation d. Visual re-dubbing (“deep fakes”)
  • 66. 66 Face Hallucination from Low-res Liu, C., Shum, H. Y., & Zhang, C. S. A two-step approach to hallucinating faces: global parametric model and local nonparametric model. CVPR 2001.
  • 67. 67 Face Hallucination from Speech #Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
  • 68. 68 Generated faces from known identities. Face Hallucination from Speech #Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
  • 69. 69 Faces from average speeches. Face Hallucination from Speech #Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
  • 70. 70 Interpolated faces from interpolated speeches. #Wav2Pix Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al. “Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019 Face Hallucination from Speech
  • 71. 71 Face Hallucination from Speech #Speech2Face Oh, Tae-Hyun, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech Matusik. "Speech2face: Learning the face behind a voice." CVPR 2019.
  • 72. 72 #Speech2Face Oh, Tae-Hyun, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech Matusik. "Speech2face: Learning the face behind a voice." CVPR 2019. Face Hallucination from Speech
  • 74. 74 Avatar animation with music (skeletons) Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (2018). Audio to body dynamics. CVPR 2018.
  • 75. 75Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. Audio to body dynamics. CVPR 2018.
  • 76. 76 Sound Source Localization Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual Scenes." CVPR 2018.
  • 77. 77 Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual Scenes." CVPR 2018.
  • 78. 78 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Speech Reconstruction b. Face Hallucination c. Sound separation d. Visual re-dubbing (“deep fakes”)
  • 80. 80 Speech Separation with Vision (lips) Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement." Interspeech 2018.
  • 81. bit.ly/DLCV2018 #DLUPC 81 Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement." Interspeech 2018.
  • 82. 82 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation a. Speech Reconstruction b. Face Hallucination c. Sound separation d. Visual re-dubbing (“deep fakes”)
  • 84. 84 Visual re-dubbing (skeletons) #Speech2Gesture Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik, “Learning Individual Styles of Conversational Gesture” CVPR 2019.
  • 85. 85 Visual re-dubbing (skeletons) #Speech2Gesture Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik, “Learning Individual Styles of Conversational Gesture” CVPR 2019.
  • 86. bit.ly/DLCV2018 #DLUPC 86 Daniel Cudeiro*, Timo Bolkart*, Cassidy Laidlaw, Anurag Ranjan and Michael J. Black, “Capture, Learning and Synthesis of 3D Speaking Styles” CVPR 2019
  • 87. 87 Visual Re-dubbing (lip keypoints) Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from audio." SIGGRAPH 2017.
  • 88. bit.ly/DLCV2018 #DLUPC 88 Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end learning of pose and emotion." SIGGRAPH 2017
  • 89. 89 Visual Re-dubbing (3D meshes) Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end learning of pose and emotion." SIGGRAPH 2017
  • 90. bit.ly/DLCV2018 #DLUPC 90 Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end learning of pose and emotion." SIGGRAPH 2017
  • 91. 91 Visual Re-dubbing (3D meshes) Daniel Cudeiro*, Timo Bolkart*, Cassidy Laidlaw, Anurag Ranjan and Michael J. Black, “Capture, Learning and Synthesis of 3D Speaking Styles” CVPR 2019
  • 92. bit.ly/DLCV2018 #DLUPC 92 Daniel Cudeiro*, Timo Bolkart*, Cassidy Laidlaw, Anurag Ranjan and Michael J. Black, “Capture, Learning and Synthesis of 3D Speaking Styles” CVPR 2019
  • 93. 93 Visual Re-dubbing (pixels) Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017. #speech2vid
  • 94. bit.ly/DLCV2018 #DLUPC 94Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017. #speech2vid
  • 95. 95 Chen, Lele, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. "Lip Movements Generation at a Glance." ECCV 2018. Visual Re-dubbing (pixels)
  • 96. bit.ly/DLCV2018 #DLUPC 96Chen, Lele, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. "Lip Movements Generation at a Glance." ECCV 2018.
  • 97. 97 Visual Re-dubbing (pixels) Vougioukas, Konstantinos, Stavros Petridis, and Maja Pantic. "End-to-End Speech-Driven Facial Animation with Temporal GANs." BMVC 2018. Adversarial losses at frame (spatial) & sequence (temporal) scales.
  • 98. bit.ly/DLCV2018 #DLUPC 98 Vougioukas, Konstantinos, Stavros Petridis, and Maja Pantic. "End-to-End Speech-Driven Facial Animation with Temporal GANs." BMVC 2018
  • 99. 99 Outline 1. Motivation 2. Feature Learning 3. Cross-modal Translation