This document contains slides from a lecture on deep learning for computer vision and speech. It discusses neural network architectures for speech recognition, synthesis, and generation, and covers recent work on joint audio-visual models that match speech to faces or objects and that perform speech separation using visual input. Applications include speech reconstruction from silent video, speech-to-face synthesis, and speech recognition with visual context.
11. bit.ly/DLCV2018 #DLUPC
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville,
and Yoshua Bengio. "SampleRNN: An unconditional end-to-end neural audio generation model." ICLR 2017.
SampleRNN Speech Decoder
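The slide only names the model, so as a rough illustration here is a minimal two-tier sketch of the SampleRNN idea: a frame-level RNN summarizes past samples and a small sample-level network predicts the next quantized sample. The class name, sizes, and the reduction to two tiers are assumptions, not the authors' configuration.

```python
# Minimal two-tier sketch of the SampleRNN idea (hypothetical simplification;
# the real model uses several tiers and upsamples frame context per sample).
import torch
import torch.nn as nn

class TinySampleRNN(nn.Module):
    def __init__(self, frame_size=16, hidden=256, quant_levels=256):
        super().__init__()
        self.frame_size = frame_size
        # Frame-level tier: one GRU step per frame of `frame_size` past samples.
        self.frame_rnn = nn.GRU(frame_size, hidden, batch_first=True)
        # Sample-level tier: predicts a distribution over quantized sample values
        # from the frame context plus the most recent samples.
        self.sample_mlp = nn.Sequential(
            nn.Linear(hidden + frame_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, quant_levels),
        )

    def next_sample_logits(self, last_frame, recent_samples, state=None):
        # last_frame:     (batch, 1, frame_size) most recent complete frame
        # recent_samples: (batch, frame_size)    samples right before the target
        context, state = self.frame_rnn(last_frame, state)
        logits = self.sample_mlp(torch.cat([context[:, -1], recent_samples], dim=-1))
        return logits, state  # sample the next value from softmax(logits)
```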
12.
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal
Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv
preprint arXiv:1609.03499 (2016).
WaveNet Speech Decoder
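For reference, a minimal sketch of the core WaveNet mechanism: a stack of dilated causal convolutions with gated activations and a softmax over quantized sample values. Depth and channel counts here are illustrative, not the paper's configuration.

```python
# Minimal sketch of a WaveNet-style stack of dilated causal convolutions
# (gated activations, residual connections, logits over quantized samples).
# Channel sizes and depth are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveNet(nn.Module):
    def __init__(self, channels=64, layers=8, quant_levels=256):
        super().__init__()
        self.input = nn.Conv1d(quant_levels, channels, kernel_size=1)
        self.filters, self.gates, self.residuals = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        self.dilations = [2 ** i for i in range(layers)]
        for d in self.dilations:
            self.filters.append(nn.Conv1d(channels, channels, kernel_size=2, dilation=d))
            self.gates.append(nn.Conv1d(channels, channels, kernel_size=2, dilation=d))
            self.residuals.append(nn.Conv1d(channels, channels, kernel_size=1))
        self.output = nn.Conv1d(channels, quant_levels, kernel_size=1)

    def forward(self, x_onehot):
        # x_onehot: (batch, quant_levels, time) one-hot encoded past samples
        x = self.input(x_onehot)
        for d, f, g, r in zip(self.dilations, self.filters, self.gates, self.residuals):
            pad = F.pad(x, (d, 0))                          # left padding keeps the convolution causal
            z = torch.tanh(f(pad)) * torch.sigmoid(g(pad))  # gated activation unit
            x = x + r(z)                                    # residual connection
        return self.output(x)  # logits over the next quantized sample at each time step
```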
13.
Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian
Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. "Efficient Neural Audio Synthesis."
arXiv preprint arXiv:1802.08435 (2018).
WaveRNN Speech Decoder
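A heavily simplified sketch of the WaveRNN idea: a single small recurrent layer with a dual softmax that predicts a 16-bit sample as separate coarse and fine 8-bit parts. The real model uses a custom gated cell and conditions the fine prediction on the freshly sampled coarse part; this sketch omits both, and all sizes are illustrative.

```python
# Simplified sketch of the WaveRNN idea: one recurrent layer, dual softmax over
# the coarse (high 8 bits) and fine (low 8 bits) halves of a 16-bit sample.
import torch
import torch.nn as nn

class TinyWaveRNN(nn.Module):
    def __init__(self, hidden=896):
        super().__init__()
        self.rnn = nn.GRUCell(2, hidden)          # input: previous (coarse, fine) scaled to [-1, 1]
        self.to_coarse = nn.Linear(hidden, 256)   # softmax over the high 8 bits
        self.to_fine = nn.Linear(hidden, 256)     # softmax over the low 8 bits

    def step(self, prev_coarse, prev_fine, h=None):
        # prev_coarse, prev_fine: (batch,) previous sample halves scaled to [-1, 1]
        x = torch.stack([prev_coarse, prev_fine], dim=-1)
        h = self.rnn(x, h)
        return self.to_coarse(h), self.to_fine(h), h
```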
14.
Donahue, Chris, Julian McAuley, and Miller Puckette. "Synthesizing Audio with Generative Adversarial
Networks." arXiv preprint arXiv:1802.04208 (2018).
WaveGAN Speech Decoder
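A minimal sketch of the WaveGAN generator idea: a DCGAN-style stack of one-dimensional transposed convolutions that upsamples a latent vector into a raw waveform. Layer sizes and the output length here are illustrative.

```python
# Minimal sketch of a WaveGAN-style generator: a DCGAN-like stack of 1-D
# transposed convolutions upsampling a latent vector into a raw waveform.
# Sizes are illustrative; the paper generates clips of roughly 16384 samples.
import torch
import torch.nn as nn

class TinyWaveGANGenerator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 16)   # seed a short, wide feature map
        self.net = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose1d(256, 128, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.Tanh(),                               # waveform values in [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 16)             # (batch, 256, 16)
        return self.net(x)                            # (batch, 1, 1024) raw audio samples
```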
17.
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
18.
Ephrat et al. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017.
Speech Reconstruction from Video
Pipeline: frame from a silent video → CNN (VGG) → audio feature → post-hoc synthesis.
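A rough sketch of that pipeline, assuming a VGG-16 backbone and an arbitrary acoustic feature size; turning the predicted feature sequence back into audio is a separate, post-hoc synthesis step.

```python
# Rough sketch of the slide's pipeline: a VGG-style CNN maps a frame from silent
# video to an acoustic feature vector. Feature dimensions are assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FrameToAudioFeatures(nn.Module):
    def __init__(self, n_audio_features=32):
        super().__init__()
        backbone = vgg16(weights=None)                 # VGG encoder (untrained here)
        self.features = backbone.features              # convolutional part of VGG-16
        self.regressor = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_audio_features),          # e.g. mel or LSP coefficients
        )

    def forward(self, frame):
        # frame: (batch, 3, 224, 224) RGB crop of the speaker's face
        return self.regressor(self.features(frame))    # (batch, n_audio_features)

# The predicted feature sequence is converted to a waveform by a separate
# ("post-hoc") synthesis step, e.g. a vocoder applied to the predicted spectra.
```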
19.
Speech Reconstruction from Video
24.
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
25.
Speech to Video Synthesis (vertex positions)
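A loose sketch of the idea behind audio-driven facial animation: temporal convolutions over a short window of audio features, combined with a learned emotion embedding, regress the 3-D positions of a fixed face mesh. All dimensions below are assumptions.

```python
# Loose sketch of the audio-to-vertex idea: convolutions over a window of audio
# features plus an emotion embedding regress per-vertex 3-D positions.
# All sizes are illustrative, not the authors' architecture.
import torch
import torch.nn as nn

class AudioToFaceVertices(nn.Module):
    def __init__(self, n_audio_features=32, emotion_dim=16, n_vertices=5000):
        super().__init__()
        self.conv = nn.Sequential(                     # temporal convolutions over the window
            nn.Conv1d(n_audio_features, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.out = nn.Sequential(
            nn.Linear(128 + emotion_dim, 256), nn.ReLU(),
            nn.Linear(256, n_vertices * 3),            # x, y, z for every mesh vertex
        )
        self.n_vertices = n_vertices

    def forward(self, audio_window, emotion):
        # audio_window: (batch, n_audio_features, time); emotion: (batch, emotion_dim)
        h = self.conv(audio_window)
        v = self.out(torch.cat([h, emotion], dim=-1))
        return v.view(-1, self.n_vertices, 3)          # per-frame vertex positions
```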
28.
Matching speech to images
Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual
context." NIPS 2016. [talk]
Humans learn to understand speech long before they can read text; could computers do the same?
A large dataset of 120,000 spoken descriptions of images from the Places dataset.
29.
Train a visual network and a speech network with pairs of corresponding and non-corresponding images and spoken captions.
Matching speech to images
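A sketch of that training signal: embed images and spoken captions in a shared space and push the similarity of true pairs above that of mismatched pairs by a margin. The encoders are left out; only the pairwise ranking loss is shown, and the margin value is an assumption.

```python
# Margin ranking loss over image / spoken-caption embeddings. Encoders are
# stand-ins; row i of each batch is assumed to be a true (image, speech) pair.
import torch
import torch.nn.functional as F

def ranking_loss(image_emb, speech_emb, margin=1.0):
    # image_emb, speech_emb: (batch, dim) embeddings, aligned by index
    sim = image_emb @ speech_emb.t()                          # (batch, batch) pairwise similarities
    pos = sim.diag().unsqueeze(1)                             # similarity of the true pairs
    off_diag = 1.0 - torch.eye(sim.size(0), device=sim.device)
    cost_caption = F.relu(margin + sim - pos) * off_diag      # impostor captions per image
    cost_image = F.relu(margin + sim.t() - pos) * off_diag    # impostor images per caption
    return cost_caption.mean() + cost_image.mean()
```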
30.
A similarity curve shows which regions of the spectrogram are most relevant to the image.
Important: no text transcriptions are used during training!
Matching speech to images
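A sketch of how such a similarity curve can be computed, assuming an image embedding and frame-level speech embeddings of matching dimensionality; names and shapes are illustrative.

```python
# Sketch of the similarity curve: score each time frame of the spoken caption
# against the whole-image embedding. Shapes and dimensions are assumptions.
import torch

def similarity_curve(image_emb, speech_frames):
    # image_emb:     (dim,)       embedding of the full image
    # speech_frames: (time, dim)  frame-level embeddings of the spoken caption
    image_emb = image_emb / image_emb.norm()
    speech_frames = speech_frames / speech_frames.norm(dim=-1, keepdim=True)
    return speech_frames @ image_emb          # (time,) one similarity value per frame

# Example: a 1024-dim embedding and a 200-frame caption
curve = similarity_curve(torch.randn(1024), torch.randn(200, 1024))
```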
31.
Harwath, David, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass. "Jointly
Discovering Visual Objects and Spoken Words from Raw Sensory Input." arXiv preprint arXiv:1804.01452 (2018).
Matching speech to objects (heatmap)
32.
Matching speech to objects (heatmap)
Example: image regions matching the spoken word "WOMAN" are highlighted as a heatmap.
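A sketch of the matchmap idea behind these heatmaps: dot products between every image location and every audio frame give a (height, width, time) volume, and pooling it over the frames of a word yields a spatial heatmap. Shapes and the frame range below are illustrative.

```python
# Matchmap sketch: pairwise dot products between the image's spatial feature map
# and the caption's frame-level features. All shapes below are illustrative.
import torch

def matchmap(image_feats, audio_feats):
    # image_feats: (dim, H, W)  spatial feature map from the image CNN
    # audio_feats: (dim, T)     frame-level feature map from the speech CNN
    return torch.einsum('dhw,dt->hwt', image_feats, audio_feats)

m = matchmap(torch.randn(512, 14, 14), torch.randn(512, 128))
word_heatmap = m[:, :, 40:60].max(dim=-1).values   # heatmap for the frames of one word
```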
33.
Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing Voices and Hearing Faces: Cross-modal
biometric matching." CVPR 2018. [video]
Matching speech to faces
38.
Speech Separation with Vision (faces)
Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman and
Michael Rubinstein, "Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for
Speech Separation" SIGGRAPH 2018.