Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
DEEP LEARNING WORKSHOP
Dublin City University
27-28 April 2017
Audio & Vision
D2 L9
2
Multimedia
Text
Audio
Vision
3
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
4
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
5
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Audio features help learn semantically meaningful visual representations
6
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Images are grouped into audio clusters.
The task is to predict the audio cluster for a given image (classification)
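A minimal sketch of this self-supervised setup (placeholder features and cluster count; the paper clusters ambient-audio statistics and trains a CNN on the resulting pseudo-labels):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    audio_feats = np.random.randn(10000, 64)   # placeholder ambient-audio statistics, one row per frame
    image_feats = np.random.randn(10000, 512)  # placeholder visual features for the same frames

    # Group the audio into clusters; each image inherits its frame's cluster id as a pseudo-label.
    pseudo_labels = KMeans(n_clusters=30, random_state=0).fit_predict(audio_feats)

    # Any visual model can now be trained to predict the audio cluster from the image;
    # a linear classifier stands in here for the CNN used in the paper.
    clf = LogisticRegression(max_iter=1000).fit(image_feats, pseudo_labels)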
7
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Although the model was not trained with class labels, units with
semantic meaning emerge
8
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Pretrained visual ConvNets supervise the training of a model for sound representation
9
Training videos are unlabeled; supervision comes from ConvNets pretrained on labeled images.
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
10
Training objective: Kullback-Leibler Divergence (closely related to the Cross-Entropy) between
the prob. distribution of the vision network and
the prob. distribution of the sound network
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
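A hedged sketch of the teacher-student objective behind this setup, written with assumed notation (g for the pretrained vision network, f for the sound network with parameters θ, y the video frames paired with the raw waveform x):

    D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}, \qquad H(P, Q) = -\sum_i P(i) \log Q(i)

    \mathcal{L}(\theta) = D_{KL}\big( g(y) \,\|\, f(x; \theta) \big)

Since the teacher distribution g(y) is fixed, minimizing this KL divergence is equivalent (up to the teacher's entropy) to minimizing the cross-entropy H(g(y), f(x; θ)).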
11
Recognizing Objects and Scenes from Sound
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
12
Recognizing Objects and Scenes from Sound
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
13
Hidden-layer activations of SoundNet are used to train a standard SVM
classifier that outperforms the state of the art.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
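A minimal sketch of this linear-probe evaluation, assuming precomputed SoundNet activations (all arrays below are placeholders, not the paper's data or hyperparameters):

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split

    sound_feats = np.random.randn(2000, 1024)      # placeholder pooled activations from a hidden layer
    labels = np.random.randint(0, 50, 2000)        # placeholder object/scene labels

    X_tr, X_te, y_tr, y_te = train_test_split(sound_feats, labels, test_size=0.2, random_state=0)
    svm = LinearSVC(C=1.0).fit(X_tr, y_tr)         # linear SVM on top of the frozen audio features
    print("accuracy:", svm.score(X_te, y_te))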
14
Visualization of the 1D filters over raw audio in conv1.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
15
Visualization of the 1D filters over raw audio in conv1.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
16
Visualization of the samples that most activate a neuron in a late layer (conv7)
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
17
Visualization of the video frames associated with the sounds that activate some of the
last hidden units (conv7):
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
18
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Hearing sounds that most activate a neuron in the sound network (conv7)
19
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Hearing sounds that most activate a neuron in the sound network (conv5)
20
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
21
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
Learn to synthesize sounds from videos of people hitting objects with a drumstick.
Are features learned when training for sound prediction useful to identify the object’s material?
22
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
The Greatest Hits Dataset
23
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
The Greatest Hits Dataset
24
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
● Map video sequences to sequences of sound features
● Sound features in the output are converted into a waveform by example-based synthesis (i.e. nearest neighbor retrieval); see the sketch below
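A hedged sketch of the example-based synthesis step (placeholder arrays; the exact feature space and matching used in the paper may differ):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    train_feats = np.random.randn(5000, 42)                          # placeholder sound features of training impacts
    train_waveforms = [np.random.randn(4410) for _ in range(5000)]   # placeholder waveform snippets (0.1 s at 44.1 kHz)

    # Index the training sound features once; at test time, retrieve the closest example.
    index = NearestNeighbors(n_neighbors=1).fit(train_feats)
    predicted_feat = np.random.randn(1, 42)                          # sound features predicted from the video
    _, idx = index.kneighbors(predicted_feat)
    synthesized = train_waveforms[idx[0, 0]]                         # the retrieved waveform is used as the output sound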
25
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
26
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
27
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
28
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
29
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Frame from a silent video → CNN (VGG) → Audio feature → Post-hoc synthesis
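A minimal sketch of this mapping under assumed dimensions; this is not the paper's exact VGG-based network, just a small CNN that regresses per-frame audio features from a silent-video frame:

    import torch
    import torch.nn as nn

    class FrameToAudioFeatures(nn.Module):
        def __init__(self, n_audio_feats=9):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            )
            self.regressor = nn.Linear(256, n_audio_feats)   # regression head: audio features, not class scores

        def forward(self, frame):
            return self.regressor(self.features(frame).flatten(1))

    model = FrameToAudioFeatures()
    frame = torch.randn(1, 3, 128, 128)    # placeholder crop around the speaker's face
    audio_feat = model(frame)              # predicted audio features, later turned into sound by post-hoc synthesis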
30
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Waveform processing: downsample to 8 kHz → split into 40 ms chunks (50% overlap) → LPC analysis → 9-dim LPC features per chunk
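A hedged sketch of this preprocessing using librosa (the file path and framing details are assumptions; only the 8 kHz sampling rate, 40 ms chunks with 50% overlap, and order-9 LPC come from the slide):

    import numpy as np
    import librosa

    waveform, sr = librosa.load("speech.wav", sr=8000)     # hypothetical file, downsampled to 8 kHz
    frame_len, hop = int(0.040 * sr), int(0.020 * sr)      # 40 ms chunks, 50% overlap
    frames = librosa.util.frame(waveform, frame_length=frame_len, hop_length=hop)

    lpc_feats = []
    for chunk in frames.T:
        coeffs = librosa.lpc(chunk, order=9)               # returns [1, a1, ..., a9]
        lpc_feats.append(coeffs[1:])                       # keep the 9 LPC coefficients per chunk
    lpc_feats = np.array(lpc_feats)                        # shape: (num_chunks, 9)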
31
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Waveform processing (2 LSP features / frame)
32
Speech Generation from Video
Cooke et al. An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 2006
GRID Dataset
33
Speech Generation from Video
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
34
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
35
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
Lip Reading: LipNet
36
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
Lip Reading: LipNet
Input (frames) and output (sentence) sequences are not aligned
37
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with
Recurrent Neural Networks. ICML 2006
Lip Reading: LipNet
CTC Loss: Connectionist Temporal Classification
● Avoids the need for an alignment between the input and output sequences by letting the network predict an additional blank token “_”
● To read off the output sequence, consecutive repeated tokens are merged first and blank tokens are then removed (see the sketch below)
● “a _ a b _” == “_ a a _ _ a b b” == “a a b”
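A minimal sketch of the CTC collapse rule (the token sequences are the illustrative ones above):

    def ctc_collapse(tokens, blank="_"):
        collapsed, prev = [], None
        for t in tokens:
            if t != prev:                                 # merge consecutive repeats
                collapsed.append(t)
            prev = t
        return [t for t in collapsed if t != blank]       # then remove blank tokens

    print(ctc_collapse(list("a_ab_")))      # ['a', 'a', 'b']
    print(ctc_collapse(list("_aa__abb")))   # ['a', 'a', 'b']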
38
Lip Reading: LipNet
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
39
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
40
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
41
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Audio features
Image features
42
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Attention over the output states of the audio and video encoders is computed at each decoding timestep (a minimal sketch follows).
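A minimal sketch of this dual attention under assumed dimensions and simple dot-product scoring (the paper uses a learned attention mechanism, so this is only illustrative):

    import torch

    def attend(query, states):
        # query: (batch, d) decoder state; states: (batch, time, d) encoder outputs
        scores = torch.bmm(states, query.unsqueeze(2)).squeeze(2)    # (batch, time)
        weights = torch.softmax(scores, dim=1)                       # attention weights over timesteps
        return torch.bmm(weights.unsqueeze(1), states).squeeze(1)    # context vector: (batch, d)

    batch, d = 2, 256
    decoder_state = torch.randn(batch, d)
    video_states = torch.randn(batch, 75, d)     # placeholder encoder outputs for the lip frames
    audio_states = torch.randn(batch, 120, d)    # placeholder encoder outputs for the audio stream

    # At every output character, one context per modality is computed and both are fed,
    # together with the decoder state, into the character prediction.
    fused = torch.cat([decoder_state,
                       attend(decoder_state, video_states),
                       attend(decoder_state, audio_states)], dim=1)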
43
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
44
Lip Reading: Watch, Listen, Attend & Spell
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
LipNet
45
Lip Reading: Watch, Listen, Attend & Spell
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
46
Summary
● Transfer learning
● Sonorization
● Speech generation
● Lip Reading
47
Questions?
48
Multimodal Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
49
Multimodal Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
50
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
A denoising autoencoder is required to reconstruct both modalities when given only one of them (or both) as input (a minimal sketch follows).
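A hedged sketch of the bimodal denoising idea (layer sizes and the corruption scheme are assumptions; the paper pretrains with RBMs, which this sketch skips):

    import torch
    import torch.nn as nn

    class BimodalAutoencoder(nn.Module):
        def __init__(self, d_audio=100, d_video=300, d_shared=128):
            super().__init__()
            self.enc_audio = nn.Linear(d_audio, 128)
            self.enc_video = nn.Linear(d_video, 128)
            self.shared = nn.Linear(256, d_shared)            # shared representation over both modalities
            self.dec_audio = nn.Linear(d_shared, d_audio)
            self.dec_video = nn.Linear(d_shared, d_video)

        def forward(self, audio, video):
            h = torch.relu(self.shared(torch.cat(
                [torch.relu(self.enc_audio(audio)), torch.relu(self.enc_video(video))], dim=1)))
            return self.dec_audio(h), self.dec_video(h)       # reconstruct both modalities

    model = BimodalAutoencoder()
    audio, video = torch.randn(8, 100), torch.randn(8, 300)
    # "Denoising" corruption: zero out the audio input, yet still require both reconstructions.
    rec_audio, rec_video = model(torch.zeros_like(audio), video)
    loss = nn.functional.mse_loss(rec_audio, audio) + nn.functional.mse_loss(rec_video, video)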
51
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
Video classification: add a linear SVM on top of the learned features
● Combining audio & video as inputs gives a relative improvement (on AVLetters)
● Using video as input to predict both audio & video gives state-of-the-art results in both
52
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
Audio classification: add a linear SVM on top of the learned features
● Performance of cross-modal features is below the state of the art
● Still, remarkable performance when using video as input to classify audio
53
Multimodal Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
54
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
First, learn a shared representation using the two modalities as input
55
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
Then, train linear classifiers on top of shared representations (previously trained on audio/video
data) using only one of the two modalities as input. Testing is performed with the other modality.
56
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
Train linear classifiers on top of shared representations (previously trained on audio/video data)
using only one of the two modalities as input. Testing is performed with the other modality.
