SlideShare a Scribd company logo
Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
DEEP
LEARNING
WORKSHOP
Dublin City University
27-28 April 2017
Audio & Vision
D2 L9
2
Multimedia
Text
Audio
Vision
3
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
4
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
5
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Audio features help learn semantically meaningful visual representation
6
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Images are grouped into audio clusters.
The task is to predict the audio cluster for a given image (classification)
7
Transfer Learning
Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016
Although the model was not trained with class labels, units with
semantical meaning emerge
8
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Pretrained visual ConvNets supervise the training of a model for sound representation
9
Videos for training are unlabeled. Relies on Convnets trained on labeled images.
Transfer Learning: SoundNet
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
10
Kullback-Leibler Divergence:
Cross-Entropy:
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
prob. distribution of vision network
prob. distribution of sound network
Transfer Learning: SoundNet
11
Recognizing Objects and Scenes from Sound
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
12
Recognizing Objects and Scenes from Sound
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
13
Hidden layers of Soundnet are used to train a standard SVM
classifier that outperforms state of the art.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
14
Visualization of the 1D filters over raw audio in conv1.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Transfer Learning: SoundNet
15
Visualization of the 1D filters over raw audio in conv1.
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
16
Visualize samples that mostly activate a neuron in a late layer (conv7)
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
17
Visualization of the video frames associated to the sounds that activate some of the
last hidden units (conv7):
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
18Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Hearing sounds that most activate a neuron in the sound network (conv7)
19
Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
Transfer Learning: SoundNet
Hearing sounds that most activate a neuron in the sound network (conv5)
20
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
21
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick.
Are features learned when training for sound prediction useful to identify the object’s material?
22
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
The Greatest Hits Dataset
23
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
The Greatest Hits Dataset
24
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
● Map video sequences to sequences of
sound features
● Sound features in the output are
converted into waveform by
example-based synthesis (i.e. nearest
neighbor retrieval).
25
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
26
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
27
Sonorization
Owens et al. Visually indicated sounds. CVPR 2016
28
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
29
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
CNN
(VGG)
Frame from a
silent video
Audio feature
Post-hoc
synthesis
30
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Downsample
8kHz
Split in 40ms
chunks
(.5 overlap)
LPC analysis
9-dim LPC
features
Waveform processing
31Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Waveform processing (2 LSP features/ frame)
32
Speech Generation from Video
Cooke et al. An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America,
2006
GRID Dataset
33
Speech Generation from Video
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
34
Audio & Vision
● Transfer Learning
● Sonorization
● Speech generation
● Lip Reading
35Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
Lip Reading: LipNet
36Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
Lip Reading: LipNet
Input (frames) and output (sentence) sequences are not aligned
37
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with
Recurrent Neural Networks. ICML 2006
Lip Reading: LipNet
CTC Loss: Connectionist temporal classification
● Avoiding the need for alignment between input and output sequence by predicting
an additional “_” blank word
● Before computing the loss, repeated words and blank tokens are removed
● “a _ a b _ ” == “_ a a _ _ b b” == “a a b”
38
Lip Reading: LipNet
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
39Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
40
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
41
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Audio
features
Image
features
42
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
Attention over output
states from audio and
video is computed at
each timestep
43
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
Lip Reading: Watch, Listen, Attend & Spell
44
Lip Reading: Watch, Listen, Attend & Spell
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
LipNet
45
Lip Reading: Watch, Listen, Attend & Spell
Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
46
Summary
● Transfer learning
● Sonorization
● Speech generation
● Lip Reading
47
Questions?
48
Multimodal Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
49
Multimodal Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
50
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
Denoising autoencoder required to reconstruct both modalities given only one
or both in the input
51
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
Video classification: Add Linear SVM on top of learned features
● Combination of audio&video as inputs gives relative improvement (on AVLetters)
● Using video as input to predict both audio& video gives SotA results in both
52
Cross-Modality
Ngiam et al. Multimodal Deep Learning. ICML 2011
Audio classification: Add Linear SVM on top of learned features
● Performance of cross-modal features below SotA
● Still, remarkable performance when using video as input to classify audio
53
Multimodal Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
54
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
First, learn a shared representation using the two modalities as input
55
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
Then, train linear classifiers on top of shared representations (previously trained on audio/video
data) using only one of the two modalities as input. Testing is performed with the other modality.
56
Shared Learning
Ngiam et al. Multimodal Deep Learning. ICML 2011
Train Linear classifiers on top of shared representations (previously trained on audio/video data)
using only one of the two modalities as input. Testing is performed with the other modality.

More Related Content

Similar to Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)

Automated Podcasting System for Universities
Automated Podcasting System for UniversitiesAutomated Podcasting System for Universities
Automated Podcasting System for Universities
Educational Technology
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Codiax
 
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Universitat Politècnica de Catalunya
 
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural ZooDeep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Christian Perone
 
[DL輪読会]IMPROVING VOICE SEPARATION BY INCORPORATING END-TO-END SPEECH RECOGNITION
[DL輪読会]IMPROVING VOICE SEPARATION BY INCORPORATING END-TO-END SPEECH RECOGNITION[DL輪読会]IMPROVING VOICE SEPARATION BY INCORPORATING END-TO-END SPEECH RECOGNITION
[DL輪読会]IMPROVING VOICE SEPARATION BY INCORPORATING END-TO-END SPEECH RECOGNITION
Deep Learning JP
 
Music Gesture for Visual Sound Separation
Music Gesture for Visual Sound SeparationMusic Gesture for Visual Sound Separation
Music Gesture for Visual Sound Separation
ivaderivader
 
Sander Dieleman - Generating music in the raw audio domain - Creative AI meetup
Sander Dieleman - Generating music in the raw audio domain - Creative AI meetupSander Dieleman - Generating music in the raw audio domain - Creative AI meetup
Sander Dieleman - Generating music in the raw audio domain - Creative AI meetup
Luba Elliott
 
Multimodal deep learning
Multimodal deep learningMultimodal deep learning
Multimodal deep learning
Akhter Al Amin
 
Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)
Universitat Politècnica de Catalunya
 
Ry pyconjp2015 karaoke
Ry pyconjp2015 karaokeRy pyconjp2015 karaoke
Ry pyconjp2015 karaoke
Renyuan Lyu
 
ICML Talk on deep learning for music recommendation
ICML Talk on deep learning for music recommendationICML Talk on deep learning for music recommendation
ICML Talk on deep learning for music recommendation
Aloïs Gruson
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
Universitat Politècnica de Catalunya
 
Mini Project- Audio Enhancement
Mini Project- Audio EnhancementMini Project- Audio Enhancement
Miniproject audioenhancement-100223094301-phpapp02
Miniproject audioenhancement-100223094301-phpapp02Miniproject audioenhancement-100223094301-phpapp02
Miniproject audioenhancement-100223094301-phpapp02
mohankota
 
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017 Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
MLconf
 
10 Minute Research Presentation on Ambisonics and Impact
10 Minute Research Presentation on Ambisonics and Impact10 Minute Research Presentation on Ambisonics and Impact
10 Minute Research Presentation on Ambisonics and Impact
Bruce Wiggins
 
Dolby audio ai workshop speech coding - cong zhou
Dolby audio ai workshop   speech coding - cong zhouDolby audio ai workshop   speech coding - cong zhou
Dolby audio ai workshop speech coding - cong zhou
Ankit Shah
 
Video + Language 2019
Video + Language 2019Video + Language 2019
Video + Language 2019
Goergen Institute for Data Science
 
Video + Language
Video + LanguageVideo + Language
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
Forward Gradient
 

Similar to Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017) (20)

Automated Podcasting System for Universities
Automated Podcasting System for UniversitiesAutomated Podcasting System for Universities
Automated Podcasting System for Universities
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
 
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
 
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural ZooDeep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural Zoo
 
[DL輪読会]IMPROVING VOICE SEPARATION BY INCORPORATING END-TO-END SPEECH RECOGNITION
[DL輪読会]IMPROVING VOICE SEPARATION BY INCORPORATING END-TO-END SPEECH RECOGNITION[DL輪読会]IMPROVING VOICE SEPARATION BY INCORPORATING END-TO-END SPEECH RECOGNITION
[DL輪読会]IMPROVING VOICE SEPARATION BY INCORPORATING END-TO-END SPEECH RECOGNITION
 
Music Gesture for Visual Sound Separation
Music Gesture for Visual Sound SeparationMusic Gesture for Visual Sound Separation
Music Gesture for Visual Sound Separation
 
Sander Dieleman - Generating music in the raw audio domain - Creative AI meetup
Sander Dieleman - Generating music in the raw audio domain - Creative AI meetupSander Dieleman - Generating music in the raw audio domain - Creative AI meetup
Sander Dieleman - Generating music in the raw audio domain - Creative AI meetup
 
Multimodal deep learning
Multimodal deep learningMultimodal deep learning
Multimodal deep learning
 
Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)
 
Ry pyconjp2015 karaoke
Ry pyconjp2015 karaokeRy pyconjp2015 karaoke
Ry pyconjp2015 karaoke
 
ICML Talk on deep learning for music recommendation
ICML Talk on deep learning for music recommendationICML Talk on deep learning for music recommendation
ICML Talk on deep learning for music recommendation
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
 
Mini Project- Audio Enhancement
Mini Project- Audio EnhancementMini Project- Audio Enhancement
Mini Project- Audio Enhancement
 
Miniproject audioenhancement-100223094301-phpapp02
Miniproject audioenhancement-100223094301-phpapp02Miniproject audioenhancement-100223094301-phpapp02
Miniproject audioenhancement-100223094301-phpapp02
 
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017 Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
 
10 Minute Research Presentation on Ambisonics and Impact
10 Minute Research Presentation on Ambisonics and Impact10 Minute Research Presentation on Ambisonics and Impact
10 Minute Research Presentation on Ambisonics and Impact
 
Dolby audio ai workshop speech coding - cong zhou
Dolby audio ai workshop   speech coding - cong zhouDolby audio ai workshop   speech coding - cong zhou
Dolby audio ai workshop speech coding - cong zhou
 
Video + Language 2019
Video + Language 2019Video + Language 2019
Video + Language 2019
 
Video + Language
Video + LanguageVideo + Language
Video + Language
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
 

More from Universitat Politècnica de Catalunya

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Universitat Politècnica de Catalunya
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
Universitat Politècnica de Catalunya
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
Universitat Politècnica de Catalunya
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Universitat Politècnica de Catalunya
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
Universitat Politècnica de Catalunya
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Universitat Politècnica de Catalunya
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
Universitat Politècnica de Catalunya
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Universitat Politècnica de Catalunya
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Universitat Politècnica de Catalunya
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
Universitat Politècnica de Catalunya
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Universitat Politècnica de Catalunya
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Universitat Politècnica de Catalunya
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Universitat Politècnica de Catalunya
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Universitat Politècnica de Catalunya
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
Universitat Politècnica de Catalunya
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Universitat Politècnica de Catalunya
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Universitat Politècnica de Catalunya
 

More from Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 

Recently uploaded

A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
sameer shah
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
a9qfiubqu
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
bmucuha
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
facilitymanager11
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 

Recently uploaded (20)

A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 

Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)

  • 1. Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya DEEP LEARNING WORKSHOP Dublin City University 27-28 April 2017 Audio & Vision D2 L9
  • 3. 3 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 4. 4 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 5. 5 Transfer Learning Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016 Audio features help learn semantically meaningful visual representation
  • 6. 6 Transfer Learning Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016 Images are grouped into audio clusters. The task is to predict the audio cluster for a given image (classification)
  • 7. 7 Transfer Learning Owens et al. Ambient Sound Provides Supervision for Visual Learning. ECCV 2016 Although the model was not trained with class labels, units with semantical meaning emerge
  • 8. 8 Transfer Learning: SoundNet Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Pretrained visual ConvNets supervise the training of a model for sound representation
  • 9. 9 Videos for training are unlabeled. Relies on Convnets trained on labeled images. Transfer Learning: SoundNet Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016
  • 10. 10 Kullback-Leibler Divergence: Cross-Entropy: Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 prob. distribution of vision network prob. distribution of sound network Transfer Learning: SoundNet
  • 11. 11 Recognizing Objects and Scenes from Sound Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 12. 12 Recognizing Objects and Scenes from Sound Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 13. 13 Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art. Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 14. 14 Visualization of the 1D filters over raw audio in conv1. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Transfer Learning: SoundNet
  • 15. 15 Visualization of the 1D filters over raw audio in conv1. Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 16. 16 Visualize samples that mostly activate a neuron in a late layer (conv7) Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 17. 17 Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7): Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet
  • 18. 18Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet Hearing sounds that most activate a neuron in the sound network (conv7)
  • 19. 19 Aytar, Vondrick and Torralba. Soundnet: Learning sound representations from unlabeled video. NIPS 2016 Transfer Learning: SoundNet Hearing sounds that most activate a neuron in the sound network (conv5)
  • 20. 20 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 21. 21 Sonorization Owens et al. Visually indicated sounds. CVPR 2016 Learn synthesized sounds from videos of people hitting objects with a drumstick. Are features learned when training for sound prediction useful to identify the object’s material?
  • 22. 22 Sonorization Owens et al. Visually indicated sounds. CVPR 2016 The Greatest Hits Dataset
  • 23. 23 Sonorization Owens et al. Visually indicated sounds. CVPR 2016 The Greatest Hits Dataset
  • 24. 24 Sonorization Owens et al. Visually indicated sounds. CVPR 2016 ● Map video sequences to sequences of sound features ● Sound features in the output are converted into waveform by example-based synthesis (i.e. nearest neighbor retrieval).
  • 25. 25 Sonorization Owens et al. Visually indicated sounds. CVPR 2016
  • 26. 26 Sonorization Owens et al. Visually indicated sounds. CVPR 2016
  • 27. 27 Sonorization Owens et al. Visually indicated sounds. CVPR 2016
  • 28. 28 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 29. 29 Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017 Speech Generation from Video CNN (VGG) Frame from a silent video Audio feature Post-hoc synthesis
  • 30. 30 Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017 Speech Generation from Video Downsample 8kHz Split in 40ms chunks (.5 overlap) LPC analysis 9-dim LPC features Waveform processing
  • 31. 31Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017 Speech Generation from Video Waveform processing (2 LSP features/ frame)
  • 32. 32 Speech Generation from Video Cooke et al. An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 2006 GRID Dataset
  • 33. 33 Speech Generation from Video Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
  • 34. 34 Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  • 35. 35Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016 Lip Reading: LipNet
  • 36. 36Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016 Lip Reading: LipNet Input (frames) and output (sentence) sequences are not aligned
  • 37. 37 Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006 Lip Reading: LipNet CTC Loss: Connectionist temporal classification ● Avoiding the need for alignment between input and output sequence by predicting an additional “_” blank word ● Before computing the loss, repeated words and blank tokens are removed ● “a _ a b _ ” == “_ a a _ _ b b” == “a a b”
  • 38. 38 Lip Reading: LipNet Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016
  • 39. 39Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell
  • 40. 40 Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell
  • 41. 41 Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell Audio features Image features
  • 42. 42 Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell Attention over output states from audio and video is computed at each timestep
  • 43. 43 Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 Lip Reading: Watch, Listen, Attend & Spell
  • 44. 44 Lip Reading: Watch, Listen, Attend & Spell Chung et al. Lip reading sentences in the wild. arXiv Nov 2016 LipNet
  • 45. 45 Lip Reading: Watch, Listen, Attend & Spell Chung et al. Lip reading sentences in the wild. arXiv Nov 2016
  • 46. 46 Summary ● Transfer learning ● Sonorization ● Speech generation ● Lip Reading
  • 48. 48 Multimodal Learning Ngiam et al. Multimodal Deep Learning. ICML 2011
  • 49. 49 Multimodal Learning Ngiam et al. Multimodal Deep Learning. ICML 2011
  • 50. 50 Cross-Modality Ngiam et al. Multimodal Deep Learning. ICML 2011 Denoising autoencoder required to reconstruct both modalities given only one or both in the input
  • 51. 51 Cross-Modality Ngiam et al. Multimodal Deep Learning. ICML 2011 Video classification: Add Linear SVM on top of learned features ● Combination of audio&video as inputs gives relative improvement (on AVLetters) ● Using video as input to predict both audio& video gives SotA results in both
  • 52. 52 Cross-Modality Ngiam et al. Multimodal Deep Learning. ICML 2011 Audio classification: Add Linear SVM on top of learned features ● Performance of cross-modal features below SotA ● Still, remarkable performance when using video as input to classify audio
  • 53. 53 Multimodal Learning Ngiam et al. Multimodal Deep Learning. ICML 2011
  • 54. 54 Shared Learning Ngiam et al. Multimodal Deep Learning. ICML 2011 First, learn a shared representation using the two modalities as input
  • 55. 55 Shared Learning Ngiam et al. Multimodal Deep Learning. ICML 2011 Then, train linear classifiers on top of shared representations (previously trained on audio/video data) using only one of the two modalities as input. Testing is performed with the other modality.
  • 56. 56 Shared Learning Ngiam et al. Multimodal Deep Learning. ICML 2011 Train Linear classifiers on top of shared representations (previously trained on audio/video data) using only one of the two modalities as input. Testing is performed with the other modality.