This presentation discusses multimodal deep learning and unsupervised feature learning from audio and video speech data. It introduces the McGurk effect, in which conflicting audio and visual speech cues are perceptually integrated into a single sound, as motivation for combining the two modalities. A bimodal deep autoencoder is trained to learn a shared representation from audio and video input, and the resulting features outperform single-modality learning on lip-reading tasks. With cross-modality features, classification accuracy reaches 64.4% on the AVLetters dataset and 68.7% on the CUAVE dataset.
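To make the shared-representation idea concrete, below is a minimal sketch of a bimodal autoencoder: separate audio and video encoders feed a shared hidden layer, and decoders reconstruct both modalities from that shared code. This is an illustrative feed-forward variant, not the presentation's exact architecture; the feature dimensions, layer sizes, and training loop are assumptions for the example.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Sketch: modality-specific encoders -> shared hidden layer -> decoders
    that reconstruct both the audio and the video input."""

    def __init__(self, audio_dim=100, video_dim=1200, hidden_dim=512, shared_dim=256):
        super().__init__()
        # Modality-specific encoders (dimensions are placeholders)
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        # Shared representation computed from the concatenated modality codes
        self.shared = nn.Sequential(nn.Linear(2 * hidden_dim, shared_dim), nn.ReLU())
        # Decoders reconstruct each modality from the shared code
        self.audio_dec = nn.Sequential(nn.Linear(shared_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, audio_dim))
        self.video_dec = nn.Sequential(nn.Linear(shared_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, video_dim))

    def forward(self, audio, video):
        h = self.shared(torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=1))
        return self.audio_dec(h), self.video_dec(h)


# Toy training step: reconstruct both modalities from the shared code.
# For cross-modality use (e.g. lip reading), the shared code can be computed
# at test time with the audio input zeroed out, using video alone.
model = BimodalAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
audio = torch.randn(32, 100)   # e.g. spectrogram-style audio features (hypothetical)
video = torch.randn(32, 1200)  # e.g. flattened mouth-region frames (hypothetical)
audio_hat, video_hat = model(audio, video)
loss = nn.functional.mse_loss(audio_hat, audio) + nn.functional.mse_loss(video_hat, video)
loss.backward()
opt.step()
```

The key design point this sketch illustrates is that both decoders depend on a single shared code, so the network is pushed to capture correlations between audio and video rather than modeling each stream separately.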