Multimodal deep learning is a technique for learning feature representations from multiple modalities at once. Earlier neural networks analyzed a single source of input, such as audio, video, or text alone; a multimodal model combines several input sources and produces better-quality predictions.
4. Representing Lips
• Can we learn better representations for audio/visual speech recognition?
• How can multimodal data (multiple sources of input) be used to find better features?
14. Bimodal Autoencoder
[Diagram: audio input and video input are encoded into a single hidden representation, which is decoded into an audio reconstruction and a video reconstruction.]
Adapted from: MIT 191
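A minimal sketch of the bimodal autoencoder in the diagram above, written in PyTorch; the layer sizes, sigmoid activations, and the simple concatenation of the two inputs are illustrative assumptions, not the exact configuration from the slides.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Audio + video in, one hidden representation, both modalities reconstructed."""
    def __init__(self, audio_dim=100, video_dim=300, hidden_dim=512):
        super().__init__()
        # Both modalities feed a single shared hidden layer.
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden_dim),
            nn.Sigmoid(),
        )
        # Separate decoders reconstruct each modality from the hidden code.
        self.audio_decoder = nn.Linear(hidden_dim, audio_dim)
        self.video_decoder = nn.Linear(hidden_dim, video_dim)

    def forward(self, audio, video):
        h = self.encoder(torch.cat([audio, video], dim=1))
        return self.audio_decoder(h), self.video_decoder(h)
```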
15. Bimodal Autoencoder
[Diagram: video input alone is encoded into the hidden representation, which still produces both an audio reconstruction and a video reconstruction.]
Cross-modality learning: learn better video features by using audio as a cue.
Adapted from: MIT 191
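A sketch of the cross-modality objective described above: the model sees only video at the input (here the audio input is simply zeroed, an assumption in the spirit of the slide), yet it must still reconstruct both modalities, so the video features have to carry audio-relevant information. `model` is the illustrative BimodalAutoencoder from the previous sketch.

```python
import torch
import torch.nn.functional as F

def cross_modality_loss(model, audio, video):
    # Hide the audio input; the model sees video only.
    zero_audio = torch.zeros_like(audio)
    audio_hat, video_hat = model(zero_audio, video)
    # Reconstruction targets are the true audio and video signals.
    return F.mse_loss(audio_hat, audio) + F.mse_loss(video_hat, video)
```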
21. Bimodal Deep Autoencoders
[Diagram: audio input and video input pass through modality-specific layers into a shared representation, which is decoded back into an audio reconstruction and a video reconstruction. The video pathway captures “visemes” (mouth shapes) and the audio pathway captures “phonemes”.]
Adapted from: MIT 191
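A sketch of the deeper variant: each modality gets its own hidden layer before a shared representation, and the shared code is decoded back through modality-specific layers. Layer counts, sizes, and activations are assumptions for illustration; any layer-wise pretraining is omitted.

```python
import torch
import torch.nn as nn

class BimodalDeepAutoencoder(nn.Module):
    """Modality-specific layers feed a shared representation, decoded per modality."""
    def __init__(self, audio_dim=100, video_dim=300, mod_dim=256, shared_dim=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, mod_dim), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, mod_dim), nn.Sigmoid())
        self.shared_enc = nn.Sequential(nn.Linear(2 * mod_dim, shared_dim), nn.Sigmoid())
        self.audio_dec = nn.Sequential(nn.Linear(shared_dim, mod_dim), nn.Sigmoid(),
                                       nn.Linear(mod_dim, audio_dim))
        self.video_dec = nn.Sequential(nn.Linear(shared_dim, mod_dim), nn.Sigmoid(),
                                       nn.Linear(mod_dim, video_dim))

    def forward(self, audio, video):
        h = self.shared_enc(torch.cat([self.audio_enc(audio),
                                       self.video_enc(video)], dim=1))
        return self.audio_dec(h), self.video_dec(h)
```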
22. Training Bimodal Deep Autoencoder
[Diagram: the same model trained under three input conditions: audio only, video only, and audio + video; in every case the shared representation must reconstruct both audio and video.]
• Train a single model to perform all 3 tasks (a minimal training sketch follows below)
• Similar in spirit to denoising autoencoders (Vincent et al., 2008)
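One way to realize the three-task training above: for each minibatch, randomly keep only audio, only video, or both at the input (zeroing the hidden modality), while always reconstructing both modalities. The zero-masking and uniform task sampling are assumptions for illustration; the analogy to denoising autoencoders is that the input is corrupted by removing a modality.

```python
import random
import torch
import torch.nn.functional as F

def training_step(model, optimizer, audio, video):
    # Pick one of the three tasks for this minibatch.
    mode = random.choice(["audio_only", "video_only", "both"])
    audio_in = audio if mode in ("audio_only", "both") else torch.zeros_like(audio)
    video_in = video if mode in ("video_only", "both") else torch.zeros_like(video)
    # Whatever the input condition, reconstruct both modalities.
    audio_hat, video_hat = model(audio_in, video_in)
    loss = F.mse_loss(audio_hat, audio) + F.mse_loss(video_hat, video)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```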
24. Visualizations of Learned Features
[Figure: audio (spectrogram) and video features learned over 100 ms windows, shown at 0 ms, 33 ms, 67 ms, and 100 ms.]
25. Lip-reading with AVLetters
● AVLetters:
○ 26-way Letter Classification
○ 10 Speakers
○ 60×80-pixel lip regions
● Cross-modality learning
[Diagram: video input encoded into the learned representation, which reconstructs both audio and video.]

Feature learning: Audio + Video
Supervised learning: Video
Testing: Video
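A sketch of that protocol, reusing the illustrative BimodalDeepAutoencoder above: features are learned with audio + video, but the supervised letter classifier is trained and tested on video-only features. The zeroed-audio encoding and the linear SVM are assumptions for illustration; the slide does not specify the classifier.

```python
import torch
from sklearn.svm import LinearSVC

def video_features(model, video, audio_dim=100):
    # Shared-representation features computed from video alone (audio input zeroed),
    # matching the cross-modality setting used during feature learning.
    with torch.no_grad():
        zero_audio = torch.zeros(video.shape[0], audio_dim)
        h = model.shared_enc(torch.cat([model.audio_enc(zero_audio),
                                        model.video_enc(video)], dim=1))
    return h.numpy()

def lipreading_accuracy(model, train_video, train_letters, test_video, test_letters):
    # Supervised learning and testing both use video only (26-way letter labels).
    clf = LinearSVC().fit(video_features(model, train_video), train_letters)
    return clf.score(video_features(model, test_video), test_letters)
```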
26. Lip-reading with AVLetters
Feature Representation | Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002) | 44.6%
Local Binary Pattern (Zhao & Barnard, 2009) | 58.5%
27. Lip-reading with AVLetters
Feature Representation | Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002) | 44.6%
Local Binary Pattern (Zhao & Barnard, 2009) | 58.5%
Video-Only Learning (Single Modality Learning) | 54.2%
39. McGurk Effect
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Audio Input | Video Input | Model predicts /ga/ | Model predicts /ba/ | Model predicts /da/
/ga/ | /ga/ | 82.6% | 2.2% | 15.2%
/ba/ | /ba/ | 4.4% | 89.1% | 6.5%
40. McGurk Effect
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Audio Input | Video Input | Model predicts /ga/ | Model predicts /ba/ | Model predicts /da/
/ga/ | /ga/ | 82.6% | 2.2% | 15.2%
/ba/ | /ba/ | 4.4% | 89.1% | 6.5%
/ba/ | /ga/ | 28.3% | 13.0% | 58.7%
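To probe this with the sketches above, one would encode a mismatched pair (audio from one phone, video from another) through the shared representation and read off a phone classifier's scores. `phone_classifier` is a hypothetical classifier over /ga/, /ba/, /da/ trained on matched examples; this is an assumption for illustration, not the exact setup behind the table.

```python
import torch

def mcgurk_probe(model, phone_classifier, audio_ba, video_ga):
    # Mismatched pair: audio of /ba/, video of /ga/.
    with torch.no_grad():
        h = model.shared_enc(torch.cat([model.audio_enc(audio_ba),
                                        model.video_enc(video_ga)], dim=1))
    return phone_classifier(h)  # e.g. scores over /ga/, /ba/, /da/
```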
41. Conclusion
● Applied deep autoencoders to discover features in multimodal data
● Cross-modality learning: we obtained better video features (for lip-reading) using audio as a cue
● Multimodal feature learning: learned representations that are shared across audio and video data
[Diagrams: cross-modality learning (video input encoded into the learned representation, reconstructing audio and video) and multimodal feature learning (audio and video inputs encoded into a shared representation, reconstructing both).]