Multimodal deep learning is a technique for learning feature representations from multiple modalities at once. Earlier neural networks analyzed a single source of input, such as audio, video, or text alone; a multimodal model combines several input sources and produces better-quality predictions.
4. Representing Lips
• Can we learn better representations for audio/visual speech recognition?
• How can multimodal data (multiple sources of input) be used to find better features?
14. Bimodal Autoencoder
[Diagram: audio input and video input are encoded into a single hidden representation, which is decoded into an audio reconstruction and a video reconstruction.]
Adapted from: MIT 191
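A minimal sketch of the bimodal autoencoder in the diagram above, written in PyTorch; the layer sizes, sigmoid activations, and the simple concatenation of the two inputs are illustrative assumptions, not the exact configuration from the slides.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Audio + video in, one hidden representation, both modalities reconstructed."""
    def __init__(self, audio_dim=100, video_dim=300, hidden_dim=512):
        super().__init__()
        # Both modalities feed a single shared hidden layer.
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden_dim),
            nn.Sigmoid(),
        )
        # Separate decoders reconstruct each modality from the hidden code.
        self.audio_decoder = nn.Linear(hidden_dim, audio_dim)
        self.video_decoder = nn.Linear(hidden_dim, video_dim)

    def forward(self, audio, video):
        h = self.encoder(torch.cat([audio, video], dim=1))
        return self.audio_decoder(h), self.video_decoder(h)
```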
15. Bimodal Autoencoder
[Diagram: video input alone is encoded into the hidden representation, which still produces both an audio reconstruction and a video reconstruction.]
Cross-modality learning: learn better video features by using audio as a cue.
Adapted from: MIT 191
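A sketch of the cross-modality objective described above: the model sees only video at the input (here the audio input is simply zeroed, an assumption in the spirit of the slide), yet it must still reconstruct both modalities, so the video features have to carry audio-relevant information. `model` is the illustrative BimodalAutoencoder from the previous sketch.

```python
import torch
import torch.nn.functional as F

def cross_modality_loss(model, audio, video):
    # Hide the audio input; the model sees video only.
    zero_audio = torch.zeros_like(audio)
    audio_hat, video_hat = model(zero_audio, video)
    # Reconstruction targets are the true audio and video signals.
    return F.mse_loss(audio_hat, audio) + F.mse_loss(video_hat, video)
```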
21. Bimodal Deep Autoencoders
[Diagram: audio input and video input pass through modality-specific layers into a shared representation, which is decoded back into an audio reconstruction and a video reconstruction. The video pathway captures “visemes” (mouth shapes) and the audio pathway captures “phonemes”.]
Adapted from: MIT 191
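A sketch of the deeper variant: each modality gets its own hidden layer before a shared representation, and the shared code is decoded back through modality-specific layers. Layer counts, sizes, and activations are assumptions for illustration; any layer-wise pretraining is omitted.

```python
import torch
import torch.nn as nn

class BimodalDeepAutoencoder(nn.Module):
    """Modality-specific layers feed a shared representation, decoded per modality."""
    def __init__(self, audio_dim=100, video_dim=300, mod_dim=256, shared_dim=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, mod_dim), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, mod_dim), nn.Sigmoid())
        self.shared_enc = nn.Sequential(nn.Linear(2 * mod_dim, shared_dim), nn.Sigmoid())
        self.audio_dec = nn.Sequential(nn.Linear(shared_dim, mod_dim), nn.Sigmoid(),
                                       nn.Linear(mod_dim, audio_dim))
        self.video_dec = nn.Sequential(nn.Linear(shared_dim, mod_dim), nn.Sigmoid(),
                                       nn.Linear(mod_dim, video_dim))

    def forward(self, audio, video):
        h = self.shared_enc(torch.cat([self.audio_enc(audio),
                                       self.video_enc(video)], dim=1))
        return self.audio_dec(h), self.video_dec(h)
```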
22. Training Bimodal Deep Autoencoder
[Diagram: the same model trained under three input conditions: audio only, video only, and audio + video; in every case the shared representation must reconstruct both audio and video.]
• Train a single model to perform all 3 tasks (a minimal training sketch follows below)
• Similar in spirit to denoising autoencoders (Vincent et al., 2008)
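One way to realize the three-task training above: for each minibatch, randomly keep only audio, only video, or both at the input (zeroing the hidden modality), while always reconstructing both modalities. The zero-masking and uniform task sampling are assumptions for illustration; the analogy to denoising autoencoders is that the input is corrupted by removing a modality.

```python
import random
import torch
import torch.nn.functional as F

def training_step(model, optimizer, audio, video):
    # Pick one of the three tasks for this minibatch.
    mode = random.choice(["audio_only", "video_only", "both"])
    audio_in = audio if mode in ("audio_only", "both") else torch.zeros_like(audio)
    video_in = video if mode in ("video_only", "both") else torch.zeros_like(video)
    # Whatever the input condition, reconstruct both modalities.
    audio_hat, video_hat = model(audio_in, video_in)
    loss = F.mse_loss(audio_hat, audio) + F.mse_loss(video_hat, video)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```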
24. Visualizations of Learned Features
[Figure: audio (spectrogram) and video features learned over 100 ms windows, shown at 0 ms, 33 ms, 67 ms, and 100 ms.]
25. Lip-reading with AVLetters
● AVLetters:
○ 26-way Letter Classification
○ 10 Speakers
○ 60×80-pixel lip regions
● Cross-modality learning
[Diagram: video input encoded into the learned representation, which reconstructs both audio and video.]

Feature learning: Audio + Video
Supervised learning: Video
Testing: Video
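A sketch of that protocol, reusing the illustrative BimodalDeepAutoencoder above: features are learned with audio + video, but the supervised letter classifier is trained and tested on video-only features. The zeroed-audio encoding and the linear SVM are assumptions for illustration; the slide does not specify the classifier.

```python
import torch
from sklearn.svm import LinearSVC

def video_features(model, video, audio_dim=100):
    # Shared-representation features computed from video alone (audio input zeroed),
    # matching the cross-modality setting used during feature learning.
    with torch.no_grad():
        zero_audio = torch.zeros(video.shape[0], audio_dim)
        h = model.shared_enc(torch.cat([model.audio_enc(zero_audio),
                                        model.video_enc(video)], dim=1))
    return h.numpy()

def lipreading_accuracy(model, train_video, train_letters, test_video, test_letters):
    # Supervised learning and testing both use video only (26-way letter labels).
    clf = LinearSVC().fit(video_features(model, train_video), train_letters)
    return clf.score(video_features(model, test_video), test_letters)
```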
26. Lip-reading with AVLetters
Feature Representation | Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002) | 44.6%
Local Binary Pattern (Zhao & Barnard, 2009) | 58.5%
27. Lip-reading with AVLetters
Feature Representation | Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002) | 44.6%
Local Binary Pattern (Zhao & Barnard, 2009) | 58.5%
Video-Only Learning (Single Modality Learning) | 54.2%
39. McGurk Effect
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Audio Input | Video Input | Model predicts /ga/ | Model predicts /ba/ | Model predicts /da/
/ga/ | /ga/ | 82.6% | 2.2% | 15.2%
/ba/ | /ba/ | 4.4% | 89.1% | 6.5%
40. McGurk Effect
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Audio Input | Video Input | Model predicts /ga/ | Model predicts /ba/ | Model predicts /da/
/ga/ | /ga/ | 82.6% | 2.2% | 15.2%
/ba/ | /ba/ | 4.4% | 89.1% | 6.5%
/ba/ | /ga/ | 28.3% | 13.0% | 58.7%
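To probe this with the sketches above, one would encode a mismatched pair (audio from one phone, video from another) through the shared representation and read off a phone classifier's scores. `phone_classifier` is a hypothetical classifier over /ga/, /ba/, /da/ trained on matched examples; this is an assumption for illustration, not the exact setup behind the table.

```python
import torch

def mcgurk_probe(model, phone_classifier, audio_ba, video_ga):
    # Mismatched pair: audio of /ba/, video of /ga/.
    with torch.no_grad():
        h = model.shared_enc(torch.cat([model.audio_enc(audio_ba),
                                        model.video_enc(video_ga)], dim=1))
    return phone_classifier(h)  # e.g. scores over /ga/, /ba/, /da/
```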
41. Conclusion
● Applied deep autoencoders to discover features in multimodal data
● Cross-modality learning: we obtained better video features (for lip-reading) using audio as a cue
● Multimodal feature learning: learned representations that are shared across audio and video data
[Diagrams: cross-modality learning (video input encoded into the learned representation, reconstructing audio and video) and multimodal feature learning (audio and video inputs encoded into a shared representation, reconstructing both).]