Multimodal deep learning

  • In this work, I am going to talk about audio-visual speech recognition and how we can apply deep learning to this multimodal setting. For example, given a short speech segment together with video of a person saying a letter, can we determine which letter was said from the images of the lips and from the audio, and how do we integrate these two sources of data? Multimodal learning involves relating information from multiple sources. For example, images and 3-D depth scans are correlated at first order, since depth discontinuities often manifest as strong edges in images. Conversely, audio and visual data for speech recognition have non-linear correlations at a "mid-level", as phonemes or visemes; it is difficult to relate raw pixels to audio waveforms or spectrograms. In this work we are interested in modeling these "mid-level" relationships, so we use audio-visual speech classification to validate our methods. In particular, we focus on learning representations for speech audio that are coupled with videos of the lips.
  • So how do we solve this problem? A common machine learning pipeline looks like this: take the inputs, extract some features, and feed them into our standard ML toolbox (e.g., a classifier). The hardest part is really the features, that is, how we represent the audio and video data for use in the classifier. For audio, the speech community has developed many features, such as MFCCs, that work very well; it is far less obvious what features we should use for lips. (A minimal sketch of this feature-plus-classifier pipeline appears after these notes.)
  • So what do state-of-the-art features look like? Engineering these features took a long time. To this end, we address two questions in this work. [click] Furthermore, what is interesting about this problem is the "deep" question: audio and video features are only related at a deep level.
  • Concretely, our task is to convert a sequence of lip images into a vector of numbers, and similarly for the audio.
  • Now that we have multimodal data, one easy approach is to simply concatenate the features. However, concatenating the features like this fails to model the interactions between the modalities. This is a very limited view of multimodal features; instead, what we would like to do [click] is to
  • Find better ways to relate the audio and visual inputs, and obtain features that arise from relating them together.
  • Next I am going to describe a different feature learning setting. Suppose that at test time only the lip images are available and you do not get the audio signal, while at training time you have both audio and video. Can the audio at training time help you do better at test time, even though you do not have audio at test time? (lip-reading not well defined) But there are more settings to consider. If our task is only to do lip reading, i.e., visual speech recognition, an interesting question to ask is: can we improve our lip-reading features if we have audio data during training?
  • Let us step back a bit and take a similar but related approach to the problem. What if we learn an autoencoder? This still has the same problem, but now we can do something interesting.
  • There are different versions of these shallow models, and if you train a model of this form, this is what one usually gets. If you look at the hidden units, it turns out that many of them respond to the video input only or to the audio input only; the model learns many unimodal units (the figure shows the connectivity). So why doesn't this work? We think there are two possible reasons: (1) the model has no incentive to relate the modalities, and (2) we are trying to relate raw pixel values directly to values in the audio spectrogram. The latter is really difficult; for example, we do not expect a change in a single pixel value to tell us how the audio pitch is changing. Instead, what we expect is for mid-level video features, such as mouth motions, to inform us about the audio content. The relations across the modalities are deep, so we really need a deep model. (A code sketch of this shallow bimodal autoencoder appears after these notes.)
  • This still has the same problem, but now we can do something interesting: this model will be trained on clips with both audio and video.
  • However, the connections between audio and video are (arguably) deep rather than shallow, so ideally we want to extract mid-level features before trying to connect the modalities together. Since audio is really good for speech recognition, the model is going to learn representations that can reconstruct audio, and thus hopefully be good for speech recognition as well. (A sketch of this cross-modality deep autoencoder appears after these notes.)
  • But what we would like is not to have to train many versions of these models. It turns out that you can unify the separate models.
  • [pause] The second model we present is the bimodal deep autoencoder. What we want this bimodal deep autoencoder to do is to learn representations that relate both the audio and the video data. Concretely, we want it to learn representations that are robust to which input modality is present. (A sketch of this model and its training procedure appears after these notes.)
  • The features correspond to mouth motions and are also paired up with the audio spectrogram. The features are generic and are not speaker specific.
  • Explain in phases!
  • Explain in phases
  • Explain in phases
  • Multimodal deep learning
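
The "features plus standard classifier" pipeline mentioned in the notes can be made concrete with a small audio-side sketch. It uses librosa MFCCs pooled over each clip and a linear SVM; the file paths, pooling scheme, parameters, and classifier choice are illustrative assumptions, not the setup used in the talk.

```python
import numpy as np
import librosa
from sklearn.svm import LinearSVC

def audio_features(wav_path, sr=16000, n_mfcc=13):
    """Summarize one short clip as the mean and std of its MFCC frames
    (a simple pooling; the talk's models work on spectrogram windows)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical lists of clip paths and letter/digit labels:
# X = np.stack([audio_features(p) for p in clip_paths])
# clf = LinearSVC().fit(X, labels)
# predictions = clf.predict(X_test)
```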
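
Here is a minimal PyTorch sketch of the shallow bimodal autoencoder described above, the model that tends to learn mostly unimodal hidden units. The layer sizes, sigmoid activations, and mean-squared-error objective are placeholder assumptions; the actual models in the talk are pretrained with sparse RBMs.

```python
import torch
import torch.nn as nn

class ShallowBimodalAE(nn.Module):
    """Concatenate audio and video, encode with a single hidden layer,
    and reconstruct both modalities from that one layer."""
    def __init__(self, d_audio=100, d_video=300, d_hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_audio + d_video, d_hidden), nn.Sigmoid())
        self.dec_audio = nn.Linear(d_hidden, d_audio)
        self.dec_video = nn.Linear(d_hidden, d_video)

    def forward(self, audio, video):
        h = self.encoder(torch.cat([audio, video], dim=1))
        return self.dec_audio(h), self.dec_video(h)

# Trained with loss = mse(audio_hat, audio) + mse(video_hat, video).
# Nothing forces the hidden units to mix modalities, so many of them
# end up connected to only one of the two inputs.
```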
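
The cross-modality deep autoencoder from the notes (video in, both modalities reconstructed) can be sketched in the same style; the depths and widths below are again assumptions rather than the reported architecture.

```python
import torch.nn as nn

class CrossModalityDeepAE(nn.Module):
    """Video-only input passes through mid-level layers to a learned
    representation that must reconstruct both audio and video, so the
    video features are pushed to carry audio-relevant information."""
    def __init__(self, d_audio=100, d_video=300, d_mid=512, d_rep=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_video, d_mid), nn.Sigmoid(),  # mid-level video features (e.g. mouth motions)
            nn.Linear(d_mid, d_rep), nn.Sigmoid(),    # learned representation
        )
        self.dec_audio = nn.Sequential(nn.Linear(d_rep, d_mid), nn.Sigmoid(), nn.Linear(d_mid, d_audio))
        self.dec_video = nn.Sequential(nn.Linear(d_rep, d_mid), nn.Sigmoid(), nn.Linear(d_mid, d_video))

    def forward(self, video):
        rep = self.encoder(video)
        return self.dec_audio(rep), self.dec_video(rep)
```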
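
Finally, a sketch of the bimodal deep autoencoder together with the "three task" training from slide 22 (both modalities, audio only, video only). Zeroing out the missing modality and using an MSE reconstruction loss are simplifying assumptions made here, in the spirit of the denoising-style training the talk describes.

```python
import torch
import torch.nn as nn

class BimodalDeepAE(nn.Module):
    """Separate audio and video encoders meet in a shared representation
    that reconstructs both modalities."""
    def __init__(self, d_audio=100, d_video=300, d_mid=512, d_shared=256):
        super().__init__()
        self.enc_audio = nn.Sequential(nn.Linear(d_audio, d_mid), nn.Sigmoid())
        self.enc_video = nn.Sequential(nn.Linear(d_video, d_mid), nn.Sigmoid())
        self.shared = nn.Sequential(nn.Linear(2 * d_mid, d_shared), nn.Sigmoid())
        self.dec_audio = nn.Sequential(nn.Linear(d_shared, d_mid), nn.Sigmoid(), nn.Linear(d_mid, d_audio))
        self.dec_video = nn.Sequential(nn.Linear(d_shared, d_mid), nn.Sigmoid(), nn.Linear(d_mid, d_video))

    def forward(self, audio, video):
        h = self.shared(torch.cat([self.enc_audio(audio), self.enc_video(video)], dim=1))
        return self.dec_audio(h), self.dec_video(h)

def three_task_loss(model, audio, video, mse=nn.MSELoss()):
    """Present (audio, video), (audio, zeros), and (zeros, video) in turn,
    always reconstructing both modalities, so the shared layer becomes
    robust to which modality is available at test time."""
    loss = 0.0
    for a_in, v_in in [(audio, video),
                       (audio, torch.zeros_like(video)),
                       (torch.zeros_like(audio), video)]:
        a_hat, v_hat = model(a_in, v_in)
        loss = loss + mse(a_hat, audio) + mse(v_hat, video)
    return loss
```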

    1. TRUONG THI THU HOAI: MULTIMODAL DEEP LEARNING PRESENTATION
    2. MULTIMODAL DEEP LEARNING. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng. Computer Science Department, Stanford University; Department of Music, Stanford University; Computer Science & Engineering Division, University of Michigan, Ann Arbor.
    3. MCGURK EFFECT: In speech recognition, people are known to integrate audio-visual information in order to understand speech. This was first exemplified in the McGurk effect, where a visual /ga/ with a voiced /ba/ is perceived as /da/ by most subjects.
    4. AUDIO-VISUAL SPEECH RECOGNITION
    5. FEATURE CHALLENGE: Classifier (e.g., SVM)
    6. REPRESENTING LIPS: Can we learn better representations for audio/visual speech recognition? How can multimodal data (multiple sources of input) be used to find better features?
    7. UNSUPERVISED FEATURE LEARNING
    8. UNSUPERVISED FEATURE LEARNING
    9. MULTIMODAL FEATURES
    10. CROSS-MODALITY FEATURE LEARNING
    11. FEATURE LEARNING MODELS
    12. BACKGROUND: Sparse Restricted Boltzmann Machines (RBMs). (A minimal sketch of RBM training with a sparsity penalty appears after the slide list.)
    13. FEATURE LEARNING WITH AUTOENCODERS (diagram: separate autoencoders, one mapping Audio Input to Audio Reconstruction and one mapping Video Input to Video Reconstruction)
    14. BIMODAL AUTOENCODER (diagram: Audio Input and Video Input feed a single Hidden Representation, which produces both the Audio Reconstruction and the Video Reconstruction)
    15. SHALLOW LEARNING (diagram: hidden units connected to the Video Input and Audio Input). Mostly unimodal features are learned.
    16. BIMODAL AUTOENCODER (diagram: only the Video Input feeds the Hidden Representation, which still produces both the Video Reconstruction and the Audio Reconstruction). Cross-modality learning: learn better video features by using audio as a cue.
    17. CROSS-MODALITY DEEP AUTOENCODER (diagram: the Video Input passes through intermediate layers to a Learned Representation that reconstructs both audio and video)
    18. CROSS-MODALITY DEEP AUTOENCODER (diagram: the Audio Input passes through intermediate layers to a Learned Representation that reconstructs both audio and video)
    19. BIMODAL DEEP AUTOENCODERS (diagram: the Audio Input and Video Input pass through modality-specific layers, labeled "Phonemes" and "Visemes" (mouth shapes), into a Shared Representation that reconstructs both audio and video)
    20. BIMODAL DEEP AUTOENCODERS (diagram: the same network with only the Video Input presented; the "Visemes" (mouth shapes) pathway still reconstructs both audio and video)
    21. BIMODAL DEEP AUTOENCODERS (diagram: the same network with only the Audio Input presented; the "Phonemes" pathway still reconstructs both audio and video)
    22. TRAINING BIMODAL DEEP AUTOENCODER (diagram: three training configurations, with both inputs, audio only, and video only, all reconstructing both modalities through the Shared Representation). Train a single model to perform all 3 tasks; similar in spirit to denoising autoencoders.
    23. EVALUATIONS
    24. VISUALIZATIONS OF LEARNED FEATURES (figure: audio (spectrogram) and video features learned over 100 ms windows, shown at 0 ms, 33 ms, 67 ms, and 100 ms)
    25. LEARNING SETTINGS: We will consider the learning settings shown in Figure 1.
    26. LIP-READING WITH AVLETTERS. AVLetters: 26-way letter classification, 10 speakers, 60x80 pixel lip regions. (Diagram: cross-modality deep autoencoder with only the Video Input producing the Learned Representation.) Cross-modality learning setting: Feature Learning on Audio + Video, Supervised Learning on Video, Testing on Video.
    27. LIP-READING WITH AVLETTERS
        Feature Representation                              | Classification Accuracy
        Multiscale Spatial Analysis (Matthews et al., 2002) | 44.6%
        Local Binary Pattern (Zhao & Barnard, 2009)         | 58.5%
    28. LIP-READING WITH AVLETTERS (adds to the table above)
        Video-Only Learning (Single Modality Learning)      | 54.2%
    29. LIP-READING WITH AVLETTERS (adds to the table above)
        Our Features (Cross Modality Learning)              | 64.4%
    30. LIP-READING WITH CUAVE. CUAVE: 10-way digit classification, 36 speakers. (Diagram: cross-modality deep autoencoder with only the Video Input producing the Learned Representation.) Cross-modality learning setting: Feature Learning on Audio + Video, Supervised Learning on Video, Testing on Video.
    31. LIP-READING WITH CUAVE
        Feature Representation                             | Classification Accuracy
        Baseline Preprocessed Video                        | 58.5%
        Video-Only Learning (Single Modality Learning)     | 65.4%
    32. LIP-READING WITH CUAVE (adds to the table above)
        Our Features (Cross Modality Learning)             | 68.7%
    33. LIP-READING WITH CUAVE (adds to the table above)
        Discrete Cosine Transform (Gurban & Thiran, 2009)  | 64.0%
        Visemic AAM (Papandreou et al., 2009)              | 83.0%
    34. MULTIMODAL RECOGNITION. CUAVE: 10-way digit classification, 36 speakers. (Diagram: bimodal deep autoencoder with the Audio Input and Video Input feeding the Shared Representation.) Evaluate in clean and noisy audio scenarios; in the clean audio scenario, audio alone performs extremely well. Setting: Feature Learning on Audio + Video, Supervised Learning on Audio + Video, Testing on Audio + Video.
    35. MULTIMODAL RECOGNITION
        Feature Representation                           | Classification Accuracy (Noisy Audio at 0 dB SNR)
        Audio Features (RBM)                             | 75.8%
        Our Best Video Features                          | 68.7%
    36. MULTIMODAL RECOGNITION (adds to the table above)
        Bimodal Deep Autoencoder                         | 77.3%
    37. MULTIMODAL RECOGNITION (adds to the table above)
        Bimodal Deep Autoencoder + Audio Features (RBM)  | 82.2%
    38. SHARED REPRESENTATION EVALUATION. Setting: Feature Learning on Audio + Video, Supervised Learning on Audio, Testing on Video. (Diagram: a linear classifier is trained on the Shared Representation computed from audio and tested on the Shared Representation computed from video.)
    39. SHARED REPRESENTATION EVALUATION. Method: Learned Features + Canonical Correlation Analysis. (A sketch of this evaluation appears after the slide list.)
        Feature Learning | Supervised Learning | Testing | Accuracy
        Audio + Video    | Audio               | Video   | 57.3%
        Audio + Video    | Video               | Audio   | 91.7%
    40. MCGURK EFFECT: A visual /ga/ combined with an audio /ba/ is often perceived as /da/.
        Audio Input | Video Input | Predicted /ga/ | Predicted /ba/ | Predicted /da/
        /ga/        | /ga/        | 82.6%          | 2.2%           | 15.2%
        /ba/        | /ba/        | 4.4%           | 89.1%          | 6.5%
    41. MCGURK EFFECT (adds to the table above)
        /ba/        | /ga/        | 28.3%          | 13.0%          | 58.7%
    42. CONCLUSION: We applied deep autoencoders to discover features in multimodal data. Cross-modality learning: we obtained better video features (for lip-reading) by using audio as a cue. Multimodal feature learning: we learned shared representations that relate the audio and video data.
    43. THANK YOU FOR YOUR ATTENTION!
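
Slide 12 lists sparse Restricted Boltzmann Machines as the basic building block. The sketch below shows one contrastive-divergence (CD-1) update with a sparsity penalty for a binary-unit RBM in NumPy; the models in the talk use sparsity-regularized RBMs over real-valued audio/video inputs, so treat this as an illustration of the idea rather than the exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseRBM:
    """Minimal Bernoulli-Bernoulli RBM trained with CD-1 plus a sparsity penalty
    that nudges the mean hidden activation toward a small target."""
    def __init__(self, n_visible, n_hidden, lr=0.01, sparsity_target=0.05, sparsity_cost=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr, self.rho, self.lam = lr, sparsity_target, sparsity_cost

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0):
        # Positive phase: hidden probabilities and a sample given the data.
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back to visibles and hiddens.
        v1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(v1)
        # CD-1 gradient estimates.
        batch = v0.shape[0]
        dW = (v0.T @ ph0 - v1.T @ ph1) / batch
        db_v = (v0 - v1).mean(axis=0)
        db_h = (ph0 - ph1).mean(axis=0)
        # Sparsity penalty: push each unit's mean activation toward rho.
        db_h += self.lam * (self.rho - ph0.mean(axis=0))
        self.W += self.lr * dW
        self.b_v += self.lr * db_v
        self.b_h += self.lr * db_h
```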
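
Slide 39 evaluates the shared representation with "learned features + CCA": a linear classifier is trained on one modality's shared-layer activations and tested on the other modality's. A scikit-learn sketch is below; the array names are hypothetical placeholders for shared-layer activations and labels, and the exact preprocessing used in the paper may differ.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

def hearing_to_see_accuracy(H_audio_tr, H_video_tr, y_tr, H_video_te, y_te, k=20):
    """Train a linear classifier on audio-side shared representations and test it
    on video-side ones, after aligning the two views with CCA fit on paired data."""
    # Center each view ourselves; the CCA below is fit with scale=False.
    mu_a, mu_v = H_audio_tr.mean(axis=0), H_video_tr.mean(axis=0)
    A, V = H_audio_tr - mu_a, H_video_tr - mu_v
    cca = CCA(n_components=k, scale=False).fit(A, V)
    A_tr = A @ cca.x_rotations_                    # audio projected into the canonical space
    V_te = (H_video_te - mu_v) @ cca.y_rotations_  # video projected into the same space
    clf = LinearSVC().fit(A_tr, y_tr)              # supervised learning on audio
    return clf.score(V_te, y_te)                   # testing with video only

# Usage (hypothetical arrays of shared-layer activations and labels):
# acc = hearing_to_see_accuracy(H_audio_train, H_video_train, y_train, H_video_test, y_test)
```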
