7. Overview — block diagram: an audio stream (audio feature selection and extraction) and a video stream (visual feature selection and extraction) feed an audio-visual integration stage; the system is compared across audio-only ASR, visual-only ASR (lip reading), and audio-visual ASR.
11. Multiple multi-class SVM architecture — audio stream: MFCC + SCF features (under noise), PCA feature selection, audio multi-class SVM; video stream: ROI extraction, 2D + time SIFT descriptors (under noise), PCA, video multi-class SVM; kCCA estimates a joint audio-visual feature fed to a joint multi-class SVM; adaptive fusion with confidence factors drives the final decision making.
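The adaptive-fusion step on this slide can be sketched in miniature. The snippet below is a minimal illustration, not the paper's actual method: it assumes each stream's multi-class SVM emits per-class scores, and that the confidence factors act as normalized weights in a late (decision-level) fusion. All names and numbers are hypothetical.

```python
def fuse_scores(audio_scores, video_scores, audio_conf, video_conf):
    """Confidence-weighted late fusion of per-class scores from the
    audio and video classifiers (a simplifying assumption; the exact
    fusion rule in the presented system may differ)."""
    total = audio_conf + video_conf
    w_audio, w_video = audio_conf / total, video_conf / total
    return [w_audio * a + w_video * v
            for a, v in zip(audio_scores, video_scores)]

# Hypothetical per-class scores for a 10-digit vocabulary.
audio = [0.05, 0.05, 0.10, 0.40, 0.05, 0.05, 0.10, 0.05, 0.10, 0.05]
video = [0.10, 0.05, 0.05, 0.30, 0.10, 0.05, 0.10, 0.10, 0.10, 0.05]

# Under heavy acoustic noise the audio confidence drops, shifting
# weight toward the visual stream.
fused = fuse_scores(audio, video, audio_conf=0.3, video_conf=0.7)
predicted_digit = max(range(len(fused)), key=fused.__getitem__)
```

The point of the confidence factors is that when one modality degrades (acoustic noise, visual occlusion), its weight shrinks and the decision leans on the cleaner stream.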
13. Isolated word recognition — recognition rate (%) over 10 digits using kCCA and multiple multi-class SVMs, compared with the average generalization performance (%) from [Movellan 1996]:
1. Visual features are more robust to occlusion than to salt-and-pepper noise.
2. Residual information in degraded signals is extracted and used.
3. The recognition rate under occlusion is higher than in [Movellan 1996].
What you hear depends on what you see

We know that human speech perception is bimodal: we use both sight and hearing. The best way to see the importance of this bimodality is to watch a short video sequence demonstrating the McGurk effect (link above). The McGurk effect, first demonstrated by Harry McGurk and John MacDonald in 1976, is a compelling demonstration that we all use visual speech information, and it establishes the bimodality of human speech perception.

I am now going to play the video sequence twice. You will see and hear a person speaking six syllables. Watch the mouth closely, but concentrate on what you are hearing too. START THE MOVIE NOW.

Now close your eyes and listen carefully. START THE MOVIE NOW.

In fact, the syllable [ba] was dubbed onto lip movements for [ga], and normal adults report hearing [da]. In other words, what you hear depends on whether your eyes are open or closed. We have seen that the visual signal is as important as the audio signal for human speech perception.
Psychologist - It's not what you say but the way you say it: matching faces and voices