  • What you hear depends on what you see. Human speech perception is bimodal: we use both sight and hearing. The best way to see the importance of this bimodality is to watch the short video sequence demonstrating the McGurk effect (link above), first reported by Harry McGurk and John MacDonald in 1976. The effect is a compelling demonstration of how we all use visual speech information, and it establishes the bimodality of human speech perception. The video is played twice: you will see and hear a person speaking six syllables. First, watch the mouth closely while also concentrating on what you are hearing; then close your eyes and listen carefully. The syllable [ba] has in fact been dubbed onto lip movements for [ga], and normal adults report hearing [da]. In other words, what you hear depends on whether your eyes are open or closed: the video signal is as important as the audio signal for human speech perception.

MPhil Transfer Presentation Transcript

  • Audio-visual speech reading system. Samuel Pachoud, MPhil transfer presentation
  • McGurk effect
  • Why bimodality?
    • Audio vs. video
      • 43 phonemes → 15 visemes (British English); the mapping is many-to-one (see the sketch after this list):
        • /k/, /ɡ/, /ŋ/ → /k/
        • /tʃ/, /ʃ/, /dʒ/, /ʒ/ → /ch/
      • Facial discrimination is more accurate than acoustic discrimination in some cases [Lander 2007]:
        • /l/ and /r/ can be acoustically quite similar → “grass” vs. “glass”
    • Noisy environments:
      • Which cue to rely on?
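
A minimal illustration in Python of the many-to-one phoneme-to-viseme mapping, using only the two groups quoted on the slide; the ASCII stand-ins (e.g. "ng" for /ŋ/, "tS" for /tʃ/) are mine, and this is not a full British English table:

```python
# Only the two viseme groups mentioned on the slide; several phonemes
# collapse onto a single viseme, so lip reading alone cannot separate them.
PHONEME_TO_VISEME = {
    "k": "k", "g": "k", "ng": "k",                  # /k/, /ɡ/, /ŋ/ -> viseme /k/
    "tS": "ch", "S": "ch", "dZ": "ch", "Z": "ch",   # /tʃ/, /ʃ/, /dʒ/, /ʒ/ -> viseme /ch/
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence onto its coarser viseme sequence."""
    return [PHONEME_TO_VISEME.get(p, p) for p in phonemes]

print(to_visemes(["g", "ng", "S", "Z"]))            # ['k', 'k', 'ch', 'ch']
```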
  • Difficulties
    • Representation and extraction
      • Low-level “perceptual” information
      • Filtering a noisy signal depends heavily on the nature of the noise
    • Integration
      • Adaptive and effective
      • Works with degraded input signals
  • Current limitations
    • Lip deformation, self-occlusion, distortion [Nefian 2002, Göcke 2005]
    • Manual labelling or alignment between frames [Chen 2001, Aleksic 2002, Wang 2004]
    • No explicit use of the close link between audio and video [Potamianos 2001, Gordan 2002]
    • No studies with both audio and video degraded
      • Except [Movellan 1996], who used a small data corpus (only 4 classes)
  • Our contributions
    • Occlusion, missing data
      • A set of built-in space-time visual features [CVPR 2008]
    • Synchronization
      • Similar structure for both audio and visual feature extraction [BMVC 2008]
    • Degraded signals
      • Use of a discriminative model to provide levels of confidence [BMVC 2009, under review]
      • Use of canonical correlation to fuse audio and visual features
  • Overview (system diagram): audio and visual feature selection and extraction run in parallel on the audio and video streams; their outputs feed audio-only ASR, visual-only ASR (lip reading), and, via audio-visual integration, audio-visual ASR.
  • Spatio-temporal features
    • Lip motion features:
    Image-to-image approaches track features between two separate frames (here, three moving features), whereas space-time volume modelling treats the sequence as a single volume containing several spatio-temporal features (see the sketch below).
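
A minimal sketch of a space-time volume feature, assuming a grayscale lip region-of-interest sequence; it uses a histogram of spatio-temporal gradient orientations as an illustration and is not the authors' exact 2D + time SIFT descriptor:

```python
# Illustrative space-time volume descriptor: histogram of spatial gradient
# orientations weighted by the full spatio-temporal gradient magnitude.
import numpy as np

def space_time_descriptor(volume, n_bins=8):
    """volume: (T, H, W) grayscale lip ROI sequence with values in [0, 1]."""
    gt, gy, gx = np.gradient(volume.astype(np.float64))   # temporal + spatial gradients
    spatial_ori = np.arctan2(gy, gx)                       # spatial orientation per voxel
    magnitude = np.sqrt(gx**2 + gy**2 + gt**2)             # space-time gradient magnitude
    hist, _ = np.histogram(spatial_ori, bins=n_bins,
                           range=(-np.pi, np.pi), weights=magnitude)
    return hist / (hist.sum() + 1e-8)                      # normalised descriptor

# Usage with a synthetic 25-frame, 32x48 ROI volume:
roi_volume = np.random.rand(25, 32, 48)
print(space_time_descriptor(roi_volume))
```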
  • Confidence factors
    • Which cue to rely on?
      • Levels of confidence of the audio and visual signals
      • Comparing the distribution between training and testing set
    • Confidence factors
      • Provided by single modality classification using Support Vector Machine (SVM)
      • Used to select the most effective strategy for integrating the audio and visual features (see the sketch below)
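
A minimal sketch of how per-modality confidence factors could be obtained; it assumes, as an illustration rather than the slides' exact formulation, that each single-modality multi-class SVM is trained with probability estimates and that the peak posterior on a test sample serves as that modality's confidence:

```python
# Hypothetical confidence-factor sketch: one SVM per modality,
# confidence = peak class posterior on each test sample.
import numpy as np
from sklearn.svm import SVC

def modality_confidence(train_X, train_y, test_X):
    clf = SVC(kernel="rbf", probability=True).fit(train_X, train_y)
    proba = clf.predict_proba(test_X)        # class posteriors per test sample
    return proba.max(axis=1), clf            # peak posterior = confidence factor

rng = np.random.default_rng(0)
y = np.tile(np.arange(10), 20)               # 200 samples over the 10 digit classes
audio_X = rng.normal(size=(200, 39))         # stand-in for MFCC + SCF features
video_X = rng.normal(size=(200, 128))        # stand-in for visual (ROI) features

audio_conf, _ = modality_confidence(audio_X[:150], y[:150], audio_X[150:])
video_conf, _ = modality_confidence(video_X[:150], y[:150], video_X[150:])
# The relative confidences can then drive the choice of integration strategy.
print(audio_conf.mean(), video_conf.mean())
```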
  • Adaptive fusion
    • Capture the linear relationship between audio and video (see the sketch after this list)
      • Canonical Correlation Analysis (CCA)
    • Create a canonical space based on uncontaminated input signals
      • Extract dominant canonical factors from the training set
    • Map and construct the testing set
      • Using the trained regression matrices and canonical factor pairs
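
A minimal sketch of canonical-correlation fusion, using plain (linear) CCA from scikit-learn as a stand-in for the kernel CCA mentioned on the next slide; the feature dimensions and data are synthetic placeholders:

```python
# CCA fusion sketch: canonical factors are learnt on clean training features,
# then test features are projected into the shared canonical space and
# concatenated into a joint feature.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
audio_train = rng.normal(size=(150, 39))            # e.g. MFCC + SCF features
video_train = rng.normal(size=(150, 64))            # e.g. PCA-reduced visual features
audio_test = rng.normal(size=(50, 39))
video_test = rng.normal(size=(50, 64))

cca = CCA(n_components=10)                          # keep the dominant canonical factors
cca.fit(audio_train, video_train)                   # learnt on uncontaminated signals
a_c, v_c = cca.transform(audio_test, video_test)    # map the test set into canonical space
joint_test = np.hstack([a_c, v_c])                  # joint feature for the multi-class SVM
print(joint_test.shape)                             # (50, 20)
```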
  • Multiple multi-class SVM (system diagram): audio features (MFCC + SCF) and visual features (ROI, 2D + time SIFT descriptors, PCA), each with added noise, go through feature extraction and selection; single-modality audio and video multi-class SVMs provide confidence factors; adaptive fusion via kCCA yields audio and visual feature estimates and a joint feature; a final multi-class SVM performs decision making and recognition (a simplified sketch of this last stage follows below).
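
A simplified sketch of the final decision stage only, assuming the fused joint feature is already available; a single multi-class SVM over the ten digit classes stands in for the "decision making" block:

```python
# Decision stage sketch: classify fused joint features with one multi-class SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
y_train = np.tile(np.arange(10), 15)                # 150 training samples, 10 digits
y_test = np.tile(np.arange(10), 5)                  # 50 test samples
joint_train = rng.normal(size=(150, 20))            # stand-in for fused (kCCA) features
joint_test = rng.normal(size=(50, 20))

digit_svm = SVC(kernel="rbf", decision_function_shape="ovr").fit(joint_train, y_train)
print("recognition rate:", (digit_svm.predict(joint_test) == y_test).mean())
```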
  • Used database (digits)
    • Clean data
    • Degraded signals
    Examples: a digit-0 region-of-interest with salt-and-pepper occlusion, and digit-0 audio at SNR = -5 dB (see the sketch below).
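
A minimal sketch of how such degradations could be produced, assuming salt-and-pepper corruption of the visual ROI and additive white noise mixed into the audio at a target SNR; the exact corruption procedure is not specified on the slide:

```python
# Illustrative signal degradation: salt-and-pepper occlusion for the ROI,
# additive white noise at a requested SNR for the audio.
import numpy as np

def salt_and_pepper(roi, amount=0.3, rng=np.random.default_rng(2)):
    """roi: 2-D array in [0, 1]; 'amount' is the fraction of corrupted pixels."""
    out = roi.copy()
    mask = rng.random(roi.shape) < amount
    out[mask] = rng.integers(0, 2, mask.sum())       # corrupted pixels set to 0 or 1
    return out

def add_noise_at_snr(audio, snr_db=-5.0, rng=np.random.default_rng(3)):
    """Mix white noise into 'audio' so the mixture has the requested SNR (dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return audio + rng.normal(scale=np.sqrt(noise_power), size=audio.shape)

roi = np.random.rand(32, 48)                                  # synthetic lip ROI
wave = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))     # synthetic 1 s audio
occluded = salt_and_pepper(roi)
noisy = add_noise_at_snr(wave, snr_db=-5.0)
```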
  • Isolated word recognition: recognition rate (%) over 10 digits using kCCA and multiple MSVM, compared against the average generalisation performance (%) from [Movellan 1996]. 1. Visual features are more robust to occlusion than to salt-and-pepper noise. 2. Residual information in the degraded signals is extracted and used. 3. The recognition rate under occlusion is higher than in [Movellan 1996].
  • Problems addressed
    • Degraded signals
      • Robust and accurate spatio-temporal feature representation
    • Adaptive fusion
      • Using Canonical Correlation Analysis (CCA)
      • Capable of combining features according to the conditions at hand
    • Isolated word recognition (digits)
      • Based on multiple Multi-class Support Vector Machine (MSVM)
  • Discussion
    • AV-ASR implies continuous speech recognition
      • Continuous speech contains structure
    • Difficult to do with MSVM
      • Segmentation
      • Scanning
    • Remaining issues:
      • Need for a structural system
      • Need for a data corpus containing contextual information
  • Future work
    • Extend to a structural model → Structured Support Vector Machine (SSVM) [Tsochantaridis 2005]
      • Create a joint feature map
    • Evaluation performed using the GRID audiovisual sentence corpus
      • Six-word sentences with a particular structure: command, colour, preposition, letter, digit, adverb
  • Thank you. Questions?