Audio-visual speech reading system
Samuel Pachoud
MPhil transfer presentation
McGurk effect
Why bimodality?
Difficulties
Current limitations
Our contributions
Overview
[Diagram] AUDIO (t): audio feature selection and extraction, leading to audio-only ASR. VIDEO (t): visual feature selection and extraction, leading to visual-only ASR (lip reading). Audio-visual integration leads to audio-visual ASR.
Spatio-temporal features
Image-to-image approach: 2 separate frames of 3 moving features.
Space-time volume modelling: a sequence with several spatio-temporal features.
Confidence factors
Adaptive fusion
Multiple Multi-class SVM
[Diagram] Stages: feature extraction, feature selection, adaptive fusion, decision making, recognition.
Audio branch: AUDIO + noise, MFCC + SCF, audio multi-class SVM.
Video branch: VIDEO + noise, ROI, 2D + time SIFT descriptors, PCA, video multi-class SVM.
Joint branch: kCCA (audio and visual feature estimate), joint feature, multi-class SVM.
Other blocks: confidence factors.
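The diagram names an adaptive fusion stage driven by confidence factors, but the slide text does not give the formula. As a rough illustration only, the sketch below weights each modality's class probabilities by an entropy-based confidence. The entropy measure, the softmax step, and the function names (`entropy_confidence`, `adaptive_fusion`) are assumptions made for this example and are not taken from the presentation.

```python
import numpy as np


def softmax(scores):
    """Turn raw classifier scores into probabilities (numerically stable)."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def entropy_confidence(probs):
    """Assumed confidence factor: 1 minus normalized entropy, clipped to [0, 1].

    A flat distribution (uninformative classifier) gives a value near 0,
    a sharply peaked one gives a value near 1.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return float(np.clip(1.0 - entropy / np.log(probs.size), 0.0, 1.0))


def adaptive_fusion(audio_scores, video_scores):
    """Fuse per-class scores from two classifiers with confidence weights.

    Returns the predicted class index and the weight given to audio.
    """
    p_audio = softmax(np.asarray(audio_scores, dtype=float))
    p_video = softmax(np.asarray(video_scores, dtype=float))
    c_audio = entropy_confidence(p_audio)
    c_video = entropy_confidence(p_video)
    total = c_audio + c_video
    # Fall back to equal weights when neither modality is informative.
    w_audio = 0.5 if total <= 0.0 else c_audio / total
    fused = w_audio * p_audio + (1.0 - w_audio) * p_video
    return int(np.argmax(fused)), float(w_audio)


if __name__ == "__main__":
    # Hypothetical scores over 10 digits: audio is uninformative, video favours digit 3.
    audio = np.zeros(10)
    video = np.eye(10)[3] * 5.0
    digit, w_audio = adaptive_fusion(audio, video)
    print(digit, round(w_audio, 3))
```

With this weighting, a modality whose classifier output is close to uniform (for example, audio at very low SNR) contributes little to the decision, which is the general behaviour an adaptive fusion scheme aims for.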
Used database (digits)
Digit 0: salt and pepper, occlusion
Digit 0, SNR = -5dB
Region-of-Interest
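The slide mentions two kinds of degradation: audio at SNR = -5dB and salt-and-pepper noise on the image. The sketch below shows one common way to produce such test data. The use of white Gaussian noise, the 8-bit pixel range, and the function names are assumptions for illustration; the slides do not say how the degraded samples were generated.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR (dB)."""
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def salt_and_pepper(image, fraction, rng):
    """Set roughly `fraction` of the pixels to 0 (pepper) or 255 (salt).

    Assumes an 8-bit grayscale image; half of the corrupted pixels become
    pepper and half become salt.
    """
    noisy = np.array(image, copy=True)
    mask = rng.random(noisy.shape)
    noisy[mask < fraction / 2.0] = 0
    noisy[mask > 1.0 - fraction / 2.0] = 255
    return noisy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 16000, endpoint=False)
    clean = np.sin(2.0 * np.pi * 440.0 * t)         # stand-in for a speech signal
    noisy_audio = add_noise_at_snr(clean, -5.0, rng)

    roi = np.full((64, 64), 128, dtype=np.uint8)    # stand-in for a mouth region
    noisy_roi = salt_and_pepper(roi, 0.2, rng)
    print(noisy_audio.shape, noisy_roi.shape)
```

At -5 dB the noise carries about three times the power of the signal, which is why the visual channel matters for recognition in that condition.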
Isolated word recognition
Recognition rate (%) over 10 digits using kCCA and multiple MSVM
Average generalization performance (%) from [Movellan 1996]
1. Visual features are more robust to occlusion than to salt-and-pepper noise.
2. Residual information in degraded signals is extracted and used.
3. Higher recognition rate under occlusion compared to [Movellan 1996].
Problems addressed
Discussion
Future work
Thank you Questions?




Editor's Notes

  1. What you hear depends on what you see. We know that human speech perception is bimodal: we use both sight and hearing. The best way to see the importance of this bimodality is to look at a short video sequence describing the McGurk effect (link above). The McGurk effect was first demonstrated by Harry McGurk and John MacDonald in 1976. It is a compelling demonstration of how we all use visual speech information, and it establishes the bimodality of human speech perception. Now I'm going to run the video sequence twice. You will see and hear a person speaking six syllables. Watch the mouth closely, but concentrate on what you're hearing too. START THE MOVIE NOW. Now close your eyes and listen carefully. START THE MOVIE NOW. Indeed, the syllable [ba] had been dubbed onto lip movements for [ga], and normal adults reported hearing [da]. That means that what you hear depends on whether your eyes are open or closed. We have seen that the video signal is as important as the audio one for human speech perception.
  2. Psychologist - It's not what you say but the way you say it: matching faces and voices