7. Overview — block diagram: an audio stream (audio feature selection and extraction) and a video stream (visual feature selection and extraction) feed an audio-visual integration stage; the system is compared across audio-only ASR, visual-only ASR (lip reading), and audio-visual ASR.
11. Multiple multi-class SVM architecture — audio stream: MFCC + SCF features (under noise), PCA feature selection, audio multi-class SVM; video stream: ROI extraction, 2D + time SIFT descriptors (under noise), PCA, video multi-class SVM; kCCA estimates a joint audio-visual feature fed to a joint multi-class SVM; adaptive fusion with confidence factors drives the final decision making.
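The adaptive-fusion step on this slide can be sketched in miniature. The snippet below is a minimal illustration, not the paper's actual method: it assumes each stream's multi-class SVM emits per-class scores, and that the confidence factors act as normalized weights in a late (decision-level) fusion. All names and numbers are hypothetical.

```python
def fuse_scores(audio_scores, video_scores, audio_conf, video_conf):
    """Confidence-weighted late fusion of per-class scores from the
    audio and video classifiers (a simplifying assumption; the exact
    fusion rule in the presented system may differ)."""
    total = audio_conf + video_conf
    w_audio, w_video = audio_conf / total, video_conf / total
    return [w_audio * a + w_video * v
            for a, v in zip(audio_scores, video_scores)]

# Hypothetical per-class scores for a 10-digit vocabulary.
audio = [0.05, 0.05, 0.10, 0.40, 0.05, 0.05, 0.10, 0.05, 0.10, 0.05]
video = [0.10, 0.05, 0.05, 0.30, 0.10, 0.05, 0.10, 0.10, 0.10, 0.05]

# Under heavy acoustic noise the audio confidence drops, shifting
# weight toward the visual stream.
fused = fuse_scores(audio, video, audio_conf=0.3, video_conf=0.7)
predicted_digit = max(range(len(fused)), key=fused.__getitem__)
```

The point of the confidence factors is that when one modality degrades (acoustic noise, visual occlusion), its weight shrinks and the decision leans on the cleaner stream.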
13. Isolated word recognition — recognition rate (%) over 10 digits using kCCA and multiple multi-class SVMs, compared with the average generalization performance (%) from [Movellan 1996]:
1. Visual features are more robust to occlusion than to salt-and-pepper noise.
2. Residual information in degraded signals is extracted and used.
3. The recognition rate under occlusion is higher than in [Movellan 1996].
What you hear depends on what you see

We know that human speech perception is bimodal: we use both sight and hearing. The best way to see the importance of this bimodality is to watch a short video sequence demonstrating the McGurk effect (link above). The McGurk effect, first demonstrated by Harry McGurk and John MacDonald in 1976, is a compelling demonstration that we all use visual speech information, and it establishes the bimodality of human speech perception.

I am now going to play the video sequence twice. You will see and hear a person speaking six syllables. Watch the mouth closely, but concentrate on what you are hearing too. START THE MOVIE NOW.

Now close your eyes and listen carefully. START THE MOVIE NOW.

In fact, the syllable [ba] was dubbed onto lip movements for [ga], and normal adults report hearing [da]. In other words, what you hear depends on whether your eyes are open or closed. We have seen that the visual signal is as important as the audio signal for human speech perception.
Psychologist - It's not what you say but the way you say it: matching faces and voices