Audio-Visual Speech Processing
Gérard Chollet
with Meriem Bendris, Hervé Bredin, Thomas Hueber,
Walid Karam, Rémi Landais, Patrick Perrot, Eduardo Sanchez-Soto, Leila Zouari
Audio-Visual Speech Processing presentation at the IEEE conference on Advanced Technologies for Signal and Image Processing in Sousse, Tunisia, March 18th, 2014

Page 1: Audio-Visual Speech Processing
Gérard Chollet, with Meriem Bendris, Hervé Bredin, Thomas Hueber, Walid Karam, Rémi Landais, Patrick Perrot, Eduardo Sanchez-Soto, Leila Zouari
ATSIP, Sousse, March 18th 2014

Page 2 (ATSIP, Sousse, May 18th, 2014): Some motivations,…
■  A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
■  The combined use of facial and speech information improves identity verification and robustness to forgeries.
■  Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.
■  SmartPhones, VisioPhones, WebPhones, SecurePhones, visioconferences and virtual-reality worlds are gaining popularity.

Page 3: Some topics under study,…
■  Audio-visual speech recognition
–  Automatic ‘lip-reading’
■  Audio-visual speaker verification
–  Detection of forgeries
■  Speech-driven animation of the face
–  Could we look and sound like somebody else?
■  Speaker indexing
–  ‘Who is talking in a video sequence?’
■  OUISPER: a silent speech interface
–  Corpus-based synthesis from tongue and lips

Page 4: Audio Visual Speech Recognition
[Block diagram: features extraction feeds a decoder driven by acoustic models, a dictionary and a grammar.]

Page 5: Video Mike (IBM, 2004)

Page 6: Audio processing
■  Features extraction
■  Digits detection
■  Digits recognition:
•  Acoustic parameters: MFCC
•  Context-independent HMMs
•  Decoding: time-synchronous algorithm
■  Sound effect
–  Noise: babble
■  Recognition experiments
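The MFCC front end named on this slide can be sketched as follows. This is a minimal illustration with hypothetical parameter choices (frame size, hop, filterbank size), not the exact configuration used in the experiments: framing with a Hamming window, power spectrum, triangular mel filterbank, log compression, then DCT decorrelation.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=8000, n_fft=256, hop=128, n_mels=20, n_ceps=12):
    """Simplified MFCC extraction: framing, power spectrum,
    mel filterbank, log compression, DCT decorrelation."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i*hop:i*hop+n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular mel filterbank, equally spaced on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0))
```

In a full digit recogniser these vectors (plus deltas) would feed the context-independent HMMs and the time-synchronous decoder.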
Page 7: Video processing
■  Video extraction
■  Lips localisation
■  Images interpolation (same frequency as speech)
■  Features extraction
•  DCT and DCT2 (DCT + LDA)
•  Projections: PRO and PRO2 (PRO + LDA)
■  Recognition experiments

Page 8: Fusion techniques
■  Parameters fusion:
•  Concatenation
•  Dimension reduction: Linear Discriminant Analysis (LDA)
•  Modelling: classical HMM with one stream
■  Scores fusion: multi-stream HMM
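Parameter fusion as described here (concatenation followed by LDA) can be sketched on synthetic data. The features and dimensions below are invented stand-ins for the MFCC and lip-DCT streams, and an LDA classifier stands in for the downstream single-stream HMM:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, n)                                 # two pseudo digit classes
audio = labels[:, None] + 0.5 * rng.standard_normal((n, 13))   # pseudo MFCC stream
video = labels[:, None] - 0.5 * rng.standard_normal((n, 30))   # pseudo lip-DCT stream

fused = np.hstack([audio, video])          # step 1: parameter concatenation
lda = LinearDiscriminantAnalysis(n_components=1)
reduced = lda.fit_transform(fused, labels) # step 2: LDA dimension reduction
acc = (lda.predict(fused) == labels).mean()
```

The reduced stream, not the raw concatenation, is what would be modelled by the classical one-stream HMM.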
Page 9: Experimental results: parameters fusion
[Plot: % accuracy (0 to 100) vs. S/N (-15 to +10 dB) for speech only, video only (Pro2, DCT2) and AV fusion (Pro2, DCT2).]

Page 10: Experimental results: scores fusion at -5 dB
[Bar chart: accuracy (42 to 52 %) for speech only, AV: PRO, AV: PRO2, AV: DCT, AV: DCT2.]
Page 11: Audiovisual identity verification
■  Fusion of face and speech for identity verification
■  Detection of possible forgeries
■  Compulsory? For:
–  Homeland/firm security: restricted access,…
–  Secured computer login
–  Secured on-line signature of contracts

Page 12: Talking-face and 2D face sequence database
■  Data: video sequences (.avi) in which a short phrase in English is pronounced; duration ≈ 10 s (actual speech duration ≈ 2 s)
■  Audio-video data used for talking-face evaluations
■  Same sequences used for 2D-face-from-video evaluations
■  430 subjects pronounced 4 phrases:
–  from a set of 430 English phrases
–  2 indoor video files acquired during the first session
–  2 outdoor video files acquired during the second session
–  realistic forgeries created a posteriori

Page 13: Audio-Visual Speech Features
■  Visual: raw pixel values, DCT transform, shape-related features, many others…
■  Audio: raw amplitude, « classical » MFCC coefficients, many others…

Page 14: Audio-Visual Subspaces
■  Reduced audiovisual subspace: Principal Component & Linear Discriminant Analysis
■  Correlated audio & visual subspaces: Co-inertia & Canonical Correlation Analysis

Page 15: Correspondence Measures
■  Audiovisual subspace: Gaussian Mixture Models, Neural Networks, Coupled HMM
■  Correlated subspaces: Correlation, Mutual Information

Page 16: Application to indexing
■  High-level requests
–  “Find videos where John Doe is speaking”
–  “Find dialogues between Mr X and Mrs Y”
–  “Locate the singer in this music video”
■  Low-level measure: correlation between raw audio energy and raw pixel values

Page 17: Who is speaking?
■  Face tracking
■  Correlation between
–  pixels of each face
–  raw audio energy
■  Find maximum synchrony (green: current speaker)
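The speaker-localisation idea above can be sketched in a few lines. The data here is synthetic (one of three "face" pixel-activity signals is driven by the audio energy); a real system would use tracked face regions:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 500
audio_energy = np.abs(rng.standard_normal(t))       # raw audio energy contour
# pixel-change signal of three tracked faces; face 1 moves its lips with the audio
faces = [0.1 * rng.standard_normal(t) for _ in range(3)]
faces[1] = audio_energy + 0.1 * rng.standard_normal(t)

def current_speaker(audio_energy, face_signals):
    """Pick the face whose pixel activity is maximally correlated with audio energy."""
    corr = [np.corrcoef(audio_energy, f)[0, 1] for f in face_signals]
    return int(np.argmax(corr)), corr

speaker, corr = current_speaker(audio_energy, faces)
```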
Page 18: How to Perform “Talking-Face” Authentication?
■  Face recognition + speaker verification + score fusion
■  What if every modality says OK…? Deliberate imposture remains possible.

Page 19: Biometrics
■  Identity Verification with Talking Faces
–  Speaker Verification
–  Face Recognition
■  What if? Face: OK, voice: OK, and yet: NO

Page 20: Identity Verification
■  Enrolment of client λ → model for client λ
■  Test: person ε pretending to be client λ; accepted if the score exceeds a threshold, rejected otherwise
■  Co-Inertia Analysis; Equal Error Rate: 30 %

Page 21: Replay Attacks Detection
■  Training: synchrony model (Co-IA, CCA)
■  Test: accepted if the synchrony score exceeds a threshold, rejected otherwise

Page 22: Replay Attacks Detection
■  Genuine synchronized video vs. audio replay attack: lips do not match the audio perfectly
■  Equal Error Rate: 14 %

Page 23: Example of Replay attacks

Page 24: Alignment by maximum correlation
■  Delayed video / delayed audio; lags explored from -5 to +5 frames
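Alignment by maximum correlation, as on this slide, amounts to sliding one stream against the other over a small lag window and keeping the lag with the best normalised correlation. A minimal sketch on synthetic streams offset by 3 frames:

```python
import numpy as np

def best_lag(audio, video, max_lag=5):
    """Search lags -max_lag..+max_lag; return the one with maximum correlation."""
    best, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio[lag:], video[:len(video) - lag]
        else:
            a, v = audio[:lag], video[-lag:]
        c = np.corrcoef(a, v)[0, 1]
        if c > best_corr:
            best, best_corr = lag, c
    return best, best_corr

rng = np.random.default_rng(2)
x = rng.standard_normal(300)
audio = x[:-3]                 # the two streams are offset by 3 frames
video = x[3:]
lag, corr = best_lag(audio, video)
```

A large residual offset (or a low correlation at every lag) is a cue for a replay attack.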
Page 25: Audiovisual identity verification
■  Available features
–  Face: face features (lips, eyes) → Face Modality
–  Speech → Speech Modality
–  Speech synchrony → Synchrony Modality

Page 26: Audiovisual identity verification
■  Face modality
–  Detection:
•  Generative models (MPT toolbox)
•  Temporal median filtering
•  Eyes detection within faces
–  Normalization: geometry + illumination

Page 27: Audiovisual identity verification
■  Face Modality:
–  Two verification strategies and one single comparison framework
•  Global = Eigenfaces:
–  Calculation of a set of directions (eigenfaces) defining a projection space
–  Two faces are compared through their projections onto the eigenfaces space
–  Learning data: BIOMET (130 pers.) + BANCA (30 pers.)
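The global eigenfaces strategy can be sketched with plain PCA via SVD. The 16x16 random "faces" below are placeholders for real, geometry- and illumination-normalized face crops:

```python
import numpy as np

rng = np.random.default_rng(3)
# toy "face" images: 50 training faces of 16x16 pixels, flattened
train = rng.standard_normal((50, 256))

def eigenfaces(train, k=10):
    """PCA on training faces: the top-k right singular vectors are the eigenfaces."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k]

def project(face, mean, basis):
    return basis @ (face - mean)

mean, basis = eigenfaces(train)
probe = train[0] + 0.01 * rng.standard_normal(256)   # noisy copy of a known face
d_same = np.linalg.norm(project(probe, mean, basis) - project(train[0], mean, basis))
d_other = np.linalg.norm(project(probe, mean, basis) - project(train[1], mean, basis))
```

Verification then reduces to thresholding the distance between projection coefficients.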
Page 28: Audiovisual identity verification
■  Face Modality:
•  SIFT descriptors:
–  Keypoint extraction
–  Keypoint representation: 128-dimensional vector (gradient orientation histogram,…) + 4-dimensional position vector (x, y, scale, orientation)

Page 29: Audiovisual identity verification
■  Face Modality:
•  SVD-based matching method:
–  Compare two videos V1 and V2
–  Exclusive principle: one-to-one correspondences between faces (global) and descriptors (local)
–  Principle: proximity-matrix computation between faces or descriptors, then extraction of good pairings (made easy by SVD computation)
–  Scores: one matching score between global representations, one between local representations

Page 30: Variability!

Page 31: Audiovisual identity verification
■  Speech Modality:
–  GMM-based approach:
•  One world model
•  Each speaker model is derived from the world model by MAP adaptation
•  Speech verification score derived from a likelihood ratio
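The GMM world-model approach can be sketched as below. Note one loud simplification: instead of proper MAP adaptation, the client model is simply re-fitted with the world model as initialisation, which is only a crude stand-in; the likelihood-ratio scoring is as on the slide. All data is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
world = rng.standard_normal((1000, 13))         # pooled "MFCC" frames of many speakers
client = rng.standard_normal((200, 13)) + 1.5   # enrolment frames of one client

ubm = GaussianMixture(n_components=4, covariance_type='diag', random_state=0).fit(world)
# crude stand-in for MAP adaptation: re-fit initialised at the world model
client_gmm = GaussianMixture(n_components=4, covariance_type='diag', random_state=0,
                             means_init=ubm.means_, weights_init=ubm.weights_).fit(client)

def llr(frames):
    """Verification score: average log-likelihood ratio, client vs. world model."""
    return client_gmm.score(frames) - ubm.score(frames)

genuine = llr(rng.standard_normal((50, 13)) + 1.5)   # test frames from the client
impostor = llr(rng.standard_normal((50, 13)))        # test frames from someone else
```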
Page 32: Audiovisual identity verification
■  Synchrony Modality:
–  Principle: synchrony between lips and speech carries identity information
–  Process:
•  Computation of a synchrony model (CoIA analysis) for each person, based on DCT (visual signal) and MFCC (speech signal)
•  Comparison of the test sample with the synchrony model
Page 33: Audiovisual identity verification
■  Experiments:
–  BANCA database:
•  52 persons divided into two groups (G1 and G2)
•  3 recording conditions
•  1 person → 8 recordings (4 client accesses, 4 impostor accesses)
•  Evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
–  Scores:
•  4 scores per access (PCA face, SIFT face, speech, synchrony)
•  Score fusion based on an RBF-SVM: hyperplane learned on G1 and tested on G2, and conversely
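The RBF-SVM score-fusion step can be sketched as follows. The four per-access scores are synthetic stand-ins for the PCA-face, SIFT-face, speech and synchrony scores, and the 100/100 split mimics the learn-on-G1 / test-on-G2 protocol:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
n = 200
y = rng.integers(0, 2, n)                           # 1 = client access, 0 = impostor
# four per-access scores (PCA face, SIFT face, speech, synchrony), synthetic
scores = y[:, None] * 1.0 + 0.4 * rng.standard_normal((n, 4))

svm = SVC(kernel='rbf').fit(scores[:100], y[:100])  # learn fusion on group G1
acc = (svm.predict(scores[100:]) == y[100:]).mean() # evaluate on group G2
```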
Page 34: Audiovisual identity verification
■  Experiments: [results figure]

Page 35: SecurePhone
■  Technical solution that improves security
■  Biometric recognition
–  Makes use of VOICE, FACE and SIGNATURE
■  Electronic signature used to secure information exchange

Page 36: Biometrics in SecurePhone
■  Operation: face, voice and written signature are each pre-processed and modelled; FUSION of the three modalities decides Access Granted / Access Denied

Page 37: The BioSecure Multimodal Evaluation Campaign
■  Launched in April 2007
■  Many modalities, including ‘Video sequences’ and ‘Talking Faces’
■  Development data and reference systems available
■  Evaluations on the sequestered BioSecure database (1000 clients)
■  Debriefing workshop
■  More info on: http://www.int-evry.fr/biometrics/BMEC2007/index.php

Page 38: Audio-visual forgery scenarios
■  Low-effort
–  “Paparazzi” scenario
•  The impostor owns a picture of the face and a recording of the voice of the target
–  “Big Brother” scenario
•  The impostor owns a video of the face and a recording of the voice of the target
■  High-effort
–  “Imitator” scenario
•  The impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
–  “Playback” scenario
•  The impostor owns a picture of the face of the target and animates it according to his own face motion
–  “Ventriloquist” scenario
•  Combines the two previous ones

Page 39: Detection of imposture
■  Face modality: ACCEPTED!
■  Voice modality: ACCEPTED!
■  Synchronisation: DENIED!

Page 40: Talking-Face forgeries @ BMEC
■  Audio replay attack: audio replay + “random” face
■  Assumptions
–  Forger has recorded speech data from the genuine user in outdoor (test) conditions
–  Forger replays the audio and uses his own face in front of the sensor
■  Stolen wave → audio replay + forger face

Page 41: Talking-Face forgeries @ BMEC
■  Replay attack: CrazyTalk face animation + TTS
■  Assumptions
–  Forger has stolen a picture
–  Forger uses face-animation software and TTS (male or female)
–  Forger plays back the animation to the sensor
■  Stolen picture → contour detection → generated .avi

Page 42: Talking-Face forgeries @ BMEC
■  Replay attack: picture presentation + TTS forgeries
■  Assumptions
–  Forger has stolen a picture
–  Forger has printed the picture
–  Forger presents the picture to the sensor and uses TTS (same wave as for the face-animation forgery)

Page 43: Systems with fusion of (face, speech)
■  Video sequence → frames → face verification → face score; speech signal → speaker verification → speech score; both are combined into a fusion score

Page 44: Voice Conversion methods
■  GMM conversion
–  Training of a joint Gaussian model
•  Parallel corpus of aligned sentences of both source and target voice
•  MFCC on HNM (Harmonic plus Noise Model) parameterization
–  Speech synthesis from the Gaussian model
•  Inversion of the MFCC
•  Pitch correction
■  ALISP conversion
–  Very-low-bit-rate speech compression (500 bps) method
•  Originally developed by TELECOM-ParisTech
–  Indexed-segment dictionary system (of the target voice)
–  HNM parameterization
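The joint-GMM conversion idea can be sketched on toy 1-D features. This is the textbook formulation (fit a GMM on stacked source/target vectors from a parallel corpus, then convert each source frame to the posterior-weighted conditional mean E[y | x]), not the exact system of the slides; the linear source-to-target mapping below is synthetic:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(-1, 1, (n, 1))                           # source-speaker feature (1-D toy)
y = 2.0 * x + 1.0 + 0.01 * rng.standard_normal((n, 1))   # aligned target-speaker feature

joint = np.hstack([x, y])                                # parallel, aligned frames
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(joint)

def convert(x_new, gmm, dx=1):
    """Conversion = posterior-weighted conditional mean E[y | x] of the joint GMM."""
    mu_x, mu_y = gmm.means_[:, :dx], gmm.means_[:, dx:]
    out = np.zeros((len(x_new), gmm.means_.shape[1] - dx))
    # posterior of each mixture component given the source frame
    px = np.array([w * multivariate_normal.pdf(x_new, m.ravel(), c[:dx, :dx])
                   for w, m, c in zip(gmm.weights_, mu_x, gmm.covariances_)])
    post = px / px.sum(axis=0)
    for k, c in enumerate(gmm.covariances_):
        gain = c[dx:, :dx] @ np.linalg.inv(c[:dx, :dx])  # per-component regression
        out += post[k][:, None] * (mu_y[k] + (x_new - mu_x[k]) @ gain.T)
    return out

x_test = np.linspace(-0.8, 0.8, 50)[:, None]
converted = convert(x_test, gmm)
err = np.abs(converted - (2.0 * x_test + 1.0)).mean()
```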
Page 45: Voice conversion techniques
■  Definition: process of making one person's voice (the « source ») sound like another person's voice (the « target »)
■  Example: the source says “My name is John”; after conversion the sentence is heard in the target's voice

Page 46: Principle of ALISP
■  Coder: spectral analysis + prosodic analysis → selection of segmental units from a dictionary of representative segments → segment index + prosodic parameters
■  Decoder: concatenative synthesis (HNM) from the same dictionary → output speech

Page 47: Details of Encoding
■  Speech → spectral analysis + prosodic analysis
■  HMM recognition against a dictionary of HMM models of ALISP classes
■  Selection by DTW among the representative units of the class (synth units A1 … A8 for HMM A)
■  Outputs: index of ALISP class, index of synth unit, prosodic encoding (pitch, energy, duration)

Page 48: Details of decoding
■  Inputs: ALISP index, synth-unit index within the class, prosodic parameters
■  Loading of the synth unit (A1 … A8), then concatenative synthesis → output speech

Page 49: Principle of ALISP conversion
■  Learning step: one hour of target voice
–  Parametric analysis: MFCC
–  Segmentation based on temporal decomposition and vector quantization
–  Stochastic modelling based on HMM
–  Creation of representative units
■  Conversion step
–  Parametric analysis: MFCC
–  HMM recognition
–  Selection of a representative segment → DTW
■  Synthesis step
–  Concatenation of representative units
–  HNM synthesis
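The DTW selection step used in ALISP encoding and conversion can be sketched with a plain dynamic-programming implementation. The tiny 1-D "feature sequences" below are illustrative; real units would be MFCC frame sequences:

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between two feature sequences (frames x dims)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

unit = np.array([[0.0], [1.0], [2.0], [1.0]])
stretched = np.array([[0.0], [1.0], [1.0], [2.0], [1.0]])  # same shape, warped in time
other = np.array([[5.0], [5.0], [5.0], [5.0]])
```

Among the representative units of the recognised class, the one with the smallest DTW distance to the input segment would be selected.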
Page 50: Voice conversion using ALISP: results
■  Audio examples (source, result, target) from the NIST and BREF databases (female and male voices)

Page 51: Demonstration of Voice Conversion
■  Impostor voice / converted voice with GMM / converted voice with ALISP / converted voice with ALISP+GMM / target voice

Page 52: 3D reconstruction
•  3D face modelling from a front and a profile shot
•  Animated face
•  https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos

Page 53: Face Transformation
■  Control-point selection (Figure 1) and image segmentation (Figure 2: division of an image)
■  Linear transformation between source and target image, followed by a blending step

Page 54: Face Transformation
■  Steps: localisation of control points → warping → blending (source and target images)
■  Warping: X' = f(X); blending: p = αp + (1 − α)p'

Page 55: Face transformation (IBM)
Page 56: Ouisper1 - Silent Speech Interface
■  Sensor-based system allowing speech communication via standard articulators, but without glottal activity
■  Two distinct types of application
–  An alternative to tracheo-oesophageal speech (TES) for persons having undergone a tracheotomy
–  A "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments
■  Speech synthesis from ultrasound and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce

Page 57: Ouisper - System Overview
■  Training: ultrasound video of the vocal tract, optical video of the speaker's lips and recorded audio; speech alignment with the text; visual feature extraction into an audio-visual speech corpus
■  Test: visual data → visual speech recognizer → N-best phonetic or ALISP targets → visual unit selection → audio unit concatenation

Page 58: Ouisper - Training Data

Page 59: Ouisper - Video Stream Coding
■  T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, “EigenTongue Feature Extraction For An Ultrasound-based Silent Speech Interface,” IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
■  Build a subset of typical frames, perform PCA, and code new frames by their projections onto the set of eigenvectors

Page 60: Ouisper - Audio Stream Coding
■  ALISP segmentation
–  Detection of quasi-stationary parts in the parametric representation of speech
–  Assignment of segments to classes using unsupervised classification techniques
■  Phonetic segmentation
–  Forced alignment of speech with the text
–  Needs a relevant and correct phonetic transcription of the uttered signal
■  Corpus-based synthesis
–  Needs a preliminary segmental description of the signal

Page 61: Audiovisual dictionary building
■  Visual and acoustic data are synchronously recorded
■  Audio segmentation is used to bootstrap the visual speech recognizer
■  Train an HMM model for each phonetic class (e.g. /e-r/, /u-th/, /a-j/) → audiovisual dictionary

Page 62: Visuo-acoustic decoding
■  Visual speech recognition
–  Train an HMM model for each visual class
•  Use multistream-based learning techniques
–  Perform a « visuo-phonetic » decoding step
•  Use an N-best list
•  Introduce linguistic constraints: language model, dictionary, multigrams
■  Corpus-based speech synthesis
–  Combine probabilistic and data-driven approaches in the audiovisual unit-selection step

Page 63: Speech recognition from video-only data
■  Ref: ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh (“Open your book to the first page”)
■  Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh (“A wear your book shoe the verse page”)
■  Corpus-based synthesis driven by the predicted phonetic lattice is currently under study
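Recognition results like the Ref/Rec pair above are usually scored by word error rate, i.e. word-level Levenshtein distance normalised by the reference length. A minimal sketch, applied to the word transcriptions on this slide:

```python
def word_error_rate(ref, hyp):
    """Levenshtein distance over words (substitutions, insertions, deletions),
    normalised by the number of reference words."""
    r, h = ref.split(), hyp.split()
    D = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        D[i][0] = i
    for j in range(len(h) + 1):
        D[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = D[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D[len(r)][len(h)] / len(r)

wer = word_error_rate("open your book to the first page",
                      "a wear your book shoe the verse page")
```

Here the hypothesis has 3 substitutions and 1 insertion against 7 reference words, so the WER is 4/7.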
Page 64: Ouisper - Conclusion
■  More information on
–  http://www.neurones.espci.fr/ouisper/
■  Contacts
–  gerard.chollet@enst.fr
–  denby@ieee.org
–  hueber@ieee.org

Page 65: Audio-Visual Speech Processing: Conclusions and Perspectives
■  A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
■  The combined use of facial and speech information improves identity verification and robustness to forgeries.
■  Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.