Atsip avsp17
Audio-Visual Speech Processing presentation at the IEEE conference on Advanced Technologies for Signal and Image Processing in Sousse, Tunisia, March 18th, 2014

Atsip avsp17: Presentation Transcript

  • Audio-Visual Speech Processing. Gérard Chollet, with Meriem Bendris, Hervé Bredin, Thomas Hueber, Walid Karam, Rémi Landais, Patrick Perrot, Eduardo Sanchez-Soto, Leila Zouari. ATSIP, Sousse, March 18th, 2014
  • Page 2: Some motivations… ■ A talking face is more intelligible, expressive, recognisable, and attractive than acoustic speech alone. ■ The combined use of facial and speech information improves identity verification and robustness to forgeries. ■ Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces. ■ SmartPhones, VisioPhones, WebPhones, SecurePhones, visioconferences and virtual-reality worlds are gaining popularity.
  • Page 3: Some topics under study… ■ Audio-visual speech recognition – automatic 'lip-reading' ■ Audio-visual speaker verification – detection of forgeries ■ Speech-driven animation of the face – could we look and sound like somebody else? ■ Speaker indexing – 'Who is talking in a video sequence?' ■ OUISPER: a silent speech interface – corpus-based synthesis from tongue and lips
  • Page 4: Audio-Visual Speech Recognition. [Block diagram: feature extraction feeds a decoder that uses a dictionary, a grammar and acoustic models.]
  • Page 5: Video Mike (IBM, 2004)
  • Page 6: Audio processing ■ Feature extraction ■ Digit detection ■ Digit recognition: • acoustic parameters: MFCC • context-independent HMMs • decoding: time-synchronous algorithm ■ Sound effect – noise: babble ■ Recognition experiments
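
A minimal sketch of the MFCC front-end named on the slide above, assuming the librosa library; the file name and sampling rate are illustrative, not from the presentation:

```python
# Hedged sketch: MFCC acoustic parameters for the digit recogniser.
# `digits.wav` is a hypothetical recording; 16 kHz mono is an assumption.
import librosa

y, sr = librosa.load("digits.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients per frame
print(mfcc.shape)  # (13, n_frames)
```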
  • Page 7: Video processing ■ Video extraction ■ Lip localisation ■ Image interpolation (to the same rate as the audio features) ■ Feature extraction: • DCT and DCT2 (DCT+LDA) • projections: PRO and PRO2 (PRO+LDA) ■ Recognition experiments
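
A sketch of how the DCT visual features could be computed: a 2-D DCT of the grey-level lip region, keeping only a low-frequency block of coefficients. The region size and block size are illustrative assumptions:

```python
# Hedged sketch of DCT lip features; scipy provides the 1-D DCT, applied
# along both axes to obtain a 2-D DCT.
import numpy as np
from scipy.fftpack import dct

lip_roi = np.random.rand(32, 32)  # stand-in for a normalised lip image
coeffs = dct(dct(lip_roi.T, norm="ortho").T, norm="ortho")  # 2-D DCT
features = coeffs[:6, :6].ravel()  # keep 36 low-frequency coefficients
```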
  • Page 8: Fusion techniques ■ Parameter fusion: • concatenation • dimensionality reduction: Linear Discriminant Analysis (LDA) • modelling: classical single-stream HMM ■ Score fusion: multi-stream HMM
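
A hedged sketch of the parameter-fusion route: concatenate the per-frame audio and visual vectors, then reduce the dimension with LDA. All dimensions, class counts and data below are synthetic stand-ins:

```python
# Hedged sketch of parameter fusion (concatenation + LDA).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

audio = np.random.rand(500, 39)          # e.g. MFCC frames (synthetic)
video = np.random.rand(500, 36)          # e.g. DCT lip features (synthetic)
labels = np.random.randint(0, 10, 500)   # digit class per frame (synthetic)

fused = np.hstack([audio, video])        # concatenation
lda = LinearDiscriminantAnalysis(n_components=9).fit(fused, labels)
reduced = lda.transform(fused)           # at most n_classes - 1 dimensions
```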
  • Page 9: Experimental results: parameter fusion. [Plot: accuracy (%) versus S/N (dB, -15 to +10) for speech only, video only (PRO2, DCT2) and audio-visual fusion (PRO2, DCT2).]
  • Page 10: Experimental results: score fusion at -5 dB. [Bar chart: accuracy (%), roughly 42 to 52%, for speech only versus AV PRO, AV PRO2, AV DCT and AV DCT2.]
  • Page 11: Audiovisual identity verification ■ Fusion of face and speech for identity verification ■ Detection of possible forgeries ■ Possibly compulsory for: – homeland/company security: restricted access,… – secure computer login – secure on-line signing of contracts
  • Page 12: Talking-face and 2D-face sequence database ■ Data: video sequences (.avi) in which a short phrase in English is pronounced; duration ≈ 10 s (actual speech duration ≈ 2 s) ■ Audio-video data used for talking-face evaluations ■ The same sequences are used for 2D-face-from-video evaluations ■ 430 subjects each pronounced 4 phrases: – from a set of 430 English phrases – 2 indoor video files acquired during the first session – 2 outdoor video files acquired during the second session – realistic forgeries created a posteriori
  • Page 13: Audio-visual speech features. Visual: raw pixel values, DCT transform, shape-related features, many others. Audio: raw amplitude, 'classical' MFCC coefficients, many others.
  • Page 14: Audio-visual subspaces. A reduced audiovisual subspace is obtained by Principal Component and Linear Discriminant Analysis; correlated audio and visual subspaces are obtained by co-inertia and canonical correlation analysis.
  • Page 15: Correspondence measures. In the joint audiovisual subspace: Gaussian mixture models, neural networks, coupled HMMs. Between correlated subspaces: correlation, mutual information.
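
A hedged sketch of the correlated-subspace route named on the last two slides: canonical correlation analysis between the audio and visual streams, with plain correlation of the canonical variates as the correspondence measure. Feature dimensions and data are illustrative:

```python
# Hedged sketch: CCA between audio and visual features, correlation as score.
import numpy as np
from sklearn.cross_decomposition import CCA

audio = np.random.rand(300, 13)   # MFCC frames (synthetic stand-in)
video = np.random.rand(300, 36)   # DCT lip features (synthetic stand-in)

cca = CCA(n_components=2).fit(audio, video)
a_c, v_c = cca.transform(audio, video)
sync = np.corrcoef(a_c[:, 0], v_c[:, 0])[0, 1]  # synchrony score in [-1, 1]
```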
  • Page 16: Application to indexing ■ High-level requests – "Find videos where John Doe is speaking" – "Find dialogues between Mr X and Mrs Y" – "Locate the singer in this music video" [Diagram: correlation between raw audio energy and raw pixel values.]
  • Page 17: Who is speaking? ■ Face tracking ■ Correlation between – the pixels of each face and – the raw audio energy ■ Find the maximum synchrony. [Video frame: green box marks the current speaker.]
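
A hedged sketch of the speaker-localisation idea above: correlate a per-face pixel-activity trace with the raw audio energy and keep the most synchronous face. The helper name and signal definitions are assumptions for illustration:

```python
# Hedged sketch: pick the face whose pixel activity best matches audio energy.
import numpy as np

def speaking_face(face_traces, audio_energy):
    """face_traces: list of (n_frames,) mean pixel-change signals, one per
    tracked face; audio_energy: (n_frames,) raw frame energies."""
    scores = [abs(np.corrcoef(t, audio_energy)[0, 1]) for t in face_traces]
    return int(np.argmax(scores))  # index of the current speaker
```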
  • Page 18: How to perform "talking-face" authentication? [Diagram: face recognition → OK, speaker verification → OK, score fusion → OK. What if…? A deliberate imposture.]
  • Page 19: Biometrics ■ Identity verification with talking faces – speaker verification – face recognition ■ What if the face is OK and the voice is OK, but the access should still be denied?
  • Page 20: Identity verification. [Diagram: enrolment of client λ yields a model for client λ; a person ε pretending to be client λ is accepted if the score exceeds a threshold, rejected otherwise.] Co-inertia analysis, Equal Error Rate: 30%.
  • Page 21: Replay attack detection. [Diagram: training builds a synchrony model with Co-IA and CCA; at test time the sample is accepted if it matches the synchrony model, rejected otherwise.]
  • Page 22: Replay attack detection. In a genuine video the audio and lips are synchronized; under an audio replay attack the lips do not match the audio perfectly. Equal Error Rate: 14%.
  • Page 23: Example of replay attacks
  • Page 24: Alignment by maximum correlation. [Plot: correlation as a function of the audio/video offset, from delayed video (-5) to delayed audio (+5); the retained alignment is the offset maximizing the correlation.]
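
A hedged sketch of alignment by maximum correlation: slide one stream against the other over a small window of offsets and keep the most correlated lag. The function name and window size are illustrative:

```python
# Hedged sketch: estimate the audio/video offset by maximum correlation.
import numpy as np

def best_lag(audio, video, max_lag=5):
    """audio, video: equal-length 1-D feature traces; returns lag in frames."""
    def corr(lag):
        a = audio[max(0, lag):len(audio) + min(0, lag)]
        v = video[max(0, -lag):len(video) + min(0, -lag)]
        return np.corrcoef(a, v)[0, 1]
    return max(range(-max_lag, max_lag + 1), key=corr)
```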
  • Page 25: Audiovisual identity verification ■ Available features – face: facial features (lips, eyes) → face modality – speech → speech modality – speech synchrony → synchrony modality
  • Page 26: Audiovisual identity verification ■ Face modality – detection: • generative models (MPT toolbox) • temporal median filtering • eye detection within faces – normalization: geometry + illumination
  • Page 27: Audiovisual identity verification ■ Face modality: – two verification strategies and one single comparison framework • Global = eigenfaces: – computation of a set of directions (eigenfaces) defining a projection space – two faces are compared through their projections onto the eigenface space – learning data: BIOMET (130 persons) + BANCA (30 persons)
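
A hedged sketch of the eigenface strategy: PCA learns the projection space, and two faces are compared by the distance between their projections. Image size, component count and the training data are stand-ins for the BIOMET/BANCA material:

```python
# Hedged sketch: eigenfaces via PCA, comparison in the projection space.
import numpy as np
from sklearn.decomposition import PCA

train_faces = np.random.rand(160, 64 * 64)   # stand-in for training crops
pca = PCA(n_components=50).fit(train_faces)  # rows of pca.components_ = eigenfaces

def face_distance(img_a, img_b):
    pa, pb = pca.transform([img_a.ravel(), img_b.ravel()])
    return np.linalg.norm(pa - pb)           # smaller = more similar
```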
  • Page 28: Audiovisual identity verification ■ Face modality: • SIFT descriptors: – keypoint extraction – keypoint representation: a 128-dimensional vector (gradient-orientation histograms,…) plus a 4-dimensional vector for position (x, y), scale and orientation
  • Page 29: Audiovisual identity verification ■ Face modality: • SVD-based matching method: – compares two videos V1 and V2 – exclusivity principle: one-to-one correspondences between faces (global) and descriptors (local) – principle: a proximity matrix is computed between faces or descriptors, and good pairings are extracted (made easy by an SVD computation) – scores: one matching score between global representations and one between local representations
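
A hedged sketch of the SVD-based matching idea (in the spirit of Scott and Longuet-Higgins): build a proximity matrix between the two descriptor sets, orthogonalise it with an SVD, and keep pairings that are mutual row/column maxima. Names and the σ parameter are illustrative:

```python
# Hedged sketch: one-to-one pairings from a proximity matrix via SVD.
import numpy as np

def svd_matches(desc1, desc2, sigma=0.5):
    """desc1: (m, d), desc2: (n, d) descriptor arrays; returns index pairs."""
    dist = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    g = np.exp(-dist**2 / (2 * sigma**2))        # proximity matrix
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    p = u @ vt                                   # enforces exclusivity
    return [(i, j) for i, j in enumerate(np.argmax(p, axis=1))
            if np.argmax(p[:, j]) == i]          # keep mutual maxima only
```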
  • Page 30: Variability!
  • Page 31: Audiovisual identity verification ■ Speech modality: – GMM-based approach: • one world model • each speaker model is derived from the world model by MAP adaptation • the speech verification score is derived from a likelihood ratio
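
A hedged sketch of the world-model approach: train a background GMM, derive a client model by a simplified, means-only MAP adaptation, and score a test utterance by the average log-likelihood ratio. Component count, relevance factor and all data are assumptions:

```python
# Hedged sketch: GMM world model, MAP-adapted client model, LLR score.
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

world = GaussianMixture(n_components=64, covariance_type="diag", random_state=0)
world.fit(np.random.rand(5000, 13))               # stand-in background MFCCs

def map_adapt(ubm, client_frames, r=16.0):
    """Simplified MAP adaptation: shift only the means toward the client."""
    gmm = copy.deepcopy(ubm)
    resp = ubm.predict_proba(client_frames)       # (n, K) responsibilities
    n_k = resp.sum(axis=0)                        # soft counts per component
    mu_k = (resp.T @ client_frames) / np.maximum(n_k[:, None], 1e-8)
    alpha = (n_k / (n_k + r))[:, None]
    gmm.means_ = alpha * mu_k + (1 - alpha) * ubm.means_
    return gmm

client = map_adapt(world, np.random.rand(300, 13))
test = np.random.rand(200, 13)
llr = client.score(test) - world.score(test)      # verification score
```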
  • Page 32: Audiovisual identity verification ■ Synchrony modality: – principle: the synchrony between lips and speech carries identity information – process: • a synchrony model (co-inertia analysis) is computed for each person from DCT (visual signal) and MFCC (speech signal) features • the test sample is compared with the synchrony model
  • Page 33: Audiovisual identity verification ■ Experiments: – BANCA database: • 52 persons divided into two groups (G1 and G2) • 3 recording conditions • 1 person → 8 recordings (4 client accesses, 4 impostor accesses) • evaluation based on the P protocol: 234 client accesses and 312 impostor accesses – scores: • 4 scores per access (PCA face, SIFT face, speech, synchrony) • score fusion based on an RBF-SVM: the hyperplane is learned on G1 and tested on G2, and conversely
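
A hedged sketch of the score-fusion step: the four per-access scores are stacked into a vector and an RBF-kernel SVM is trained on one group and applied to the other. The 546 accesses match the 234 client + 312 impostor accesses quoted above; the score values themselves are synthetic:

```python
# Hedged sketch: RBF-SVM fusion of (PCA face, SIFT face, speech, synchrony).
import numpy as np
from sklearn.svm import SVC

g1_scores = np.random.rand(546, 4)            # per-access score vectors (G1)
g1_labels = np.random.randint(0, 2, 546)      # 1 = client, 0 = impostor
g2_scores = np.random.rand(546, 4)            # held-out group (G2)

svm = SVC(kernel="rbf").fit(g1_scores, g1_labels)
decisions = svm.predict(g2_scores)            # fused accept/reject decisions
```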
  • Page 34: Audiovisual identity verification ■ Experiments: [results figure]
  • Page 35: SecurePhone ■ A technical solution that improves security ■ Biometric recognition – makes use of voice, face and signature ■ An electronic signature is used to secure information exchange
  • Page 36: Biometrics in SecurePhone. [Diagram: face, voice and written signature are each pre-processed and modelled; the three results are fused to grant or deny access.]
  • Page 37: The BioSecure Multimodal Evaluation Campaign ■ Launched in April 2007 ■ Many modalities, including 'video sequences' and 'talking faces' ■ Development data and reference systems available ■ Evaluations on the sequestered BioSecure database (1000 clients) ■ Debriefing workshop ■ More info at: http://www.int-evry.fr/biometrics/BMEC2007/index.php
  • Page 38: Audio-visual forgery scenarios ■ Low effort – "Paparazzi" scenario: the impostor owns a picture of the face and a recording of the voice of the target – "Big Brother" scenario: the impostor owns a video of the face and a recording of the voice of the target ■ High effort – "Imitator" scenario: the impostor owns a recording of the voice of the target and transforms his own voice to sound like the target – "Playback" scenario: the impostor owns a picture of the face of the target and animates it according to his own face motion – "Ventriloquist" scenario: combines the two previous ones
  • Page 39: Detection of imposture. Face modality: ACCEPTED. Voice modality: ACCEPTED. Synchronisation: DENIED.
  • Page 40: Talking-face forgeries @ BMEC: audio replay attack ■ Assumptions – the forger has recorded speech data from the genuine user in outdoor (test) conditions – the forger replays the audio and uses his own face in front of the sensor. [Illustrations: stolen wave; audio replay + forger's face; audio replay + "random" face.]
  • Page 41: Talking-face forgeries @ BMEC: replay attack with face animation + TTS (CrazyTalk) ■ Assumptions – the forger has stolen a picture – the forger uses face-animation software and TTS (male or female voice) – the forger plays the animation back to the sensor. [Illustrations: stolen picture, contour detection, generated .avi.]
  • Page 42: Talking-face forgeries @ BMEC: replay attack with picture presentation + TTS ■ Assumptions – the forger has stolen a picture – the forger has printed the picture – the forger presents the picture to the sensor and uses TTS (the same wave as for the face-animation forgery). [Illustrations: stolen picture, presented picture.]
  • Page 43: Systems with fusion of (face, speech). [Diagram: the frames of the video sequence feed face verification, yielding a face score; the speech signal feeds speaker verification, yielding a speech score; the two are combined into a fusion score.]
  • Page 44: Voice conversion methods ■ GMM conversion – training of a joint Gaussian model: • a parallel corpus of aligned sentences from both the source and the target voice • MFCC-on-HNM (Harmonic plus Noise Model) parameterization – speech synthesis from the Gaussian model: • inversion of the MFCC • pitch correction ■ ALISP conversion – a very-low-bit-rate (500 bps) speech compression method • originally developed by TELECOM-ParisTech – a dictionary of indexed segments (of the target voice) – HNM parameterization
  • Page 45: Voice conversion techniques. Definition: the process of making one person's voice (the source) sound like another person's voice (the target). Example: source and target both say "My name is John".
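
A hedged sketch of the joint-GMM conversion described on page 44: fit a GMM on stacked (source, target) MFCC pairs, then map a source frame to its expected target frame through the per-component conditional means (a classic regression formulation; dimensions and data are synthetic):

```python
# Hedged sketch: joint-GMM voice conversion (source -> expected target frame).
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

D = 13                                        # MFCC dimension (assumption)
src = np.random.rand(2000, D)                 # aligned source frames (synthetic)
tgt = np.random.rand(2000, D)                 # aligned target frames (synthetic)
joint = GaussianMixture(n_components=32, covariance_type="full", random_state=0)
joint.fit(np.hstack([src, tgt]))              # model of stacked [x; y] vectors

def convert(x):
    """Expected target frame given a source frame x."""
    w, mu, cov = joint.weights_, joint.means_, joint.covariances_
    p = np.array([wk * multivariate_normal.pdf(x, m[:D], c[:D, :D])
                  for wk, m, c in zip(w, mu, cov)])
    p /= p.sum()                              # source-marginal responsibilities
    y = np.zeros(D)
    for pk, m, c in zip(p, mu, cov):          # mixture of conditional means
        y += pk * (m[D:] + c[D:, :D] @ np.linalg.solve(c[:D, :D], x - m[:D]))
    return y
```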
  • Page 46: Principle of ALISP. [Coder diagram: input speech undergoes spectral and prosodic analysis; segmental units are selected from a dictionary of representative segments; the segment index and prosodic parameters are transmitted; output speech is reconstructed by concatenative HNM synthesis from the same dictionary.]
  • Page 47: Details of encoding. [Diagram: speech undergoes spectral and prosodic analysis; HMM recognition against a dictionary of HMM models of ALISP classes yields the index of the ALISP class; within the class (e.g. synthesis units A1 … A8 for HMM A), a representative unit is selected by DTW; prosodic encoding yields pitch, energy and duration, transmitted together with the class and unit indices.]
  • Page 48: Details of decoding. [Diagram: the ALISP class index and the synthesis-unit index within the class (A1 … A8) select a stored unit; the unit is loaded and output speech is produced by concatenative synthesis using the transmitted prosodic parameters.]
  • Page 49: Principle of ALISP conversion. Learning step (one hour of target voice): – parametric analysis: MFCC – segmentation based on temporal decomposition and vector quantization – stochastic modelling based on HMMs – creation of representative units. Conversion step: – parametric analysis: MFCC – HMM recognition – selection of a representative segment → DTW. Synthesis step: – concatenation of representative units – HNM synthesis
  • Page 50: Voice conversion using ALISP: results. [Audio examples: source, result and target samples from the NIST database and the BREF database, with female and male speakers.]
  • Page 51: Demonstration of voice conversion. [Audio examples: impostor voice; converted voice with GMM; converted voice with ALISP; converted voice with ALISP+GMM; target voice.]
  • Page 52: 3D reconstruction • 3D face modelling from a front and a profile shot • Animated face • https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos
  • Page 53: Face transformation. Control point selection; image segmentation (Figure 1: control point selection; Figure 2: division of an image); linear transformation between the source and target images; blending step.
  • Page 54: Face transformation, from source to target: localisation of control points → warping, X' = f(X) → blending, p = αp + (1 - α)p'.
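
The blending step on this slide, p = αp + (1 - α)p', amounts to a pixel-wise alpha blend between the warped source and the target image; a minimal sketch, with function and argument names chosen for illustration:

```python
# Hedged sketch of the blending step: pixel-wise alpha blend.
import numpy as np

def blend(warped_src, target, alpha=0.5):
    """warped_src, target: float images of equal shape; alpha in [0, 1]."""
    return alpha * warped_src + (1 - alpha) * target
```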
  • Page 55: Face transformation (IBM)
  • Page 56: Ouisper(1): Silent Speech Interface ■ A sensor-based system allowing speech communication via the standard articulators, but without glottal activity ■ Two distinct types of application – an alternative to tracheo-oesophageal speech (TES) for persons having undergone a tracheotomy – a "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments ■ Speech synthesis from ultrasound and optical imagery of the tongue and lips. (1) Oral Ultrasound synthetIc SPEech souRce
  • Page 57: Ouisper: system overview. [Diagram. Training: ultrasound video of the vocal tract, optical video of the speaker's lips and the recorded audio are aligned with the text to build an audio-visual speech corpus, and visual features are extracted. Test: visual data feed a visual speech recognizer that outputs N-best phonetic or ALISP targets, followed by visual unit selection and audio unit concatenation.]
  • Page 58: Ouisper: training data
  • Page 59: Ouisper: video stream coding. Build a subset of typical frames; perform PCA; code new frames by their projections onto the set of eigenvectors. T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, "EigenTongue Feature Extraction for an Ultrasound-Based Silent Speech Interface", IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
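
A hedged sketch of the EigenTongue coding cited above: PCA on a subset of typical ultrasound frames, after which each new frame is represented by its projection coefficients. Frame size and component count are illustrative:

```python
# Hedged sketch: "EigenTongue"-style coding of ultrasound frames via PCA.
import numpy as np
from sklearn.decomposition import PCA

typical = np.random.rand(1000, 64 * 64)       # stand-in typical frames
pca = PCA(n_components=30).fit(typical)       # eigenvectors of tongue images

def code_frame(frame):
    return pca.transform(frame.reshape(1, -1))[0]  # 30 projection coefficients
```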
  • Page 60: Ouisper: audio stream coding ■ ALISP segmentation – detection of quasi-stationary parts in the parametric representation of speech – assignment of segments to classes using unsupervised classification techniques ■ Phonetic segmentation – forced alignment of the speech with the text – needs a relevant and correct phonetic transcription of the uttered signal ■ Corpus-based synthesis – needs a preliminary segmental description of the signal
  • Page 61: Audiovisual dictionary building ■ Visual and acoustic data are synchronously recorded ■ The audio segmentation is used to bootstrap the visual speech recognizer ■ An HMM model is trained for each phonetic class (e.g. /e-r/, /a-j/, /u-th/), yielding an audiovisual dictionary
  • Page 62: Visuo-acoustic decoding ■ Visual speech recognition – train an HMM model for each visual class • using multistream-based learning techniques – perform a "visuo-phonetic" decoding step • use an N-best list • introduce linguistic constraints: language model, dictionary, multigrams ■ Corpus-based speech synthesis – combine probabilistic and data-driven approaches in the audiovisual unit-selection step
  • Page 63: Speech recognition from video-only data. Ref: "ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh" (Open your book to the first page). Rec: "ax w ih y uh r b uh k sh uw dh ax v er s p ey jh" (A wear your book shoe the verse page). Corpus-based synthesis driven by the predicted phonetic lattice is currently under study.
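
To quantify such a recognition example, a hedged sketch of word error rate via Levenshtein alignment; this is the standard measure, not something specified in the slides:

```python
# Hedged sketch: word error rate by dynamic-programming edit distance.
def wer(ref, rec):
    r, h = ref.split(), rec.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))  # substitution
    return d[-1][-1] / len(r)

print(wer("open your book to the first page",
          "a wear your book shoe the verse page"))  # 4 edits / 7 words ≈ 0.57
```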
  • Page 64: Ouisper: conclusion ■ More information at – http://www.neurones.espci.fr/ouisper/ ■ Contacts – gerard.chollet@enst.fr – denby@ieee.org – hueber@ieee.org
  • Page 65: Audio-Visual Speech Processing: conclusions and perspectives ■ A talking face is more intelligible, expressive, recognisable, and attractive than acoustic speech alone. ■ The combined use of facial and speech information improves identity verification and robustness to forgeries. ■ Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.