Formant dynamics and durations of um improve the performance of automatic speaker recognition systems

Hughes, V., Foulkes, P. and Wood, S. (2016) Formant dynamics and durations of um improve the performance of automatic speaker recognition systems. Paper presented at the 16th Australasian Conference on Speech Science and Technology (ASSTA). University of Western Sydney, Australia. 6-9 December 2016.

  1. Formant dynamics and durations of um improve the performance of automatic speaker recognition systems. Vincent Hughes, Paul Foulkes, Sophie Wood. 16th Speech Science and Technology (SST) Conference, 9th December 2016.
  2. 1. Forensic voice comparison (FVC): unknown offender vs. known suspect
  3. 1. Forensic voice comparison (FVC): prosecution (same speaker) vs. defence (different speakers). Likelihood ratio: LR = p(E|Hp) / p(E|Hd)
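
The likelihood ratio on this slide compares the probability of the evidence under the prosecution hypothesis (same speaker) with its probability under the defence hypothesis (different speakers). A minimal sketch for a single scalar measurement follows; the Gaussian models, the function name `likelihood_ratio`, and all numbers are hypothetical and are not taken from the talk.

```python
from statistics import NormalDist

def likelihood_ratio(evidence, same_speaker_model, different_speaker_model):
    """LR = p(E | Hp) / p(E | Hd) for one scalar observation."""
    p_hp = same_speaker_model.pdf(evidence)       # prosecution: same speaker
    p_hd = different_speaker_model.pdf(evidence)  # defence: different speakers
    return p_hp / p_hd

# Hypothetical models, for illustration only (not values from the study)
suspect_model = NormalDist(mu=500.0, sigma=20.0)     # narrow: one speaker
population_model = NormalDist(mu=550.0, sigma=60.0)  # broad: many speakers

lr_close = likelihood_ratio(505.0, suspect_model, population_model)
lr_far = likelihood_ratio(650.0, suspect_model, population_model)
```

An LR above 1 supports the same-speaker hypothesis and an LR below 1 supports the different-speaker hypothesis; here `lr_close` falls in the first group and `lr_far` in the second.
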
  4. 1. FVC: Methods of analysis
     1. Linguistic-phonetic
        – componential approach:
          • segmental/ suprasegmental/ temporal/ syntactic/ non-linguistic/ pathological…
          • e.g. hesitations (um)
        – auditory & acoustic analysis
  5. 1. FVC: Methods of analysis
     2. Automatic speaker recognition (ASR)
        – holistic analysis of the entire speech-active portion of the signal
        – signal divided into a series of overlapping frames
          • extract features (e.g. MFCCs) from each frame
        – statistical modelling (GMM-UBM, iVectors…)
  6. 1. FVC: Combining approaches
     • largely developed in isolation
       – considered fundamentally different (??)
       – but ultimate aim is the same…
     • increasing focus on combination of ASR and ling-phon approaches
       – (H)ASR element of NIST (Greenberg et al. 2010)
       – govt labs in UK, Germany and Sweden using combined approach in casework
       – Zhang et al. (2013)
  7. 2. Research questions
     1. How does the performance of the hesitation marker um compare to that of a generic MFCC-based ASR?
     2. Is there an improvement in system performance when fusing um with MFCCs (over the baseline MFCC system)?
     3. Does channel mismatch affect the improvement achieved with the addition of um?
  8. 3. Method
     • DyViS (Nolan et al., 2009)
       – 63 speakers
       – young SSBE (RP) men
     • Two recordings per speaker
       – 15-20 mins/ sample
       – studio and telephone samples
       – Task 1: mock police interview
       – Task 2: telephone conversation with accomplice
  9. 3. Method: Features (linguistic-phonetic)
     • hesitation marker um
       – best system performance in Hughes et al. (2016)
     • features
       – quadratic polynomial coefficients of formant trajectories
       – vowel (V) and nasal (N) durations
       – mean number of tokens per speaker = 38
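
The formant-trajectory features on this slide reduce each trajectory to the three coefficients of a fitted quadratic. A minimal sketch of such a fit is given below. It is not the authors' code: the function name `fit_quadratic`, the normalisation of time to [0, 1], and the nine-point example track are assumptions made for illustration.

```python
def fit_quadratic(values):
    """Least-squares fit of y = c0 + c1*t + c2*t**2, with t normalised to [0, 1].

    Returns [c0, c1, c2].
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three measurement points")
    ts = [i / (n - 1) for i in range(n)]
    # Normal equations A c = b: A[j][k] = sum(t**(j+k)), b[j] = sum(y * t**j)
    a = [[sum(t ** (j + k) for t in ts) for k in range(3)] for j in range(3)]
    b = [sum(y * t ** j for t, y in zip(ts, values)) for j in range(3)]
    # Gaussian elimination with partial pivoting on the 3x3 system
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, 3):
            factor = a[row][col] / a[col][col]
            for k in range(col, 3):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back-substitution
    coeffs = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, 3))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs

# Hypothetical F2 track (Hz) sampled at nine equally spaced points
f2_track = [1450.0, 1440.0, 1425.0, 1405.0, 1380.0, 1350.0, 1315.0, 1275.0, 1230.0]
f2_coeffs = fit_quadratic(f2_track)
```

Under these assumptions, the coefficients for each formant, together with the vowel and nasal durations, would form the feature vector for one um token.
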
  10. 3. Method: Features (automatic)
      • speech-active portion of the signal extracted
        – removal of overlapping speech/ interlocutor/ background noise/ clipping/ silences (VAD)
      • signal divided into 20 ms frames shifted at 10 ms intervals
        – MFCC feature vector extracted from each frame
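
The framing step on this slide (20 ms frames shifted at 10 ms intervals, so adjacent frames overlap by half) can be sketched as follows. The function name `frame_signal` and the choice to drop a trailing partial frame are assumptions; MFCC extraction itself is not shown.

```python
def frame_signal(samples, sample_rate, frame_ms=20, shift_ms=10):
    """Split a signal into overlapping frames.

    Each frame is frame_ms long and starts shift_ms after the previous one.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

# Toy signal: 1000 samples at a hypothetical 10 kHz sampling rate (0.1 s)
toy_signal = list(range(1000))
toy_frames = frame_signal(toy_signal, sample_rate=10000)
```

One feature vector would then be extracted from each element of `toy_frames`.
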
  11. 3. Method: Feature-to-score stage
      • speakers divided into sets:
        – training (20 speakers)
        – test (20 speakers)
        – reference (23 speakers)
      • same-speaker (SS) and different-speaker (DS) LRs computed
        – um: MVKD (Aitken & Lucy 2004)
        – MFCC: GMM-UBM (w. MAP adaptation)
        – Task 1 = suspect/ Task 2 = offender
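
The partition of the 63 speakers into training, test and reference sets, repeated over 20 random replications as the deck describes, might look like the sketch below. The function name `split_speakers`, the integer speaker IDs and the seeding scheme are assumptions; the slides do not say how the random sets were drawn.

```python
import random

def split_speakers(speaker_ids, n_train=20, n_test=20, seed=None):
    """Randomly partition speakers into training, test and reference sets.

    Whatever remains after the training and test sets are drawn becomes the
    reference set (23 speakers when 63 are supplied).
    """
    rng = random.Random(seed)
    shuffled = list(speaker_ids)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    reference = shuffled[n_train + n_test:]
    return train, test, reference

# 20 replications over hypothetical speaker IDs 1..63, one seed per replication
replications = [split_speakers(range(1, 64), seed=i) for i in range(20)]
```
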
  12. 3. Method: Score-to-LR stage
      • logistic regression calibration/fusion:
        – applied separately for individual and combined systems
      • system validity:
        – Equal error rate (EER)
        – Log LR Cost Function (Cllr; Brümmer & du Preez 2006)
      • replicated 20 times using random sets of speakers
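
The two validity metrics named on this slide can be computed from sets of same-speaker (SS) and different-speaker (DS) likelihood ratios. The sketch below uses the standard Cllr definition and a simple threshold sweep to approximate the EER. The function names and the sweep over observed values are assumptions of this sketch rather than the authors' implementation, and `eer` returns a proportion, not a percentage.

```python
import math

def cllr(ss_lrs, ds_lrs):
    """Log LR cost: penalises SS comparisons with low LRs and DS comparisons with high LRs."""
    ss_cost = sum(math.log2(1 + 1 / lr) for lr in ss_lrs) / len(ss_lrs)
    ds_cost = sum(math.log2(1 + lr) for lr in ds_lrs) / len(ds_lrs)
    return 0.5 * (ss_cost + ds_cost)

def eer(ss_lrs, ds_lrs):
    """Approximate equal error rate.

    Sweeps the decision threshold over the observed LRs and returns the mean
    of the miss and false-alarm rates at the point where they are closest.
    """
    best_gap, best_rate = None, None
    for threshold in sorted(set(ss_lrs) | set(ds_lrs)):
        miss = sum(lr < threshold for lr in ss_lrs) / len(ss_lrs)
        false_alarm = sum(lr >= threshold for lr in ds_lrs) / len(ds_lrs)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (miss + false_alarm) / 2
    return best_rate

# Hypothetical LRs, for illustration only (not results from the study)
example_ss = [0.5, 2.0, 4.0, 8.0]
example_ds = [0.1, 0.2, 0.4, 1.0]
example_cllr = cllr(example_ss, example_ds)
example_eer = eer(example_ss, example_ds)
```

A system whose LRs are all equal to 1 carries no information and scores a Cllr of exactly 1; values closer to 0 indicate better validity.
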
  13. 4. Results: Experiment 1. Task 1: suspect (HQ) vs. Task 2: offender (HQ)
  14. 4. Results: Experiment 1 (input)
      • um:
        – quadratic polynomial coefficients
        – F1, F2, and F3 trajectories
      • MFCC:
        – 16 MFCCs/ 16 ∆s/ 16 ∆∆s
        – 0-5000 Hz range
  15. 4. Results: Experiment 1, individual systems. [Figure: Log LR Cost (Cllr) and EER (%) for each of the 20 replications; panels: MFCC, um.] MFCC: mean Cllr = 0.144, mean EER = 2.57%. um: mean Cllr = 0.261, mean EER = 4.83%.
  16. 4. Results: Experiment 1, fused systems. [Figure: Cllr and EER.]
  17. 4. Results: Experiment 2. Task 1: suspect (HQ) vs. Task 2: offender (landline telephone)
  18. 4. Results: Experiment 2 (input)
      • um:
        – quadratic polynomial coefficients
        – F2 and F3 trajectories
      • MFCC:
        – 16 MFCCs/ 16 ∆s/ 16 ∆∆s
        – 300-3400 Hz range
  19. 4. Results: Experiment 2
      Individual systems:
        System   EER (%)   Cllr
        um       5.13      0.2448
        MFCC     0         0.0034
      Fused system: EER = 0%, Cllr = 0.0031
  20. 5. Discussion RQ(1): um vs. MFCCs
      • MFCCs better than um
        – but… only by c. 2.2% EER and 0.12 Cllr
      • extremely promising results for um
        – MFCC analysis = entire signal/ um analysis = small portion of the signal
        – performance similar to that in Hughes et al. (2016)
        – ∴ fairly robust to sources of within-speaker variability
  21. 5. Discussion RQ(2): baseline vs. fused systems
      • Experiment 1
        – improvement in EER and Cllr across almost all 20 replications
        – marked improvement in some cases
      • Experiment 2
        – v. small improvement in Cllr
        – already at 0% EER
  22. 5. Discussion RQ(3): effect of channel mismatch
      • improvement in system validity = considerably less for channel mismatch data
        – consistent with Zhang et al. (2013)
      • but…
        – still v. good performance of um in isolation
        – F2 + F3 carriers of speaker specificity
        – doesn’t mean phonetic information isn’t useful
  23. 6. Conclusions
      • more work needed at the intersection of ASR and ling-phon FVC
        – MFCC analysis of um
        – fusion with state-of-the-art system (i.e. iVectors)
      • important not to see ASR and ling-phon as fundamentally opposed
        – both have pros and cons
        – tools in the toolkit
  24. Thanks! Questions?