
Improving on the state-of-the-art: the contribution of linguistics and phonetics to automatic speaker recognition


Hughes, V. (2016) Improving on the state-of-the-art: the contribution of linguistics and phonetics to automatic speaker recognition. Department of Language and Linguistic Science Colloquium, University of York, UK. 20 January 2016. (INVITED TALK)



  1. 1. Improving on the state-of-the-art Integrating linguistic-phonetic and automatic methods in forensic voice comparison Vincent Hughes Department of Language and Linguistic Science Colloquium 20th January 2016
  2. 2. In this talk… • forensic voice comparison – methods of analysis: linguistic-phonetic & ASR • ASR: state-of-the-art… • linguistic-phonetic: – hesitation markers (uhs and ums) • integrating methods • discussion – what does this tell us about speaker identity? 2
  3. 3. 1. Background forensic voice comparison 3 known suspect vs. unknown offender
  4. 4. 1. Background (1) linguistic-phonetic: – componential approach: • segmental/ suprasegmental/ temporal/ syntactic/ non-linguistic/ pathological… – auditory & acoustic analysis • focus here on acoustics • yield empirical data = quantification of strength of evidence 4
  5. 5. 1. Background (2) automatic speaker recognition (ASR): – holistic analysis of entire speech signal • …but not always – signal divided into frames (10ms) • extract features (e.g. MFCCs) from each frame i. commercial ASR (e.g. Agnitio, Nuance) • stand-alone software with integrated functions • GUI ii. manual (??) ASR • the same – bit more work, cheaper… 5
  6. 6. 1. Background • methods developed in isolation • considered fundamentally different – the approaches look different • ASR = technology/ automation • ling-phon = human expert – commercial ASR? – CSI effect? – lack of understanding? – terminology? 6
  7. 7. 1. Background 7 • maybe they’re not so different after all… • fundamental aim: find a way of extracting and analysing features of the speech signal that best separate speakers from each other (a) features (b) analysis
  8. 8. 1. Background (a) features (extraction) – high between-speaker variation – low within-speaker variability – resistance to disguise/mimicry – availability – robust to technical factors (e.g. telephone) – measurability from Nolan (1983: 11) 8
  9. 9. 1. Background (b) analysis (evaluation) – probabilistic reasoning about strength of evidence • ‘reasoning under uncertainty’ • based on judgements of similarity and typicality – both formal and informal: • statistical models → using data • existing literature • experience – it’s all probability! 9
  10. 10. 1. Background – linguistic-phonetic 10
  ✓ well understood principles of linguistics/phonetics
  ✓ more robust to channel mismatch/low quality recordings
  ✓ possible to make judgements on limited data
  ✓ explainable to courts
  ✗ time consuming/labour intensive
  ✗ difficult to establish error rates for auditory analysis
  ✗ humans are ‘black boxes’ too!
  ✗ subjectivity
  11. 11. 1. Background – ASR 11
  ✓ efficient; easy to process masses of data
  ✓ viewed as more scientific/objective
  ✓ perform extremely well under certain conditions
  ✗ ‘black box’; error rates difficult to explain to courts
  ✗ poorer performance under real forensic conditions
  ✗ require relatively long samples and large amounts of background data
  12. 12. 2. Questions • what is the state-of-the-art in ASR, and how does it perform? (used as a baseline…) • what is the best performance with a linguistic-phonetic system? • can we improve ASR by adding linguistic-phonetic info? 12
  13. 13. 3. Data • DyViS (Nolan et al., 2009) – 100 speakers – males – 18-25 years old – Cambridge University educated – recorded in 2006-2007 – collected for forensic phonetic research 13
  14. 14. 3. Data • Task 1: mock police interview – “cognitive conflict” – studio quality (44.1 kHz, 16-bit depth) – duration = c. 20 mins • Task 2: telephone conversation w. accomplice – information exchange – (semi) non-contemporaneous – studio quality (44.1 kHz, 16-bit depth) – duration = c. 15 mins 14
  15. 15. 3. Data • Task 1: mock police interview – “cognitive conflict” – studio quality (44.1 kHz, 16-bit depth) – duration = c. 20 mins • Task 2: telephone conversation w. accomplice – information exchange – (semi) non-contemporaneous – studio quality (44.1 kHz, 16-bit depth) – duration = c. 15 mins 15 plus… numerous steps to pre- process recordings
  16. 16. 4. Method (1) data extraction – extraction of empirical data for target variables (2) speakers divided into three sets 16 training test reference
  17. 17. 4. Method (3) calculate some likelihood ratios (LRs) – LR = value representing strength of evidence (similarity relative to typicality) – typicality estimated based on a sample of the (relevant) population (reference data) – numerical value calculated using a variety of formulae – natural log applied 17
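For reference, a standard formulation of the likelihood ratio sketched on this slide (similarity in the numerator, typicality in the denominator); the notation below is generic rather than taken from the slides:

```latex
\mathrm{LR} \;=\; \frac{p(E \mid H_{\mathrm{ss}})}{p(E \mid H_{\mathrm{ds}})}
\;\approx\; \frac{\text{similarity of suspect and offender samples}}{\text{typicality against the reference population}},
\qquad \mathrm{LLR} = \ln(\mathrm{LR})
```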
  18. 18. 4. Method 18 [Diagram: training, test and reference sets; same-speaker (SS) and different-speaker (DS) comparisons between suspect and offender recordings: 1 vs. 1, 1 vs. 2, 1 vs. 3, 1 vs. 4…]
  19. 19. 4. Method (4) calibrate your LRs – improve your test LRs based on your training LRs – logistic regression (Brümmer & du Preez, 2006) (5) calculate error rates – Equal error rate (EER) • % false hits (DS LLR > 0) = % misses (SS LLR < 0) – Log LR Cost function (Cllr) • penalty for high magnitude errors 19 [LLR scale: < 0 (−) supports the defence; 0 = no evidence; > 0 (+) supports the prosecution]
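A minimal sketch, not the author's code, of the evaluation steps on this slide: logistic-regression calibration in the spirit of Brümmer & du Preez (2006), then EER and Cllr computed from natural-log LRs. It assumes numpy and scikit-learn; all array and function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate(train_scores, train_labels, test_scores):
    """Logistic-regression calibration: map raw comparison scores to calibrated
    natural-log LRs. train_labels: 1 = same-speaker (SS), 0 = different-speaker (DS)."""
    reg = LogisticRegression()
    reg.fit(np.asarray(train_scores).reshape(-1, 1), train_labels)
    # weight * score + bias gives the calibrated log-odds; with equal priors
    # this is used directly as the log likelihood ratio (LLR)
    return reg.coef_[0, 0] * np.asarray(test_scores) + reg.intercept_[0]

def eer(ss_llrs, ds_llrs):
    """Equal error rate: point where % misses (SS LLR < t) = % false hits (DS LLR > t)."""
    thresholds = np.sort(np.concatenate([ss_llrs, ds_llrs]))
    miss = np.array([np.mean(ss_llrs < t) for t in thresholds])
    false_hit = np.array([np.mean(ds_llrs > t) for t in thresholds])
    i = np.argmin(np.abs(miss - false_hit))
    return 100 * (miss[i] + false_hit[i]) / 2

def cllr(ss_llrs, ds_llrs):
    """Log LR cost (Cllr): penalises high-magnitude errors; LLRs are natural logs."""
    ss_lr, ds_lr = np.exp(ss_llrs), np.exp(ds_llrs)
    return 0.5 * (np.mean(np.log2(1 + 1 / ss_lr)) + np.mean(np.log2(1 + ds_lr)))
```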
  20. 20. 5. ASR: features (a) comparison of features – subset of data: 73 speakers (see later…) • 25 training/ 25 test/ 23 reference • DyViS Tasks 1 (suspect) and 2 (offender) – 25 SS/ 300 DS comparisons 20
  25. 25. 5. ASR: features (a) comparison of features – subset of data: 73 speakers (see later…) • 25 training/ 25 test/ 23 reference • DyViS Tasks 1 (suspect) and 2 (offender) – 25 SS/ 300 DS comparisons 25 (20 ms frames, 10 ms shift, 50% overlap)
  26. 26. 5. ASR: features 26 Computed using GMM-UBM
  27. 27. 5. ASR: features 27 MFCCs: EER = 3%, Cllr = 0.097 (computed using GMM-UBM)
  28. 28. 5. ASR: features so what are MFCCs? • representation of the power spectrum – Mel scale = human perceptual system • cepstrum: – inverse transform of the log power spectrum – decouples source and filter information • leaving supralaryngeal information, in theory 28
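A minimal sketch of MFCC (plus delta and delta-delta) extraction along the lines described above, assuming the librosa library; the file name and parameter values (13 coefficients, 20 ms window, 10 ms shift) are illustrative and not necessarily the settings used in the study.

```python
import librosa

y, sr = librosa.load("recording.wav", sr=None)   # hypothetical pre-processed recording
win = int(0.020 * sr)                            # 20 ms analysis frames
hop = int(0.010 * sr)                            # 10 ms shift (50% overlap)

# 13 Mel-frequency cepstral coefficients per frame
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                             n_fft=win, win_length=win, hop_length=hop)

# delta (velocity) and delta-delta (acceleration) coefficients,
# as appended for the MFCC + ∆s + ∆∆s system
deltas = librosa.feature.delta(mfccs)
delta_deltas = librosa.feature.delta(mfccs, order=2)
```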
  29. 29. 5. ASR: analysis (b) method of analysis – GMM-UBM = standard in ASR
  30. 30. 5. ASR: analysis (b) method of analysis – GMM-UBM = standard in ASR – iVectors = state-of-the-art:
  31. 31. 5. ASR: analysis (b) method of analysis – GMM-UBM = standard in ASR – iVectors = state-of-the-art: • total variability (TV) subspace learned from ref data • used to create low-dimensional identity vectors (iVectors) • dimensionality reduced again through LDA • iVectors modelled with PLDA to calculate LRs
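A deliberately simplified sketch of GMM-UBM style scoring (cf. Reynolds et al., 2000) using scikit-learn, not the system reported in the talk: full MAP adaptation of the UBM means is replaced by a warm-started re-fit, and the iVector/LDA/PLDA pipeline is not shown. Feature arrays are hypothetical frame-by-dimension MFCC matrices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(reference_frames, n_components=64):
    """Universal background model: a GMM trained on pooled reference-speaker frames."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(reference_frames)
    return ubm

def adapt_speaker_model(ubm, suspect_frames, n_iter=3):
    """Crude stand-in for MAP adaptation: a few EM iterations warm-started from the UBM."""
    model = GaussianMixture(n_components=ubm.n_components, covariance_type="diag",
                            max_iter=n_iter,
                            weights_init=ubm.weights_,
                            means_init=ubm.means_,
                            precisions_init=ubm.precisions_)
    model.fit(suspect_frames)
    return model

def gmm_ubm_score(speaker_model, ubm, offender_frames):
    """Average frame-level log-likelihood ratio: suspect model vs. background model."""
    return float(np.mean(speaker_model.score_samples(offender_frames)
                         - ubm.score_samples(offender_frames)))
```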
  32. 32. 5. ASR: analysis
  33. 33. 5. ASR: analysis MFCC + ∆s + ∆∆s, iVectors: EER = 0%, Cllr = 0.0028
  34. 34. 6. Linguistic-phonetic: features (a) comparison of features – 20 speakers (c. 20 tokens per speaker) – DyViS Task 1 – range of vowels – F1, F2, & F3 midpoints – MVKD (Aitken & Lucy, 2004) • standard in ling-phon FVC • suspect data = normal • ref data = kernel density 34
  35. 35. 6. Linguistic-phonetic: features 35
  Feature | Cllr | EER (%)
  DRESS | 0.55 | 19.6
  STRUT | 0.69 | 19.7
  TRAP | 0.77 | 20.7
  KIT | 0.67 | 20.0
  COT | 0.48 | 15.9
  THOUGHT | 0.84 | 25.1
  FLEECE | 0.47 | 15.1
  GOOSE | 0.94 | 25.0
  UH | 0.48 | 15.7
  UM | 0.33 | 10.1
  Raw data collected by: Simpson (2008), Atkinson (2009), King (2012), Wood (2013) & Hughes (2014)
  37. 37. 6. Linguistic-phonetic: features (b) hesitation (HES) markers – [əː] or [əːm] → typically central(-ish) vowel – good candidates for speaker discrimination • speaker specific (Künzel, 1997; Clark & Fox Tree, 2002) • produced unconsciously • limited coarticulation • occur frequently (c. 3.7/min; Tschäpe et al., 2005) • susceptibility to regional/ social variation? – which features perform best? 37
  38. 38. 6. Linguistic-phonetic: features • data extraction: – 60 speakers • 20-20-20 sets – 16 tokens per speaker – DyViS Task 1 38
  39. 39. 6. Linguistic-phonetic: features • data extraction: – 60 speakers • 20-20-20 sets – 16 tokens per speaker – DyViS Task 1 39 (1)
  40. 40. 6. Linguistic-phonetic • data extraction: – 60 speakers • 20-20-20 sets – 16 tokens per speaker – DyViS Task 1 40 (1) (2)
  41. 41. 6. Linguistic-phonetic • data extraction: – 60 speakers • 20-20-20 sets – 16 tokens per speaker – DyViS Task 1 41 (1) (2) (3)
  42. 42. 6. Linguistic-phonetic: features • input features: – UH: • F1, F2 & F3 midpoints (+50% step) • F1, F2 & F3 quadratic coefficients • vowel duration – UM: • F1, F2 & F3 midpoints (+50% step) • F1, F2 & F3 quadratic coefficients • vowel duration • nasal duration 42
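A minimal sketch of how the input features listed above might be computed for one token: a formant midpoint, quadratic coefficients fitted to the formant trajectory, and duration. The function name and example values are hypothetical, not taken from the study.

```python
import numpy as np

def formant_token_features(formant_track, duration_s):
    """Features for one formant of one token: midpoint value, quadratic coefficients
    fitted over normalised time, and token duration."""
    t = np.linspace(0.0, 1.0, len(formant_track))
    midpoint_hz = formant_track[len(formant_track) // 2]   # value at the 50% step
    quad_coefs = np.polyfit(t, formant_track, deg=2)       # a, b, c of a*t^2 + b*t + c
    return {"midpoint_hz": midpoint_hz,
            "quad_coefs": tuple(quad_coefs),
            "duration_s": duration_s}

# e.g. a hypothetical F1 trajectory (Hz) for the vowel portion of one UM token, 180 ms long
example = formant_token_features(np.array([520., 540., 555., 560., 550.]), duration_s=0.18)
```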
  43. 43. 6. Linguistic-phonetic: features 43 [Figure: F1 × F2 plots of UM and UH tokens against reference vowels FLEECE, GOOSE, TRAP and NORTH]
  44. 44. 6. Linguistic-phonetic: features 44 [Figure: F1 × F2 plots of UM and UH against reference vowels FLEECE, GOOSE, TRAP and NORTH, with labels 23, 48, 111, 30 (UM) and 23, 59, 111, 17 (UH)]
  45. [Figure: Cllr and EER (%) for UH and UM systems by formant set (F1, F2, F3, F1,F2, F2,F3, F1,F3, F1~F3), midpoints vs. quadratic coefficients, with and without durations] – UM < UH – F1~F3 < individual/ combinations – with durations < without durations – UH: midpoints < dynamics – UM: dynamics < midpoints 45
  46. 46. [Figure: as previous slide, with the best-performing system highlighted] – UM < UH – F1~F3 < individual/ combinations – with durations < without durations – UH: midpoints < dynamics – UM: dynamics < midpoints – best: UM, F1~F3 dynamics with durations: EER = 4.08%, Cllr = 0.12 46
  47. 47. 6. Linguistic-phonetic: analysis (c) methods of analysis – same subset of data as for ASR • 73 speakers (27 removed from 100) • 14 tokens per speaker per recording (28 total) • 25 training/ 25 test/ 23 reference • DyViS Tasks 1 (suspect) and 2 (offender) – UM F1~F3 dynamics w. durs – LRs calculated using: • MVKD (Aitken and Lucy, 2004) • GMM-UBM (Reynolds et al., 2000) 47
  48. 48. 6. Linguistic-phonetic: analysis 48
  49. 49. 6. Linguistic-phonetic: analysis 49 UM F1~F3 dynamics with durations, MVKD: EER = 7.17%, Cllr = 0.338
  50. 50. 7. Correlations 50
  51. 51. 7. Correlations where does all of this leave us? • using the same data set… – ASR: • best performance: MFCCs + ∆s + ∆∆s (iVectors) • Cllr = 0.003; EER = 0% • GMM-UBM: Cllr = 0.097; EER = 3% – linguistic-phonetic: • best performance: UM quadratic F1~F3 w/ durs (MVKD) • Cllr = 0.338; EER = 7.17% • …but limited/no correlation in output! 51
  52. 52. 8. Combination • ∴ potential improvement in baseline ASR with inclusion of UM data • issues: – simple multiplication if independent (naïve Bayes) – speech data = complex correlations – logistic-regression fusion • considers correlations in LRs • developed in ASR • not without issues (see Gold & Hughes, 2014, 2015) 52
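A minimal sketch of logistic-regression fusion of two LLR streams (e.g. the ASR and UM systems), in the spirit of the fusion discussed above; the array names and the use of scikit-learn are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_llrs(train_llrs, train_labels, test_llrs):
    """train_llrs / test_llrs: (n_comparisons, n_systems) arrays of per-system LLRs
    (e.g. column 0 = MFCC system, column 1 = UM system);
    train_labels: 1 = same-speaker, 0 = different-speaker."""
    fuser = LogisticRegression()
    fuser.fit(train_llrs, train_labels)
    # fused LLR = weighted sum of the input LLRs plus an offset; the learned weights
    # down-weight redundant (correlated) systems rather than naively multiplying LRs
    return test_llrs @ fuser.coef_[0] + fuser.intercept_[0]
```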
  53. 53. 8. Combination 53
  Single Feature | EER (%) | Cllr
  MFCC – iVector | 0 | 0.003
  MFCC – GMM-UBM | 3 | 0.097
  UM – MVKD | 7.17 | 0.338
  [Figure panels: iVector + HES; GMM-UBM + HES]
  54. 54. 8. Combination 54
  Single Feature | EER (%) | Cllr
  MFCC – iVector | 0 | 0.003
  MFCC – GMM-UBM | 3 | 0.097
  UM – MVKD | 7.17 | 0.338
  Fused | EER (%) | Cllr
  MFCC (iV) + UM | 0 | 0.002
  MFCC (GMM) + UM | 0.5 | 0.058
  [Figure panels: iVector + HES; GMM-UBM + HES]
  55. 55. 9. Discussion • ASR: iVectors = 0% errors – forensically realistic data? • acoustics of UM are very useful for speaker discrimination – EER = 7.17% (+ low magnitude errors) – less data! • improvement when combined… – Cllr: 29% ↓ (iVector + UM) – EER: 88% ↓, Cllr: 41% ↓ (GMM-UBM + UM) 55
  56. 56. 9. Discussion • this is just a limited amount of ling-phon analysis… – Gold & Hughes (2015) • different ways of combining LRs from multiple features • focus on correlations, but… • AR, f0, LTFDs, HES, PRICE, TRAP, GOOSE, THOUGHT, VOT • EER = 0.33%! (based on 324 comparisons) • plenty of scope for further improvement in performance 56
  57. 57. 9. Discussion • domains… – vocal tract configuration (static) – dynamic implementation of speech strings – temporal information – abstract mathematical representations of the spectrum at static points – dynamics across these points 57
  58. 58. 9. Discussion • provide different types of speaker specific information – clear from differences in error profiles – see also Gonzalez-Rodriguez et al. (2014) • so far the focus has just been on the supralaryngeal vocal tract output – potential benefit from looking at the larynx – Voice and Identity grant 58
  59. 59. 10. Conclusions • increasing moves toward integration of ASR and ling-phon • evidence shows improvement in combination • in principle methods are the same: – features & methods of analysis – error rates, comparing systems, combination... • view methods as tools in the FVC toolkit! 59
  60. 60. Thanks! Questions?
