
To uuuhhh is human, for guilt divine

Foulkes, P. and Hughes, V. (2016) To uuuhhh is human, for guilt divine. New Zealand Institute of Language, Brain and Behaviour (NZILBB) Seminar, University of Canterbury, Christchurch, NZ. 1 February 2016. (INVITED TALK)


  1. To uuuhhh is human, for guilt divine: Integrating linguistic-phonetic and automatic methods in forensic voice comparison. Paul Foulkes & Vincent Hughes
  2. Overview § this study: § hesitation markers (HES) as variables for forensic analysis § uh (~ er), um (~ erm) § how well do the phonetic properties of HES discriminate between speakers?
  3. Overview § context: voice as a biometric § voice is a marker of human identity § but it’s an imperfect biometric § within-speaker variation; no feature permanent or unchanging (cf. DNA or fingerprints); technical effects; health & ageing… § no voiceprint (despite CSI-type claims)
  4. Overview § forensic analysis of voice since 1960s § in UK – phoneticians & linguists § in US – engineers § largely separate fields of research & case practice § in UK/Europe/Australia – forensic phonetics § in US – ASR (automatic speaker recognition)
  5. Overview § Voice & Identity project, 2015-2018, UK AHRC § Foulkes, French, Harrison, Hughes, San Segundo § aim: integrating ASR & forensic phonetics § compare results on same data § assess scope for complementary application
  6. 0. OVERVIEW 1. forensic voice analysis 2. data 3. comparison of ASR and acoustics 4. discussion
  7. 1. Forensic voice analysis: known suspect vs. unknown offender
  8. 1.1 phonetic methods § phonetic-linguistic § componential § vowels, consonants, f0, VQ, syntax… § mainly standard analytic methods § formants, durations, f0 range & mean…
  9. 1.2 ASR § multiple options § ~ holistic analysis of entire speech signal § commercial ASR (e.g. Agnitio, Nuance) § stand-alone software with integrated functions § manual ASR § the same approach – a bit more work, but cheaper…
  10. 1.2 ASR § signal divided into frames (e.g. 10 ms) § extract features from each frame § standard approach: MFCC § Mel freq. cepstral coefficients § Mel scale = models the human auditory system § cepstrum = inverse Fourier transform of the log power spectrum § in theory, decouples source and filter information § leaves only supralaryngeal information
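To make the framewise feature extraction above concrete, here is a minimal sketch using librosa; the choice of library, the 25 ms window, the 10 ms hop and the 13 coefficients are our illustrative assumptions, not necessarily the settings used by the systems in the talk.
```python
import librosa

y, sr = librosa.load("recording.wav", sr=None)   # hypothetical audio file

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                    # 13 cepstral coefficients per frame
    n_fft=int(0.025 * sr),        # 25 ms analysis window
    hop_length=int(0.010 * sr),   # 10 ms frame step, as on the slide
)
# mfcc has shape (13, n_frames): one feature vector per 10 ms step
```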
  11. 1.3 ASR & phonetics § ASR & phonetic methods § different history § relatively little interaction § but same basic agenda § find a way of extracting and analysing features of the speech signal that best separate speakers from each other
  12. 1.3 ASR & phonetics § pros and cons § improve forensic voice analysis via combination?
      criterion                            phonetic   ASR
      mapping to concrete entities         ✓          (✗)
      explainable in court                 ✓          ✗
      time & effort                        ✗          ✓
      robust to channel                    (✓)        ✗
      objectivity                          (✗)        ✓
      error rates calculable               ✗          ✓
      works with limited/poor materials    (✓)        (✗)
  13. 1.3 ASR & phonetics § good (phonetic) variables (Nolan 1983: 11) § high between-speaker variation § low within-speaker variability § availability § measurability § robustness § technical factors (e.g. telephone) § syntagmatic factors (e.g. coarticulation) § disguise/mimicry/health/intoxicants…
  14. 1.3 ASR & phonetics § hesitations – uh, um § high between-sp. variation ✓ (Künzel 1997) § low within-sp. variability ✓ § syntagmatic factors ✓ (isolated, long) § availability ✓ (~ 3.7/min; Tschäpe 2005) § measurability ✓ ([əː, əːm]) § robustness § telephone ✓ ([əː, əːm]) § disguise/mimicry ✓ (unconscious)
  15. 0. OVERVIEW 1. forensic voice analysis 2. data 3. comparison of ASR and acoustics 4. discussion
  16. 2.1 Data: corpus § DyViS corpus (Dynamic Variability in Speech) § Nolan et al. (2009) § 100 young RP men (Cambridge students) § Task 1: simulated police interview (~ 20 min) § Task 2: phone call with ‘accomplice’ (~ 15 min) § near end (studio quality) § far end (telephone transmission)
  17. 2.1 Data: corpus § DyViS sample § 60 young RP men § police interview materials (suspect) § phone calls (offender) § various pre-processing steps taken in Voice & Identity § i.e. not original DyViS files
  18. 2.2 Data: overall approach § stage 1: extraction & initial exploration of data § stage 2: testing of data against a reference (background population) § how typical are the data for this population? § assessed via likelihood ratios (LRs) § stage 3: calibrate the LRs to improve analyses on the actual test data § stage 4: calculate LRs for test data, & assess error rates
  19. 2.3 Data: acoustic measures § formant analysis of the vowel § measurements at 10% intervals across the vowel § quadratic fits to the formant trajectories (3 terms)
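As a concrete reading of “quadratics (3 terms)”: each formant trajectory, sampled across the vowel, is reduced to the three coefficients of a fitted quadratic. A minimal sketch with an invented F2 trajectory (the values, the 10%–90% sampling and the use of numpy.polyfit are our assumptions):
```python
import numpy as np

# Invented F2 trajectory for one um token, measured at 10% intervals (10%-90%)
t  = np.linspace(0.1, 0.9, 9)                  # normalised time points
f2 = np.array([1450, 1470, 1500, 1520, 1530,
               1525, 1510, 1490, 1460])        # Hz

# Quadratic fit: the 3 coefficients summarise the whole trajectory
a, b, c = np.polyfit(t, f2, deg=2)
print(a, b, c)   # these become the 'dynamic' input features for one formant
```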
  20. 2.3.1 Data: raw overall § vowel midpoints § [Figure: F1–F2 midpoints for UH and UM plotted against FLEECE, GOOSE, TRAP and NORTH reference vowels]
  21. 2.3.2 Data: raw example Ss § vowel midpoints + s.d. § [Figure: F1–F2 midpoints ± s.d. for UH and UM for individual example speakers, with FLEECE, GOOSE, TRAP and NORTH for reference]
  22. 2.3.2 Data: duration § [Figure: durations (ms) of UH, UM nasal (N) and UM vowel (V); UM vowel vs. nasal duration correlated at r = 0.330]
  23. 2.3.3 Data: acoustic tests § input features for initial tests § i.e. to assess which dimensions best discriminate Ss § uh § F1, F2 & F3 midpoints (+50% step) § F1, F2 & F3 quadratic coefficients § vowel duration § um § as above + nasal duration
  24. 2.4 Data: testing approach § calculate log likelihood ratios (LRs) § expresses value for similarity of samples relative to general population § simplified example § take 2 samples, e.g. of speech (ASR) or phonetic feature § measure variable in sample § compare against each other & population at large § 20 speakers used as ‘reference population’
  25. 2.4 Data: analysis § example: F2 § offender: § F2 @ 1600 Hz § suspect sample: § p = 0.0035 of F2 @ 1600 Hz § reference (background) model: § p = 0.0018 of F2 @ 1600 Hz
  26. 2.4 Data: analysis § example: F2 § LR = 0.0035 / 0.0018 ≈ 1.94 § data ~ 2x as likely if from the suspect than from a random member of the population § LR > 1: samples classed as Same Speaker (SS) § LR < 1: samples classed as Different Speaker (DS)
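The worked example as arithmetic; in a real analysis the two densities come from the suspect model and the reference-population model (MVKD or GMM-UBM), but here they are simply the slide’s numbers:
```python
# The slide's numbers plugged in directly
p_suspect   = 0.0035   # density of F2 = 1600 Hz under the suspect model
p_reference = 0.0018   # density of F2 = 1600 Hz under the background model

lr = p_suspect / p_reference
print(round(lr, 2))    # 1.94: evidence ~2x more likely under 'same speaker'
```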
  27. 2.4 Data: analysis § paired sample tests (Same/Different speaker) § recordings split in two; 8 HES per S per half § 20 SS, 190 DS
      ‘suspect’   ‘offender’   classification
      1           1            SS
      1           2            DS
      1           n            DS
      …           …            …
      2           2            SS
      2           3            DS
      n           n / n′       SS / DS
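The 20 same-speaker and 190 different-speaker comparisons follow directly from this pairing scheme; a small sketch of the combinatorics (speaker IDs are placeholders):
```python
from itertools import combinations

speakers = range(1, 21)                        # 20 speakers, recordings split in two

ss_pairs = [(s, s) for s in speakers]          # half 1 vs half 2 of the same speaker
ds_pairs = list(combinations(speakers, 2))     # every cross-speaker pairing

print(len(ss_pairs), len(ds_pairs))            # 20 same-speaker, 190 different-speaker
```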
  28. 2.4 Data: analysis § calculating errors § false hit: Different S pairs classed as Same S § miss: Same S pairs classed as Different S
  29. 2.4 Data: analysis § binary correct/incorrect § adjustable: system could accept all or reject all § EER = equal error rate, the point at which false hits & misses balance § Cllr (log-likelihood-ratio cost) § quantifies magnitude of error § 0 is good, > 1 is bad!
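A minimal sketch of how EER and Cllr could be computed from sets of same-speaker and different-speaker log LRs; the function names and the base-10 log-LR convention are our assumptions, not the authors’ code:
```python
import numpy as np

def cllr(ss_log10_lrs, ds_log10_lrs):
    # log-likelihood-ratio cost: 0 = perfect; values around/above 1 mean the
    # LRs carry no useful (or even misleading) information
    ss_lr = 10.0 ** np.asarray(ss_log10_lrs, dtype=float)
    ds_lr = 10.0 ** np.asarray(ds_log10_lrs, dtype=float)
    return 0.5 * (np.mean(np.log2(1 + 1 / ss_lr)) + np.mean(np.log2(1 + ds_lr)))

def eer(ss_scores, ds_scores):
    # equal error rate: sweep a threshold until misses and false hits balance
    ss = np.asarray(ss_scores, dtype=float)
    ds = np.asarray(ds_scores, dtype=float)
    thresholds = np.sort(np.concatenate([ss, ds]))
    miss      = np.array([(ss <  t).mean() for t in thresholds])
    false_hit = np.array([(ds >= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(miss - false_hit)))
    return (miss[i] + false_hit[i]) / 2
```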
  30. 2.5 test results § uh § best results § F1+F2+F3 § + duration § midpoints § EER = 5.9% § Cllr = 0.23 § [Figure: EER (%) and Cllr for uh by formant combination (F1 … F1~F3), midpoints vs. quadratics, with/without vowel duration]
  31. 2.5 test results § um § best results § F1+F2+F3 § + V duration § + N duration § dynamics § EER = 4.1% § Cllr = 0.12 § um performs better than uh § [Figure: EER (%) and Cllr for um by formant combination (F1 … F1~F3), midpoints vs. quadratics, with/without V and/or N duration]
  32. 2.5 test results § initial testing revealed: § um better than uh § best results with dynamic data (quadratic trajectories) § F1 + F2 + F3 + V duration + N duration § thus subsequent analysis uses data for um to compare against ASR
  33. 0. OVERVIEW 1. forensic voice analysis 2. data 3. comparison of ASR and acoustics 4. discussion
  34. 3.1 Data § for ASR vs. um § compare DyViS Tasks 1 (‘suspect’) and 2 (‘offender’) § 73 speakers § ASR: whole recordings § um: 14 tokens per S. per recording (28 total) § LRs calculated using: § MVKD (Aitken and Lucy, 2004) § GMM-UBM (Reynolds et al., 2000)
  35. 3.2 ASR § reference models (background population) § many difficulties with this (Hughes 2014 etc) § two main options: § GMM-UBM: industry standard § iVectors: new state-of-the-art development § both are ways of quantifying information about, and distances between, individuals
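For intuition only, a much-simplified sketch of the GMM-UBM idea with scikit-learn: a background model trained on pooled population frames is compared, frame by frame, with a model of the known speaker. A real system would MAP-adapt the speaker model from the UBM and use far more data; all feature matrices and mixture sizes below are placeholders.
```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder MFCC matrices (frames x coefficients); in practice these come
# from the framewise feature extraction sketched earlier
background_feats = rng.standard_normal((5000, 13))   # pooled reference population
suspect_feats    = rng.standard_normal((1000, 13))   # known speaker
offender_feats   = rng.standard_normal((800, 13))    # questioned sample

ubm = GaussianMixture(n_components=32, covariance_type="diag",
                      random_state=0).fit(background_feats)
spk = GaussianMixture(n_components=32, covariance_type="diag",
                      random_state=0).fit(suspect_feats)

# score() is the mean per-frame log-likelihood, so the difference is a mean
# per-frame log-likelihood ratio: speaker model vs. universal background model
llr = spk.score(offender_feats) - ubm.score(offender_feats)
print(llr)
```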
  36. 3.2.1 Results: ASR § (almost) perfect…
  37. 3.2.2 ASR + acoustics § so why push phonetic analysis? § GMM-UBM does make errors (3% versus 4% for um) § materials too easy & forensically unrealistic § studio quality § no channel mismatch § fairly long…
  38. 3.3 Results: um § not so perfect… but good § um: F1+F2+F3 dynamics + durations, MVKD § EER = 7.17% § Cllr = 0.338 § [Figure: distributions of LRs for same-speaker (SS) and different-speaker (DS) pairs]
  39. 3.4 Results: ASR + acoustics § correlations of LRs – none, or negative § ∴ potential improvement in ASR with inclusion of um
  40. 3.4 Results: ASR + acoustics § combining the ASR and acoustic data § fusion § mathematical combination of variables § takes account of correlation between variables § not without difficulty, especially with phonetic data (multiple correlations etc; Gold & Hughes 2014)
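Logistic-regression fusion is a standard way of doing this kind of mathematical combination; whether exactly that method was used here is our assumption. A minimal sketch with placeholder scores (in practice the fusion weights are trained on separate calibration data, not on the test scores themselves):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder log LRs for 20 same-speaker and 190 different-speaker comparisons,
# one column per system (ASR, um)
asr_llr = np.concatenate([rng.normal( 2.0, 1.0, 20), rng.normal(-2.0, 1.0, 190)])
um_llr  = np.concatenate([rng.normal( 1.0, 1.0, 20), rng.normal(-1.0, 1.0, 190)])
labels  = np.concatenate([np.ones(20), np.zeros(190)])   # 1 = SS pair, 0 = DS pair

scores = np.column_stack([asr_llr, um_llr])
fuser = LogisticRegression().fit(scores, labels)

# The linear score w.x + b acts as a fused log LR (up to a prior offset); the
# learned weights reflect how much each system contributes once correlation
# between the systems is taken into account
fused_llr = fuser.decision_function(scores)
```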
  41. 3.4 Results: ASR + acoustics § fusing results for ASR with um § improvement in EER and Cllr
                           Single feature        Fused ASR + um
                           EER (%)    Cllr       EER (%)    Cllr
      MFCC / iVector       0          0.003      0          0.002
      MFCC / GMM-UBM       3          0.097      0.5        0.058
      um / MVKD            7.17       0.338
      (improvements with fusion: ↓ 88%, ↓ 40%, ↓ 29%)
  42. 0. OVERVIEW 1. forensic voice analysis 2. data 3. comparison of ASR and acoustics 4. discussion
  43. 4.1 summary § um, uh work pretty well as solo variables for speaker discrimination § best: EER 7.17%, Cllr 0.338 for um § as predicted § very good relative to other studies of single linguistic variables (for assessment of guilt, divine) § ASR copes even better with this data set § 0% error with iVectors
  44. 4.1 summary § but no correlations between outputs for ASR & acoustic data § improvement through fusing systems § stronger correct LRs = better outcome § …and thus more robust evidence in court
  45. 4.2 discussion § bear in mind we’re comparing analysis of entire speech sample (ASR) versus one phonetic variable § dozens of other phonetic variables available § vocal tract configuration (static) § dynamic implementation of speech strings § temporal information § abstract mathematical representations of the spectrum at static points § dynamics across these points
  46. 4.2 discussion § lack of correlation implies ASR & phonetics provide different types of speaker-specific information § clear from differences in error profiles § see also Gonzalez-Rodriguez et al. (2014) § so far the focus has just been on the supralaryngeal vocal tract output § potential benefit from looking at the larynx § Voice and Identity grant
  47. 4.2 discussion § increasing moves toward integration of ASR and forensic phonetics § evidence shows improvement via combination § in principle underlying methods are the same § features & methods of analysis § error rates, comparing systems, combination... § view both ASR and phonetic analysis as tools in the FVC toolkit!
  48. thanks, kia ora, ta § questions?
  49. To uuuhhh is human, for guilt divine: Integrating linguistic-phonetic and automatic methods in forensic voice comparison. Paul Foulkes & Vincent Hughes
