
Effects of variation on the computation of numerical likelihood ratios for forensic voice comparison


Hughes, V. and Foulkes, P. (2012) Effects of variation on the computation of numerical likelihood ratios for forensic voice comparison. Paper presented at the International Association for Forensic Phonetics and Acoustics (IAFPA) Conference, Universidad Internacional Menéndez Pelayo, Santander, 5-8 August 2012.


  1. Effects of variation on the computation of numerical likelihood ratios for forensic voice comparison. Vincent Hughes & Paul Foulkes, Department of Language and Linguistic Science
  2. 1. introduction • Likelihood Ratio (LR) = “logically and legally correct framework” for assessing forensic comparison evidence (Rose & Morrison 2009: 143) • LR = p(E|Hp) / p(E|Hd)
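As a rough illustration of the formula on this slide, the sketch below computes an LR as the ratio of two likelihoods. The Gaussian models and all numerical values are invented for illustration only; they are not taken from the paper, which uses the multivariate kernel density formula described later.

```python
# Toy illustration of LR = p(E|Hp) / p(E|Hd) with invented univariate
# Gaussian models; the real analysis in these slides uses the MVKD formula.
import math
from scipy.stats import norm

x = 310.0  # hypothetical measurement from the disputed sample (e.g. F1 in Hz)

# p(E|Hp): likelihood of the evidence under the prosecution hypothesis
# (modelled here from the known speaker's own data; parameters invented)
p_given_hp = norm.pdf(x, loc=305.0, scale=15.0)

# p(E|Hd): likelihood of the evidence under the defence hypothesis
# (modelled here from the reference population; parameters invented)
p_given_hd = norm.pdf(x, loc=350.0, scale=40.0)

lr = p_given_hp / p_given_hd
print(f"LR = {lr:.2f}, log10 LR = {math.log10(lr):.2f}")
```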
  3. • assessment of similarity of observed features in the criminal and known samples, and their typicality • typicality = dependent on patterns in the relevant population (Aitken & Taroni 2004) – defined by the defence hypothesis – quantified relative to a sampled sub-section of that population (reference data) • [Figure: Hp vs Hd schematic, from Berger (2012)]
  4. • Rose (2004: 4) default Hd: “same-sex speaker(s) of the language” • ‘logical relevance’ (Kaye 2004, 2008) • reference data used in previous studies:
     Study                    Feature        Speech style             N speakers   Age          Language
     Rose et al (2003)        /ɕ/ /o/ /N/    Read                     60           20-50        Japanese
     Rose et al (2006)        /aI/           Read                     166          19-64        Australian English
     Morrison (2008)          /aI/           Read                     27           19-64        Australian English
     Kinoshita et al (2009)   f0             Controlled spontaneous   201          No control   Japanese
  5. • collecting reference data – bespoke case-by-case data – ‘off-the-shelf’ data • inevitable mismatch between the off-the-shelf data and the facts of the case at trial • LRs necessarily vary with different reference data
  6. 2. research questions: to what extent are LRs affected by… i. varying N speakers in the reference data? ii. varying N tokens per speaker in the reference data? iii. dialect mismatch between target voice and reference data?
  7. • verbal scale of Champod and Evett (2000); positive log10 LRs support Hp, negative log10 LRs support Hd:
     Raw LR         Log10 LR   Verbal expression
     >10000         5          Very strong evidence
     1000-10000     4          Strong evidence
     100-1000       3          Moderately strong evidence
     10-100         2          Moderate evidence
     1-10           1          Limited evidence
     1-0.1          -1         Limited evidence
     0.1-0.01       -2         Moderate evidence
     0.01-0.001     -3         Moderately strong evidence
     0.001-0.0001   -4         Strong evidence
     <0.0001        -5         Very strong evidence
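A small helper like the following (an illustrative sketch, not from the slides) maps a log10 LR onto the Champod and Evett (2000) verbal scale in the table above; how exact boundary values are handled is an assumption.

```python
def verbal_expression(log10_lr: float) -> str:
    """Map a log10 LR onto the Champod and Evett (2000) verbal scale."""
    magnitude = abs(log10_lr)
    direction = "Hp (same speaker)" if log10_lr > 0 else "Hd (different speakers)"
    if magnitude > 4:
        strength = "Very strong evidence"
    elif magnitude > 3:
        strength = "Strong evidence"
    elif magnitude > 2:
        strength = "Moderately strong evidence"
    elif magnitude > 1:
        strength = "Moderate evidence"
    else:
        strength = "Limited evidence"
    return f"{strength} in support of {direction}"

print(verbal_expression(2.5))   # Moderately strong evidence in support of Hp (same speaker)
print(verbal_expression(-0.4))  # Limited evidence in support of Hd (different speakers)
```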
  8. 3. method • 1 set of reference data • 4 sets of test data • GOOSE /u:/ • dynamic time-normalised measurements of F1 and F2 (McDougall 2004, 2006)
  9. • reference data: – New Zealand English (NZE) from the Canterbury Corpus (ONZE) – 120 male speakers (born 1932-1987) – min 10 tokens per speaker (coded for context) – auto-generated formant data • test data: – NZE / Manchester / Newcastle / York – 8 male speakers per set (aged 16-31) – 16 tokens per speaker (coded for context)
  10. • why GOOSE /u:/? – not a regional stereotype (Labov 1971) of any of the test set dialects • [Figure: F1 (Hz) against F2 (Hz) for /u:/ in Manchester, Newcastle, York and ONZE]
  11. • data reduction using quadratic polynomials: y = a₀ + a₁x + a₂x² • LR calculated using the Multivariate Kernel Density (MVKD) formula (Aitken and Lucy 2004, Morrison 2007) • accuracy of output assessed using the log-likelihood-ratio cost function (Cllr) (Brümmer and du Preez 2006)
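The data reduction step can be sketched as follows: each time-normalised F1 or F2 trajectory is fitted with a quadratic and represented by its three coefficients, which then feed into the MVKD calculation. The trajectory values below are invented for illustration.

```python
# Sketch of the quadratic data reduction step: a time-normalised formant
# trajectory is reduced to the coefficients of y = a0 + a1*x + a2*x^2.
import numpy as np

# e.g. F2 (Hz) sampled at equidistant points across a /u:/ token (invented values)
time = np.linspace(0.0, 1.0, 9)          # time-normalised 0-1
f2 = np.array([1650, 1700, 1740, 1760, 1770, 1760, 1735, 1700, 1660], dtype=float)

# np.polyfit returns coefficients highest order first: [a2, a1, a0]
a2, a1, a0 = np.polyfit(time, f2, deg=2)
print(f"F2 trajectory reduced to a0={a0:.1f}, a1={a1:.1f}, a2={a2:.1f}")

# Repeating this for F1 and F2 gives six coefficients per token, which are
# then passed to the MVKD likelihood-ratio formula (Aitken and Lucy 2004).
```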
  12. 4. results i. number of reference speakers – test data combined • 32 same-speaker comparisons • 992 different-speaker comparisons – starting with 120 speakers • 10 tokens per speaker – ten speakers removed at a time
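A sketch of how the speaker-reduction runs for experiment (i) could be organised is given below. `compute_log10_lrs` is a hypothetical placeholder for the MVKD comparison stage, and random removal of the ten speakers at each step is an assumption; the slides do not say how speakers were selected for removal.

```python
# Sketch of experiment (i): start from the full 120-speaker reference set,
# remove ten speakers at a time, and recompute same-/different-speaker LRs.
import random
import statistics

def run_speaker_reduction(reference_speakers, test_data, compute_log10_lrs,
                          step=10, min_speakers=10, seed=0):
    rng = random.Random(seed)
    speakers = list(reference_speakers)
    results = {}
    while len(speakers) >= min_speakers:
        # `compute_log10_lrs` is a placeholder for the MVKD calculation
        llrs = compute_log10_lrs(test_data, reference=speakers)
        results[len(speakers)] = (statistics.mean(llrs), statistics.stdev(llrs))
        # remove ten speakers (here: at random) before the next run
        for _ in range(step):
            if speakers:
                speakers.pop(rng.randrange(len(speakers)))
    return results
```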
  13. N speakers, same-speaker pairs: [Figure: mean and standard deviation of Log10 LRs against number of reference speakers (10-120), with the Champod and Evett verbal scale shown for reference]
  14. N speakers, same-speaker pairs: • stable mean > 20 speakers • increased variance < 40 speakers [Figure: mean and standard deviation of Log10 LRs against number of reference speakers]
  15. N speakers, different-speaker pairs: [Figure: mean and standard deviation of Log10 LRs against number of reference speakers (10-120)]
  16. N speakers, different-speaker pairs: • stable mean • Cllr: – lowest = 0.606 (120 speakers) – highest = 1.203 (10 speakers) [Figure: mean and standard deviation of Log10 LRs against number of reference speakers]
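The Cllr values quoted on these slides can be computed with the standard log-likelihood-ratio cost of Brümmer and du Preez (2006); the function below is a sketch of that metric rather than the authors' own code. Lower values are better; values at or above 1, like the 1.203 obtained with only 10 reference speakers, indicate an uninformative system.

```python
# Sketch of the log-likelihood-ratio cost (Cllr) of Brümmer and du Preez (2006).
import numpy as np

def cllr(same_log10_lrs, diff_log10_lrs):
    ss = 10.0 ** np.asarray(same_log10_lrs, dtype=float)   # back to raw LRs
    ds = 10.0 ** np.asarray(diff_log10_lrs, dtype=float)
    penalty_ss = np.mean(np.log2(1.0 + 1.0 / ss))  # same-speaker pairs penalised for low LRs
    penalty_ds = np.mean(np.log2(1.0 + ds))        # different-speaker pairs penalised for high LRs
    return 0.5 * (penalty_ss + penalty_ds)

# e.g. cllr(same_log10_lrs=[1.2, 0.8, 2.1], diff_log10_lrs=[-2.0, -0.5, -3.1])
```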
  17. ii. number of tokens per speaker in the reference data – test data combined • 32 same-speaker comparisons • 992 different-speaker comparisons – max N tokens shared by 102 speakers = 13 – LRs calculated 11 times, with 1 token per reference speaker removed at each stage
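Experiment (ii) follows the same logic with tokens rather than speakers; a minimal sketch of the trimming step is shown below, assuming the reference data is held as a mapping from speaker to a list of per-token measurements (that data structure is an assumption for illustration, not the paper's implementation).

```python
# Trim the reference data to at most `n_tokens` tokens per speaker before
# recomputing the LRs at each stage of experiment (ii).
def trim_reference(reference, n_tokens):
    return {speaker: tokens[:n_tokens] for speaker, tokens in reference.items()}

# for n in range(13, 2, -1):          # 13 tokens down to 3 = 11 runs
#     reduced = trim_reference(reference, n)
#     ...recompute same- and different-speaker LRs with `reduced`...
```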
  18. N tokens, same-speaker pairs: • mean LRs = stable • standard deviation = stable [Figure: mean and standard deviation of Log10 LRs against number of tokens per reference speaker]
  19. N tokens, different-speaker pairs: • continual increase in mean & SD as N tokens decreases [Figure: mean and standard deviation of Log10 LRs against number of tokens per reference speaker]
  20. N tokens, different-speaker pairs: • massive increase in strength of evidence • Cllr: – lowest = 0.648 (13 tokens) – highest = 0.762 (5 tokens)
  21. iii. dialect mismatch – 4 independent test sets • ONZE, Manchester, Newcastle and York – 102 speakers in the reference data – 13 tokens per reference speaker
  22. dialect mismatch [Figure: cumulative proportion against Log10 LR; positive LRs = support for the prosecution (same speaker), negative LRs = support for the defence (different speakers)]
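The plot summarised on this slide shows cumulative proportions of same-speaker and different-speaker log10 LRs (a Tippett-style presentation). The sketch below shows one common way to draw such a plot; the plotting conventions and the input arrays are assumptions, not the authors' code.

```python
# Sketch of a Tippett-style plot: cumulative proportions of same-speaker and
# different-speaker log10 LRs against the Log10 LR value.
import numpy as np
import matplotlib.pyplot as plt

def tippett_plot(same_llrs, diff_llrs):
    same = np.sort(np.asarray(same_llrs, dtype=float))
    diff = np.sort(np.asarray(diff_llrs, dtype=float))
    # proportion of same-speaker LRs at or above each value (support for Hp)
    plt.plot(same, 1.0 - np.arange(1, len(same) + 1) / len(same), label="same-speaker pairs")
    # proportion of different-speaker LRs at or below each value (support for Hd)
    plt.plot(diff, np.arange(1, len(diff) + 1) / len(diff), label="different-speaker pairs")
    plt.axvline(0.0, linestyle="--", linewidth=0.5)  # LR = 1 boundary
    plt.xlabel("Log10 Likelihood Ratio")
    plt.ylabel("Cumulative Proportion")
    plt.legend()
    plt.show()
```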
  23. dialect mismatch: F1 and F2 [Figure: same-speaker and different-speaker results for ONZE (match), Newcastle, Manchester and York]
  24. dialect mismatch: F1 and F2 [Figure: cumulative proportion of Log10 LRs for same-speaker and different-speaker pairs, by test set: ONZE (match), Newcastle, Manchester, York]
  25. dialect mismatch: F1 and F2 [Figure: as on the previous slide, with proportions of 71% and 58% highlighted]
  26. 5. discussion i. number of reference speakers – evidence of a “population size effect” (Ishihara and Kinoshita 2008) • misrepresentative estimation of the strength of evidence with small N speakers in the reference data – mean LRs & variance stable > ca. 40 speakers • different-speaker pairs more sensitive
  27. ii. number of tokens per reference speaker – mean and SD for same-speaker pairs robust – different-speaker pairs very sensitive to the removal of even a single token – What if your reference data doesn’t match the case at trial?
  28. iii. dialect mismatch – same-speaker strength of evidence overestimated • generally equivalent to one verbal category – multitude of issues with different-speaker pairs • overestimation of LRs for York (BUT issues of between-speaker variation) • high levels of contrary-to-fact support for the prosecution for Manchester and Newcastle • potential miscarriages of justice
  29. 6. conclusion • positive practical implications – mean and variance of LRs stable until only small N speakers in the reference data – good Cllr, even with relatively small N tokens per speaker – but the more speakers and the more tokens the better • predictably, dialect matters – even for features which aren’t expected to display considerable variation according to region – the default Hd needs to account for this – how narrowly do we need to define dialect? – what about other ‘logically relevant’ class factors?
  30. Thanks. Questions? Vincent Hughes vh503@york.ac.uk
  31. References
      Aitken, C. G. G. and Lucy, D. (2004) Evaluation of trace evidence in the form of multivariate data. Applied Statistics 53(1): 109-122.
      Aitken, C. G. G. and Taroni, F. (2004) Statistics and the evaluation of evidence for forensic scientists (2nd edition). Chichester: John Wiley & Sons.
      Berger, C. (2012) Modern evidential interpretation, reporting and fallacies. Lecture given at the BBfor2 Summer School in Forensic Evidence Evaluation and Validation. Universidad Autonoma de Madrid, Spain. 18-21 July 2012.
      Brümmer, N. and du Preez, J. (2006) Application independent evaluation of speaker detection. Computer Speech and Language 20: 230-275.
      Champod, C. and Evett, I. W. (2000) Commentary on A. P. A. Broeders (1999) ‘Some observations on the use of probability scales in forensic identification’. Forensic Linguistics 7(2): 238-243.
      Ishihara, S. and Kinoshita, Y. (2008) How many do we need? Exploration of the population size effect on the performance of forensic speaker classification. Paper presented at the 9th Annual Conference of the International Speech Communication Association (Interspeech). Brisbane, Australia. 1941-1944.
      Kaye, D. H. (2004) Logical relevance: problems with the reference population and DNA mixtures in People v. Pizarro. Law, Probability and Risk 3: 211-220.
      Kaye, D. H. (2008) DNA probabilities in People v. Prince: when are racial and ethnic statistics relevant? In Speed, T. and Nolan, D. (eds.) Probability and Statistics: Essays in Honour of David A. Freedman. Beachwood, OH: Institute of Mathematical Statistics. 289-301.
  32. Kinoshita, Y., Ishihara, S. and Rose, P. (2009) Exploring the discriminatory potential of F0 distribution parameters in traditional speaker recognition. International Journal of Speech, Language and the Law 16(1): 91-111.
      Labov, W. (1971) The study of language in its social context. In Fishman, J. A. (ed.) Advances in the Sociology of Language (vol. 1). The Hague: Mouton. 152-216.
      Loakes, D. (2006) A forensic phonetic investigation into the speech patterns of identical and non-identical twins. PhD dissertation, University of Melbourne.
      McDougall, K. (2004) Speaker-specific formant dynamics: an experiment on Australian English /aɪ/. International Journal of Speech, Language and the Law 11(1): 103-130.
      McDougall, K. (2006) Dynamic features of speech and the characterisation of speakers: towards a new approach using formant frequencies. International Journal of Speech, Language and the Law 13(1): 89-126.
      Morrison, G. S. (2007) Matlab implementation of Aitken and Lucy’s (2004) forensic likelihood-ratio software using multivariate-kernel-density estimation [software]. Available: http://geoff-morrison.net.
      Morrison, G. S. (2008) Forensic voice comparison using likelihood ratios based on polynomial curves fitted to the formant trajectories of Australian English /aI/. International Journal of Speech, Language and the Law 15(2): 249-266.
  33. Rose, P. (2004) Technical forensic speaker identification from a Bayesian linguist’s perspective. Keynote paper, Forensic Speaker Recognition Workshop, Speaker Odyssey ’04. 31 May - 3 June 2004, Toledo, Spain. 3-10.
      Rose, P. (2011) Forensic voice comparison with Japanese vowel acoustics – a likelihood ratio-based approach using segmental cepstra. Proceedings of the 17th International Congress of Phonetic Sciences. 17-21 August 2011, Hong Kong. 1718-1721.
      Rose, P., Osanai, T. and Kinoshita, Y. (2003) Strength of forensic speaker identification evidence: multispeaker formant- and cepstrum-based segmental discrimination with a Bayesian likelihood ratio as threshold. Forensic Linguistics 10(2): 179-202.
      Rose, P., Kinoshita, Y. and Alderman, T. (2006) Realistic extrinsic forensic speaker discrimination with the diphthong /aI/. Proceedings of the 10th Australian Conference on Speech Science and Technology, 8-10 December 2004, Sydney: Macquarie University. 329-334.
      Rose, P. and Morrison, G. S. (2009) A response to the UK Position Statement on forensic speaker comparison. International Journal of Speech, Language and the Law 16(1): 139-163.
      Wells, J. C. (1982) Accents of English (3 vols). Cambridge: Cambridge University Press.
