
The Bayesian approach to forensic voice comparison: applications and limitations

Hughes, V. (2013) The Bayesian approach to forensic voice comparison: applications and limitations. New Zealand Institute of Language, Brain and Behaviour (NZILBB) Seminar, University of Canterbury, Christchurch, NZ. 7 February 2013. (Invited talk)



  1. The Bayesian approach to forensic voice comparison: applications and limitations
     Vincent Hughes, Department of Language and Linguistic Science
     New Zealand Institute of Language, Brain and Behaviour (NZILBB) Seminar, University of Canterbury, 7th February 2013
  2. 0. Outline
     Focus of this talk:
     • application of Bayesian principles to the assessment of the strength of expert evidence
     • the complexity of speech as evidence
     • practical and theoretical issues with the application of the likelihood ratio (LR) to speech
     • potential solutions and directions for future research
  3. 1. Introduction
  4. 1. Introduction
     • forensic voice comparison (FVC) = voice of the criminal (disputed) vs. voice of the suspect (known)
       – disputed (DS) = threatening phone calls, bomb threat...
       – known (KS) = police interview recording (in the UK)
     • range of parameters analysed (Gold and French 2011):
       – segmental (vowels, consonants)
       – suprasegmental (f0, intonation, articulation rate)
       – higher-order linguistic (lexical choice, syntax)
       – voice quality/vocal setting
  5. 1. Introduction
     • the “ultimate issue” (Lynch and McNally 2003: 96): do the known and disputed recordings contain the voice of the same or different individuals?
       – courts are dependent on notions of conditional probability
       – the formal way to think about conditional probability = Bayes’ Theorem (1763)
     prior odds × likelihood ratio (evidence) = posterior odds
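The odds form of Bayes' Theorem on this slide can be illustrated with a small numerical sketch; all figures here are hypothetical, chosen only to show the arithmetic, not taken from the talk:

```python
# Odds form of Bayes' Theorem: posterior odds = prior odds * likelihood ratio.
# All numbers are hypothetical, for illustration only.

prior_odds = 1 / 100        # trier of fact's prior odds in favour of Hp
likelihood_ratio = 300      # strength of the expert's evidence

posterior_odds = prior_odds * likelihood_ratio   # 3.0

# An odds value converts back to a posterior probability p(Hp|E) = odds / (1 + odds)
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_odds)             # 3.0
print(round(posterior_prob, 2))   # 0.75
```

Note that only the likelihood ratio is the expert's contribution; the prior odds (and hence the posterior) belong to the trier of fact, which is the point the next slides make.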
  6. 1. Introduction
     posterior odds = p(Hp|E) / p(Hd|E)   ✗
     (Broeders 1999: 229; French and Baldwin 1990: 10)
     p = probability; Hp = prosecution hypothesis (guilty); Hd = defence hypothesis (innocent); E = evidence; | = ‘given’
  7. 1. Introduction
     posterior odds = p(Hp|E) / p(Hd|E)   ✗
     • the trier of fact (jury/judge) is responsible for determining whether the defendant is innocent or guilty
     • p(H|E) is dependent on all of the evidence in the case – not available to the expert
  8. 1. Introduction
     likelihood ratio (LR) = p(E|Hp) / p(E|Hd)   ✓
     • gradient assessment of the strength of evidence
       – LR > 1 = support for the prosecution
       – LR < 1 = support for the defence
     • logically and legally correct
     [Figure: verbal scale from ‘innocent’ to ‘guilty beyond a reasonable doubt’, adapted from Berger (2012)]
  9. 1. Introduction
     • LR = similarity and typicality
       – it matters “whether the values found matching (…) are vanishingly rare, or sporadic, or near universal” (Nolan 2001: 16)
       – typicality of values within- and between-speakers
     • typicality = dependent on patterns in the “relevant population” (Aitken and Taroni 2004)
       – quantified relative to a sample of the population
       – distributions modelled statistically to generate numerical output
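How similarity and typicality combine in an LR can be sketched for a single continuous parameter with normal within- and between-speaker distributions. The parameter, means and SDs below are invented for illustration; real casework models are considerably more elaborate:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

# Hypothetical measurement (e.g. an F2 midpoint in Hz) from the disputed sample
x_disputed = 1450.0

# Similarity: how well does the value fit the suspect's own (within-speaker) model?
suspect_mean, suspect_sd = 1440.0, 40.0    # hypothetical values

# Typicality: how common is the value in the relevant population?
pop_mean, pop_sd = 1300.0, 120.0           # hypothetical values

lr = (normal_pdf(x_disputed, suspect_mean, suspect_sd)
      / normal_pdf(x_disputed, pop_mean, pop_sd))
print(lr)   # > 1: similar to the suspect AND atypical of the population
```

If the disputed value were near-universal in the population (Nolan's "near universal" case), the denominator would grow and the LR would fall towards or below 1, even with a close match to the suspect.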
  10. 2. Complexity of speech
  11. 2.1 Within-speaker variability
     • speech is inherently variable within individuals
       – two utterances produced by the same individual will never be exactly the same
     • therefore, unlike DNA, p(E|Hp) can never be 1
     • multiple sources of within-speaker variability
       – long term = ageing/habitual behaviour (smoking etc.)
       – short term = stylistic factors/time of day/emotion
  12. 2.2 Between-speaker variation
     • between-speaker variation is constrained by:
       – anatomical factors
       – phonological factors
       – regional/social factors
     • all of these interact with each other and affect different linguistic-phonetic parameters in different ways
  13. 2.3 Types of data
     • discrete (th-fronting/h-dropping)
     • continuous (formant frequencies/f0)
     • normal (lots, but rarely tested formally)
     • non-normal
       – within- and between-speaker variation for the same parameter may not show the same type of distribution
  14. 2.4 Correlations
     • speech parameters form highly correlated sub-systems
       – within- and between-parameters
     • some of these are predictable:
       – f0 and F1 (Assmann and Nearey 2007)
       – F2 for FLEECE and GOOSE (Gold and Hughes 2012)
     • some not so predictable:
       – TH-fronting and labial-r (Milroy 1996)
  15. 2.5 Real forensic material
     • generally poor recording quality
     • transmission/technical effects
       – mismatch between DS and KS
       – artificial altering of formant frequencies due to telephone bandpass restrictions (Künzel 2001, Byrne and Foulkes 2004)
       – mobile phone codecs (Gold 2009, Enzinger 2010)
  16. 3. Issues with LRs and FVC
  17. 3.0 Development of the LR for FVC
     • arguments for a move away from expressing conclusions as p(H|E) (Broeders 1999, Champod and Meuwly 2000)
       – “paradigm shift” (Saks and Koehler 2005)
     • since 2001: a considerable amount of research on FVC and the LR (largely thanks to a small community of researchers)
     *but much of this research has underestimated, overlooked or ignored the inherent complexity of speech
  18. 3.1 What to analyse?
     • good discriminants = low within-speaker variability/high between-speaker variation (Nolan 1983)
       – variance ratio (VR) = between-speaker variation / within-speaker variation
     • low VRs are common for linguistic-phonetic parameters
       – they’re not very good discriminants (individually)!
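The variance ratio above can be sketched directly from per-speaker token data; the three speakers and their measurements below are hypothetical, chosen so the arithmetic is easy to follow:

```python
from statistics import mean, pvariance

# Hypothetical token measurements per speaker for one parameter (e.g. f0 in Hz)
speakers = {
    "A": [210, 215, 208, 212],
    "B": [240, 245, 238, 243],
    "C": [225, 222, 228, 224],
}

# Within-speaker variation: average of each speaker's own variance
within = mean(pvariance(tokens) for tokens in speakers.values())

# Between-speaker variation: variance of the speaker means
between = pvariance([mean(tokens) for tokens in speakers.values()])

variance_ratio = between / within   # high VR = a good discriminant (Nolan 1983)
print(round(variance_ratio, 1))
```

With these invented numbers the speakers are well separated, so the VR is high; for many real linguistic-phonetic parameters the speaker means overlap heavily and the ratio is far lower, which is the slide's point.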
  19. 3.1 What to analyse?
     • current approach in the LR literature
       – small N parameters = representative of the ‘voice’
         • almost exclusively vowel formants (Zhang et al 2008, Morrison 2009, Rose et al 2006)
       – narrowly defined contexts
         • Rose (2012): “not too bad” (f0 and formants)
         • conservative LR = 300,000 (!!)
       – argued that this is consistent with the small proportion of the genome analysed for forensic DNA evidence
  20. 3.1 What to analyse?
     • alternative view (Nolan 2001, French et al 2010): we have a duty to analyse as much as we can
       – relying on a small proportion of speech could lead to misrepresentation of the strength of evidence
  21. 3.2 Statistical modelling
     • to compute an LR we need to generate models from our data
     • these are based on distributions
       – data converted to probability density functions
     • limited in terms of the formulas we can apply
  22. 3.2 Statistical modelling
     current options:
     (1) Lindley (1977)
     • requires continuous data
     • models within- and between-speaker variation using a normal distribution
     • assumes variances in DS and KS are equal
     • for use with univariate parameters
     *limited application for most ling-phon parameters
  23. 3.2 Statistical modelling
     (2) Aitken and Lucy (2004)
     • developed for refractive indices of glass fragments
     • requires continuous data
     • accounts for multivariate parameters (but designed for 3 or 4 features per parameter)
     • models within-speaker variation using an assumption of normality and between-speaker variation with a Gaussian kernel
     *applied to most continuous acoustic-phonetic parameters
  24. 3.2 Statistical modelling
     (3) Reynolds et al (2000)
     • requires continuous data
     • person-independent UBM for background data generated using a GMM
     • speaker-specific GMM forms the suspect model
     • no assumption of normality
     *commonly applied in ASR, but variable performance using ling-phon parameters
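The GMM–UBM scoring idea can be sketched in one dimension with tiny hand-set mixtures; in practice the UBM is trained on background data and the suspect model adapted from it (e.g. by MAP adaptation), and all parameter values below are hypothetical placeholders:

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mean, sd):
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def gmm_pdf(x, components):
    """Mixture density: components is a list of (weight, mean, sd) tuples."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)

# Universal background model (UBM): stands in for a mixture fitted offline
ubm = [(0.5, 1200.0, 150.0), (0.5, 1500.0, 150.0)]

# Suspect model: stands in for a mixture adapted from the UBM on suspect speech
suspect = [(0.5, 1380.0, 60.0), (0.5, 1460.0, 60.0)]

# Disputed-sample tokens (hypothetical); score = mean log-LR per token
tokens = [1400.0, 1390.0, 1445.0]
log_lr = sum(log(gmm_pdf(x, suspect) / gmm_pdf(x, ubm)) for x in tokens) / len(tokens)
print(log_lr)   # positive: tokens better explained by the suspect model than the UBM
```

Because the mixture can have any number of components, no normality assumption is needed, which is the advantage the slide notes over the Lindley and Aitken–Lucy approaches.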
  25. 3.2 Statistical modelling
     problems:
     • using models which were never developed for ling-phon parameters
     • forces us to make assumptions about our data which aren’t necessarily appropriate (normality?)
       – may explain the general preference for formants!
     • means lots of useful ling-phon parameters can’t be analysed
  26. 3.2 Statistical modelling
     solutions:
     • developing models which fit the data (specific to ling-phon parameters)
     • some progress in this area:
       – Nair et al (2012): principal component analysis to account for multiple features of a single parameter
       – Aitken and Gold (2012): LR for discrete data based on click rate
       – Foulkes et al: “Modelling features for forensic speaker comparison”
  27. 3.3 What’s the relevant population?
     • in theory defined by the defence hypothesis
       – but often no more than “it wasn’t the defendant”
     • without knowing who the offender is we can’t know the population of which he is a member
  28. 3.3 What’s the relevant population?
     default assumptions:
     (1) Rose (2004): non-contemporaneous recordings of “same-sex speakers of the language”
     • ‘logical relevance’ (Kaye 2004, 2008)
     • naïve view of variation
     • sources of within-speaker variation in KS and DS not captured by ‘non-contemporaneity’
     • why sex and language above other sources of between-speaker variation? how narrowly do we define things like regional background?
  29. 3.3 What’s the relevant population?
     default assumptions:
     (2) Morrison et al (2012): speakers who sound similar to the offender, as judged by lay listeners
     • what factors do we control in our listeners?
     • what do the listeners hear? – some controls on the part of the expert (usually sex and language again)
     • lay listeners are linguistically erratic when it comes to assessing speaker similarity (McDougall 2011)
  30. 3.3 What’s the relevant population?
     problems:
     • current approaches reveal a naïve view of variation in production and perception
       – within- and between-speaker variation
     • little understanding of how the LR is affected by variation in the reference data
     • which factors to control and which to ignore? (there are so many!)
  31. 3.3 What’s the relevant population?
     (theoretical) solution:
     • the underlying assumption that the relevant population consists of similar-sounding speakers is probably right
       – but it should be linguistically grounded
       – speakers who could objectively sound like the offender
       – reduces all potential grouping variables to ‘similarity’
  32. 3.4 Collecting reference data
     • case-by-case basis (Rose 2007)
       – need reference data for every parameter we would want to analyse
       – another reason for the limited selection of parameters?
     • use an existing database: forensic or non-forensic
     problem: inevitable mismatch between any reference data and the facts of the case at trial
       – how much does it matter?
  33. 3.5 Size of the sample
     • how big does our sample of the relevant population need to be to generate meaningful LRs?
     • two issues:
       – is it representative?
       – is there enough of it?
     • some systematic research on the first question for N speakers (Ishihara and Kinoshita 2008)
  34. [Figure: PRICE (F1, F2 and F3), same-speaker pairs, mean +/- 1 SD (Hughes, in prep)]
  35. 3.5 Size of the sample
     • Monte Carlo simulations (MCS) offer a way of investigating the second question
       – use MCS to generate synthetic data from a sample of raw data
       – requires the raw data to be representative (!!)
       – properties of the distribution of the synthetic data defined by the raw data
       – Rose (2012): no real differences in LRs between 30 and 10,000 speakers
  36. 3.5 Size of the sample [figure]
  37. 3.5 Size of the sample
     • Hughes (in prep): MCS to test N speakers for local articulation rate (AR)
       – data from Gold (in prep)
       – raw data = 79 speakers/26 ‘tokens’ per speaker
       – generated synthetic mean and SD values for new speakers (up to 10,000 speakers)
       – then generated 26 tokens per speaker from each of the synthetic normal distributions
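The two-step simulation described on this slide might be sketched as follows; the raw speaker means and the within-speaker SD below are hypothetical stand-ins, not the Gold (in prep) data:

```python
import random
from statistics import mean, stdev

random.seed(1)

# Raw sample: hypothetical per-speaker mean articulation rates (syllables/sec)
raw_speaker_means = [4.8, 5.1, 5.4, 4.6, 5.0, 5.2, 4.9, 5.3, 4.7, 5.0]
mu, sigma = mean(raw_speaker_means), stdev(raw_speaker_means)

# MCS step 1: synthesise means for a much larger speaker pool,
# drawn from a normal distribution whose parameters come from the raw data
synthetic_means = [random.gauss(mu, sigma) for _ in range(10_000)]

# MCS step 2: generate 26 tokens per synthetic speaker
# (within-speaker SD fixed at a hypothetical 0.3 syllables/sec)
synthetic_tokens = {i: [random.gauss(m, 0.3) for _ in range(26)]
                    for i, m in enumerate(synthetic_means[:100])}

# The synthetic pool simply inherits the raw sample's properties,
# which is why MCS cannot rescue an unrepresentative raw sample
print(round(mean(synthetic_means), 2), round(stdev(synthetic_means), 2))
```

This makes the slide's caveat concrete: however many synthetic speakers are generated, their distribution is fully determined by the original small sample.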
  38. 3.5 Size of the sample [figure]
  39. 3.5 Size of the sample
     what does this mean for LRs?
     • best to avoid small samples (<20: models of within- and between-speaker variation not representative)
     • all cases may require some sample-size testing (MCS?)
       – but MCS is no solution to small N (!!)
     • different parameters behave in different ways
       – dependent on inherent speaker-discriminatory power
  40. 3.6 Accounting for correlations
     • a componential approach requires multiple parameters to be combined into an overall LR
     • naïve Bayes (Kononenko 1990):
       – simple multiplication of LRs for non-correlated parameters
       – but speech is complex!
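Under the naïve Bayes assumption, combining parameters is just multiplication of their LRs (equivalently, addition of their log-LRs). The parameter names and LR values below are hypothetical, purely to show the mechanics:

```python
from math import log10

# LRs for separate, assumed-uncorrelated parameters (hypothetical values)
lrs = {"f0": 4.0, "articulation_rate": 2.5, "F2_FLEECE": 0.8}

# Naive Bayes: multiply the LRs...
overall_lr = 1.0
for lr in lrs.values():
    overall_lr *= lr

# ...or, equivalently, sum the log10 LRs
overall_llr = sum(log10(lr) for lr in lrs.values())

print(overall_lr)             # 8.0
print(round(overall_llr, 3))  # 0.903
```

The slide's objection is that this multiplication is only valid if the parameters really are uncorrelated; with correlated parameters (as in speech), the same information gets counted more than once and the overall LR is inflated.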
  41. 3.6 Accounting for correlations
     • in the early LR literature naïve Bayes was applied despite correlations (Kinoshita 2002, Rose et al 2003)
     • fusion (Brümmer et al 2007) = currently preferred method
       – developed for ASR
       – “back-end processing” (Rose and Winter 2010)
       – attaches weights based on correlations of LRs
  42. 3.6 Accounting for correlations
     problems:
     • fusion doesn’t necessarily capture correlations in the raw data
     • only accounts for linear correlations between pairs of LRs
     • not very efficient
  43. 3.6 Accounting for correlations
     solutions:
     • ‘front-end’ account of correlations
       – i.e. considering correlations prior to analysis
     • Bayesian networking offers a way of doing this (Taroni et al 2006)
       – build a model of the complex interrelations between parameters
       – factorise parameters which are counted multiple times
       – Gold and Hughes: “Identifying correlations between speech parameters for forensic speaker comparisons”
  44. 4. Discussion
  45. 4. Discussion
     • plenty of arguments why the LR is the logically and legally correct framework for forensic evidence
       – keeps the roles of expert and trier of fact separate
       – forces the expert to analyse only the specific piece of evidence
  46. 4. Discussion
     • implicit in current LR-based FVC research is the view that the data should be forced to fit the numerical LR framework (in its current form) even if this means:
       – analysing only a small sub-set of potential parameters
       – making unrealistic assumptions about the distribution of our data
       – not accounting for the inherent complexity of speech
  47. 4. Discussion
     • the models and procedures that we apply were often not developed to account for speech
       – this is, of course, a challenge…
       – but also an opportunity!
     • DNA = seen as “setting the standard” (Baldwin 2005: 55) for forensic evidence
  48. 4. Discussion
     • plenty of other forensic disciplines are experiencing similar problems
     • speech can lead the way in developing new procedures for computing LRs
  49. Thanks! Questions?
     Acknowledgements: ESRC, Paul Foulkes, Erica Gold, Peter French, Dom Watt, Ashley Brereton, FSS Research Group (York)
