Regional variation and the definition of the relevant population in likelihood ratio-based forensic voice comparison using cepstral coefficients

Hughes, V. and Foulkes, P. (2014) Regional variation and the definition of the relevant population in likelihood ratio-based forensic voice comparison using cepstral coefficients. Paper presented at 15th Australasian Conference on Speech Science and Technology (ASSTA), University of Canterbury, Christchurch, NZ. 3-5 December 2014.

Regional variation and the definition of the relevant population in likelihood ratio-based forensic voice comparison using cepstral coefficients
Vincent Hughes
Paul Foulkes
Department of Language and Linguistic Science
1. Introduction
• forensic voice comparison (FVC) = voice of offender (unknown) vs. voice of suspect (known)
  is the voice in the criminal recording the same as the voice in the suspect recording?
• the expert cannot answer this question:
  – this is the role of the trier-of-fact (judge/jury)
  – requires access to information beyond that of the speech evidence
SST Conference, Christchurch, 3rd December 2014
1. Introduction
Likelihood Ratio (LR) = p(E|Hp) / p(E|Hd)
• gradient assessment of the strength of evidence
• (log) LR = value centred on 0, where:
  – support for prosecution = > 0
  – support for defence = < 0
p = probability; E = evidence; | = ‘given’; Hp = prosecution hypothesis; Hd = defence hypothesis
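The LR formula above can be illustrated with a minimal numeric sketch: univariate Gaussians stand in for p(E|Hp) (similarity to the suspect's distribution) and p(E|Hd) (typicality in the relevant population). All means, standard deviations and measurement values below are invented purely for illustration.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def log_lr(evidence, suspect_mean, suspect_sd, pop_mean, pop_sd):
    """log10 LR = log10 [ p(E|Hp) / p(E|Hd) ].

    Hp is modelled by the suspect's own distribution (similarity);
    Hd is modelled by the relevant-population distribution (typicality).
    """
    p_hp = gaussian_pdf(evidence, suspect_mean, suspect_sd)
    p_hd = gaussian_pdf(evidence, pop_mean, pop_sd)
    return math.log10(p_hp / p_hd)

# A measurement close to the suspect but atypical of the population
# yields a positive log LR (support for the prosecution hypothesis).
llr = log_lr(120.0, suspect_mean=121.0, suspect_sd=5.0,
             pop_mean=100.0, pop_sd=15.0)
```

Note how typicality matters: the same similarity to the suspect yields a weaker LR if the measured value is also common in the relevant population.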
1. Introduction
• LR = similarity and typicality
  – it matters “whether the values found matching … are vanishingly rare … or near universal” (Nolan, 2001: 16)
• typicality = dependent on patterns in the “relevant population” (Aitken & Taroni, 2004)
  – defined according to the defence hypothesis
    • “it wasn’t the suspect, it was…”
• what is the appropriate relevant population?
• paradox: without knowing who the offender is, we can’t know the population of which (s)he is a member
1. Introduction
automatic speaker recognition (ASR):
i. cepstral coefficients (CCs)/derivatives
  – cepstrum: spectral representation of the signal
  – capture long- and short-term properties of the vocal tract and articulatory configuration
ii. “automatic” analysis:
  – data extracted from frames across the entire sample
    • i.e. not from linguistically meaningful units of speech (e.g. phonemes) (*although see Rose, 2011, 2013)
1. Introduction
• commercial ASR systems claimed to be “language and speech independent and thus deliver results irrespective of the language or accent used by the speaker” (Batvox, 2013)
• Moreno et al. (2006):
  – single set of 43 Andalusian Spanish test speakers (same- and different-speaker comparisons)
  – reference data:
    • Matched = 50 Andalusian Spanish speakers
    • 2 × Mismatched = 50 Castilian / Galician Spanish speakers
1. Introduction
• 19 MFCCs and delta coefficients
• essentially no difference in EER across systems
• “dialect influence is not a relevant variable for (A)SR systems … due to the fact that (A)SR uses low level acoustic characteristics not affected by differences in dialects” (Moreno et al., 2006)
but…
  – small-scale study
  – calibration?
  – effect on Cllr or system reliability?
  – extent of dialect differences?
1. Introduction
• Harrison and French (2010):
  – analysed distances based on MFCCs produced by Batvox (118 speakers of British English)
  – speakers from the same regional backgrounds generally closest to each other
  – possible evidence that CCs capture regional differences in long-term vocal setting
RQ: to what extent are the validity and reliability of a generic CC-based ASR system affected by the definition of the relevant population according to regional background?
2. Method: Data
• data extracted from sub-sets of 7 of the 8 dialect regions (DRs) in TIMIT (Garofolo et al., 1993)
2. Method: Data
• test speakers (SS and DS comparisons) = 25 speakers from North Midland (DR3)
• systems (acting as training and reference data):

  System      N sets   Speakers
  Matched     1        28 (DR3)
  Mismatched  6        28 (DRs 1, 2, 4, 5, 6, 7)
  Mixed       1        4 × DRs 1, 2, 3, 4, 5, 6, 7
2. Method: Linguistic factors
• DRs consistent with classification of AmEng regional dialects in ANAE (Labov et al., 2006)
• DR3 test data (North Midland) linguistically most similar to DRs 4 (South Midland), 5 (Southern) and 7 (Western)
  – COT~CAUGHT merger / GOAT fronting…
• DR3 most dissimilar from:
  – DR2 (Northern): Northern Cities Shift
  – DR6 (New York): /r/ vocalisation & /ɔː/ lowering
2. Method: Feature extraction
• 10 sentences of read speech per speaker
  – 5/5 division for suspect and offender samples for test speakers (c. 15 s total per sample)
  – 16 kHz sampling rate
• 12 MFCCs / 12 LPCCs extracted from speech-active portions of samples:
  – pre-emphasis filter = 0.97 coefficient value
  – 20 ms Hamming window
  – 10 ms overlap between windows (50% overlap)
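The extraction settings on this slide (0.97 pre-emphasis, 20 ms Hamming windows, 10 ms hop at 16 kHz) can be sketched in NumPy. This covers only the pre-emphasis and framing stage; the mel filterbank / LPC analysis and cepstral transform that produce the actual MFCCs or LPCCs are omitted, and the function name is my own.

```python
import numpy as np

def frame_signal(signal, sr=16000, win_ms=20, hop_ms=10, preemph=0.97):
    """Pre-emphasise a waveform and slice it into Hamming-windowed frames,
    mirroring the slide's settings: 0.97 pre-emphasis coefficient,
    20 ms windows, 10 ms hop (50% overlap) at 16 kHz."""
    # Pre-emphasis filter: y[n] = x[n] - 0.97 * x[n-1]
    emphasised = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    win_len = int(sr * win_ms / 1000)   # 320 samples
    hop_len = int(sr * hop_ms / 1000)   # 160 samples
    n_frames = 1 + (len(emphasised) - win_len) // hop_len
    window = np.hamming(win_len)
    frames = np.stack([
        emphasised[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, win_len)

# One second of audio at 16 kHz yields 99 frames of 320 samples each.
frames = frame_signal(np.zeros(16000))
```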
2. Method: Feature-to-score
• Gaussian Mixture Model – Universal Background Model (GMM-UBM) approach (Reynolds et al., 2000)
  – suspect models = GMMs based on the raw suspect data
  – 32 Gaussians per model (based on Reynolds, 1995)
• training scores: cross-validated SS (28) and DS (756) LR scores computed for each system
• test scores: parallel sets of SS (25) and DS (600) LR scores using Matched/Mismatched/Mixed sets as reference data
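A toy sketch of this feature-to-score step, using scikit-learn's GaussianMixture in place of a dedicated speaker-recognition toolkit. As on the slide, the suspect model is a GMM trained directly on the raw suspect data (not MAP-adapted from the UBM); the feature matrices are random stand-ins, and 4 diagonal-covariance components are used here for speed rather than the 32 Gaussians of the actual systems.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical feature matrices (rows = frames, columns = 12 CCs).
ubm_feats = rng.normal(0.0, 1.0, size=(2000, 12))       # reference population
suspect_feats = rng.normal(0.5, 1.0, size=(1000, 12))   # suspect sample
offender_feats = rng.normal(0.5, 1.0, size=(500, 12))   # questioned sample

# UBM trained on the reference data; suspect model on the suspect data.
ubm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(ubm_feats)
suspect_model = GaussianMixture(n_components=4, covariance_type="diag",
                                random_state=0).fit(suspect_feats)

# Score = mean per-frame log-likelihood of the offender sample under the
# suspect model minus that under the UBM (a log-likelihood-ratio score).
score = suspect_model.score(offender_feats) - ubm.score(offender_feats)
```

Because the offender features are drawn from the same distribution as the suspect's, the score comes out positive; DS comparisons would typically score negative.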
2. Method: Score-to-LR
• scores from cross-validated comparisons for the Matched/Mismatched/Mixed sets used to train separate logistic regression models (Brümmer & du Preez, 2006)
  – calibration coefficients derived from each model
• coefficients applied to each of the parallel sets of test scores
  – scores converted to calibrated log LRs (LLRs)
  – systems analysed based on LLR output
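The score-to-LR step can be sketched with scikit-learn: a linear logistic regression (LLR = a·score + b) is fitted to the cross-validated SS/DS training scores, and the training-set log prior odds are subtracted from the fitted log posterior odds so the output is a log likelihood ratio. This is a simplified stand-in for the Brümmer & du Preez (2006) method (which optimises a weighted objective); all scores below are simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical training scores: SS scores sit higher than DS scores,
# with set sizes matching the slides (28 SS, 756 DS).
ss_scores = rng.normal(2.0, 1.0, 28)
ds_scores = rng.normal(-2.0, 1.0, 756)
scores = np.concatenate([ss_scores, ds_scores]).reshape(-1, 1)
labels = np.concatenate([np.ones(28), np.zeros(756)])

# Linear logistic-regression calibration: LLR = a * score + b.
clf = LogisticRegression().fit(scores, labels)

def calibrate(raw_scores):
    """Map raw scores to calibrated natural-log LRs by subtracting the
    training-set log prior odds from the fitted log posterior odds."""
    log_post_odds = clf.decision_function(np.asarray(raw_scores).reshape(-1, 1))
    log_prior_odds = np.log(len(ss_scores) / len(ds_scores))
    return log_post_odds - log_prior_odds

llrs = calibrate([3.0, -3.0])
```

A clearly SS-like score maps to a positive LLR and a clearly DS-like score to a negative one, regardless of the skewed SS/DS class balance in the training set.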
2. Method: System performance
• validity (accuracy)
  – Equal Error Rate (EER): point at which the % of false hits (DS as SS) and the % of false misses (SS as DS) are equal
    • categorical error metric based on hard accept-reject decisions
  – Log LR Cost (Cllr) (Brümmer & du Preez, 2006)
    • gradient error metric which penalises the system based on the magnitude of the errors
• reliability (precision)
  – 95% credible intervals (CIs) (non-parametric; Morrison et al., 2010)
    • posterior density which captures the variability across the calibrated LLRs from the same comparisons across the eight systems
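The two validity metrics can be sketched directly from sets of SS and DS log LRs (natural logs assumed here, with a simple threshold sweep for the EER):

```python
import numpy as np

def eer(ss_llrs, ds_llrs):
    """Equal Error Rate: sweep candidate thresholds and return the point
    where the false-miss rate (SS below threshold) and the false-hit
    rate (DS above threshold) are closest to equal."""
    thresholds = np.sort(np.concatenate([ss_llrs, ds_llrs]))
    best = min(thresholds,
               key=lambda t: abs(np.mean(ss_llrs < t) - np.mean(ds_llrs > t)))
    return 0.5 * (np.mean(ss_llrs < best) + np.mean(ds_llrs > best))

def cllr(ss_llrs, ds_llrs):
    """Log LR cost (Brümmer & du Preez, 2006) for natural-log LRs:
    a gradient metric that penalises errors by their magnitude."""
    ss_term = np.mean(np.log2(1 + np.exp(-np.asarray(ss_llrs))))
    ds_term = np.mean(np.log2(1 + np.exp(np.asarray(ds_llrs))))
    return 0.5 * (ss_term + ds_term)

# A well-separated toy system: zero EER and a low Cllr.
ss = np.array([2.0, 3.0, 4.0, 1.5])
ds = np.array([-2.0, -3.0, -4.0, -1.5])
```

Note the contrast the slide draws: the EER only counts hard decisions, while Cllr still charges a small cost for correct but weakly supported LLRs.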
3. Results: Validity
[Figure: EER (%) against log LR cost (Cllr) for the Matched, Mismatched (1, 2, 4, 5, 6, 7) and Mixed systems: MFCCs]
3. Results: Validity
[Figure: EER (%) against log LR cost (Cllr) for the Matched, Mismatched (1, 2, 4, 5, 6, 7) and Mixed systems: LPCCs]
3. Results: Reliability
Mean 95% CIs:
  MFCCs = ±1.88
  LPCCs = ±1.80
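As a rough stand-in for how such mean CIs could be computed, one can take a non-parametric percentile spread of the LLRs for each comparison across the eight systems and average the half-widths. The slides' actual CIs are posterior-density credible intervals (Morrison et al., 2010), so this is only an illustrative simplification, and the LLR matrix below is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical calibrated LLRs: rows = 625 test comparisons,
# columns = 8 systems (Matched, six Mismatched, Mixed).
llrs = rng.normal(0.0, 1.0, size=(625, 8))

# Non-parametric 95% interval per comparison: half the spread between
# the 2.5th and 97.5th percentiles across the eight systems.
lo, hi = np.percentile(llrs, [2.5, 97.5], axis=1)
half_widths = (hi - lo) / 2

# Mean half-width across comparisons, analogous to the ±1.88 / ±1.80
# figures reported on the slide.
mean_ci = half_widths.mean()
```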
4. Discussion
• overall extremely good performance in terms of validity
  – read speech
  – contemporaneous samples
  – no technical mismatch
• little difference between systems in terms of categorical/gradient validity
  – ceiling effect due to forensically unrealistic data?
  – no evidence of linguistic differences manifested in differences in system validity
4. Discussion
• relatively large imprecision in the LLRs from individual comparisons across systems
  – larger-magnitude LLRs produce the greatest variability
  – variability occurs so far away from the threshold that it has no effect on validity
• potentially evidence of the effects of regional background mismatch
  – greater than the imprecision across LRs from different samples of the same population?
5. Conclusion
• no evidence of differences in system validity according to regional variation
  – ceiling effect?
• but evidence of sensitivity to the definition of the relevant population, manifested in the magnitude of calibrated LLRs
• future work:
  – more forensically realistic data
  – regional variation in British English
    • a priori expectations for regional differences in long-term vocal setting may be manifested in CCs (Harrison and French, 2010)
Thanks – ta – kia ora
Vincent Hughes
Paul Foulkes
Department of Language and Linguistic Science
