When using Y-chromosome DNA profiles, it often happens that a DNA profile found at a crime scene and matching the suspect’s profile does not appear in the relevant database. This poses a serious challenge to the analyst, who is required to supply a likelihood ratio (LR) or match probability in order to quantify the evidential value of the match. Sensible estimation of the LR seems to rely on sensible estimation of the population frequency of this previously unseen haplotype.
There are three existing proposals of quite different nature: Roewer et al. (2000), based on Bayesian estimation of the haplotype frequency with a Beta prior; Brenner (2010), based on the number of singletons observed in the database; and Andersen et al. (2013), which uses a mixture of independent discrete Laplace distributions as a parametric approximation of the distribution of allelic frequencies.
We add two new methods. One is similar to Brenner’s, and like Brenner’s is strongly related to the Good-Turing estimator. The second is based on Anevski, Gill and Zohren’s (arXiv.org/math.ST:1312.1200) study of a non-parametric maximum-likelihood estimator. It is intermediate between the parametric approach of Andersen et al. and non-parametric methods based on Good-Turing estimators. We believe that it avoids the disadvantages of both while also providing a supplementary means of evaluating their accuracy.
For all methods it is imperative to assess two further levels of uncertainty, beyond the uncertainty about which hypothesis is true given the evidence, which would remain even if we knew everything about the population probability distribution. The LR is a ratio of probabilities which are usually based on a model that is at best only a good approximation to the truth. Moreover, we can only estimate the parameters of that model by fitting it to the data in our database.
1. The fundamental problem of
Forensic Statistics
How to assess the evidential value
of a rare type match
Giulia Cereda, Université de Lausanne
Richard D. Gill, University of Leiden
2. The problem
• A crime
• A piece of evidence found at the crime scene
(DNA, fingerprint, footprint, handwriting, etc.)
• A suspect (identified independently)
• A match between suspect’s characteristic and
evidence’s characteristic.
• A database which counts the frequency of each
characteristic.
• The database frequency of the crime (and the
suspect) characteristic is 0
3. Example
• A DNA stain is found on the victim’s body.
• Y-STR profile of type h.
• A suspect is identified, who is also of Y-STR type
h.
• The reference Y-STR database does not contain
type h.
Small databases
4. Generalized-Good: a non-parametric Good-type
estimator based on Good (1953).
DiscLap method (Andersen et al. 2013)
Explore other methods (Brenner 2010, Roewer et al.
2000, …)
How to evaluate this kind of evidence?
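Good’s (1953) idea underlying the Generalized-Good estimator can be sketched in a few lines: the total probability mass of all as-yet-unseen types is estimated by N1/N, the proportion of singletons in the database. A minimal illustration on toy data (this is the Good-Turing unseen mass, not the full Generalized-Good estimator):

```python
from collections import Counter

def good_turing_unseen_mass(sample):
    """Good-Turing estimate of the total probability mass of
    species not yet observed: N1 / N, where N1 is the number
    of singletons and N the sample size (Good, 1953)."""
    counts = Counter(sample)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(sample)

# Toy database: 5 singletons, one doublet, one quadruplet (N = 11).
db = ["a", "b", "c", "d", "e", "f", "f", "g", "g", "g", "g"]
print(good_turing_unseen_mass(db))  # 5/11 ~ 0.4545
```

With N1 = 5 singletons in a database of size N = 11, the estimated chance that the next draw is a previously unseen type is 5/11.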
5. The Likelihood Ratio
E is the evidence to be evaluated
B is the background information
Hp: the suspect left the stain
Hd: someone else left the stain
Many possible choices:
THE likelihood ratio does not exist
6. Typical choice
• E= the particular haplotype of the suspect
and of the crime stain
• B=the list of haplotypes in the database
e.g. Discrete Laplace Method
7. This frequency is not known; it can only be estimated.
Uncertainty
e.g. DiscLap method
8. A different choice
• E = the number of times the haplotype of the
suspect (hs) and the haplotype of the crime
stain (hc) occur in the database, and whether or
not they are the same haplotype.
• B = the frequencies of the frequencies of the
database.
Ignore information about the particular
haplotype
9. • D database
Gotham City, 12,13,30,24,10,11,13
Gotham City, 12,13,30,24,10,11,14
Gotham City, 13,12,30,24,10,11,13
Gotham City, 13,13,29,23,10,11,13
Gotham City, 13,13,29,24,10,11,14
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
D’ database count
Gotham City, 12,13,30,24,10,11,13
1
Gotham City, 12,13,30,24,10,11,14
1
Gotham City, 13,12,30,24,10,11,13
1
Gotham City, 13,13,29,23,10,11,13
1
Gotham City, 13,13,29,24,10,11,14
1
Gotham City, 13,13,29,24,11,13,13
2
Gotham City, 13,13,30,24,10,11,13
4
The frequencies of frequencies
N1 5
N2 1
N3 0
N4 1
Df frequencies of frequencies
Information is discarded.
N1 is the number of haplotypes which occur
once in D (singletons)
N2 is the number of haplotypes which occur twice (doublets)
Etc.
10. A database D of size N
Gotham City, 12,13,30,24,10,11,13
Gotham City, 12,13,30,24,10,11,14
Gotham City, 13,12,30,24,10,11,13
Gotham City, 13,13,29,23,10,11,13
Gotham City, 13,13,29,24,10,11,14
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
can be considered as an
i.i.d. sample (Y1, Y2, …, YN ) from
species {1,2,…,s} with
probabilities (p1, p2, … ps).
The database count
Gotham City, 12,13,30,24,10,11,13 1
Gotham City, 12,13,30,24,10,11,14 1
Gotham City, 13,12,30,24,10,11,13 1
Gotham City, 13,13,29,23,10,11,13 1
Gotham City, 13,13,29,24,10,11,14 1
Gotham City, 13,13,29,24,11,13,13 2
Gotham City, 13,13,30,24,10,11,13 4
is a realization of the r.v. (X1, X2, …, Xs),
defined by Xj=#{i|Yi=j}.
The frequencies of frequencies
is made of (N1, N2,… )
where Nj=#{i|Xi=j}
N1 5
N2 1
N3 0
N4 1
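The counts Xj and the frequencies of frequencies Nj can be computed mechanically; a short sketch using the Gotham City toy data from the slide:

```python
from collections import Counter

# The eleven Gotham City haplotypes from the slide (N = 11).
db = [
    "12,13,30,24,10,11,13",
    "12,13,30,24,10,11,14",
    "13,12,30,24,10,11,13",
    "13,13,29,23,10,11,13",
    "13,13,29,24,10,11,14",
    "13,13,29,24,11,13,13",
    "13,13,29,24,11,13,13",
    "13,13,30,24,10,11,13",
    "13,13,30,24,10,11,13",
    "13,13,30,24,10,11,13",
    "13,13,30,24,10,11,13",
]

x = Counter(db)            # X_j: how often each haplotype occurs (D')
n = Counter(x.values())    # N_j: frequencies of frequencies (Df)

print(dict(n))  # {1: 5, 2: 1, 4: 1}  ->  N1=5, N2=1, N3=0, N4=1
```

Note how each step discards information: x forgets the order of the database, and n forgets which haplotype had which count.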
11. • E = the number of times the haplotype of the
suspect (hs) and the haplotype of the crime
stain (hc) occur in the database, and whether or
not they are the same haplotype.
• B = the frequencies of the frequencies of the
database (Df)
12.
13. An unbiased estimator exists for the numerator, and
an unbiased estimator for the denominator.
It is more sensible to estimate the reciprocal of the LR instead of the LR itself:
that estimator is approximately unbiased.
This suggests using its inverse
as an estimator for the LR.
14. How well does the estimator approximate the true (unknown) value?
Take a big database of size 12,727 and
consider it as the world population. Set C1=0, C2=0.
Then,
1. Sample a small database of size N=100+1+1.
2. If the 101st type is new relative to the small database, increase
C1=C1+1.
3. Check if the 101st is a new type equal to the 102nd; if so, C2=C2+1.
4. Repeat steps 1-3 M=10,000 times.
P1=C1/M, P2=C2/M.
This gives the distribution of the estimator over many replications of small
databases (size N=100) sampled from a bigger one (size N=12,727)
which we pretend is the population,
and from which we obtain the value 2.603:
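The resampling scheme above can be sketched as follows. The population list here is a simulated stand-in (hypothetical type labels), since the real 12,727-haplotype database is not reproduced on the slide:

```python
import random

def simulate(population, n=100, m=10_000, rng=random):
    """Resampling scheme from the slide: repeatedly draw a small
    database of size n plus two extra observations; count how often
    the (n+1)-th type is new (C1), and how often it is new AND
    equal to the (n+2)-th type (C2)."""
    c1 = c2 = 0
    for _ in range(m):
        draw = rng.sample(population, n + 2)
        small, h1, h2 = draw[:n], draw[n], draw[n + 1]
        if h1 not in small:          # 101st type unseen in the small database
            c1 += 1
            if h1 == h2:             # ... and equal to the 102nd draw
                c2 += 1
    return c1 / m, c2 / m            # P1, P2

# Hypothetical stand-in for the 12,727-haplotype "population" database.
rng = random.Random(0)
population = [f"type{rng.randint(1, 3000)}" for _ in range(12_727)]
p1, p2 = simulate(population, m=1_000, rng=rng)
print(p1, p2)
```

P2 is necessarily no larger than P1, since the second event is a sub-event of the first.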
15. Performance of the GG-method
We know the true value.
We sample 1000 databases of size 100 from the big one, and for
each we calculate the estimate:
17. How well does the estimated frequency approximate the true (unknown) frequency?
Distribution over many replications of small databases (size N=100)
and new haplotype sampled from a bigger one (size N=12,727).
For each database sampled, the true frequency of the new
haplotype h is taken equal to its frequency in the big database.
The estimated frequency is calculated using the Discrete
Laplace method with default options (iterations, init_y, …).
We calculate the distribution of the true and the estimated frequency for each
database and new haplotype sampled.
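For a single locus, the discrete Laplace distribution centred at allele y with dispersion 0 < p < 1 has pmf f(x) = (1-p)/(1+p) * p^|x-y|; Andersen et al. (2013) approximate a haplotype probability by a mixture, over components, of products of such terms across loci. A minimal single-component sketch with hypothetical numbers (not the disclapmix implementation):

```python
def disclap_pmf(x, y, p):
    """Discrete Laplace pmf centred at allele y with dispersion
    0 < p < 1: f(x) = (1 - p) / (1 + p) * p**abs(x - y)."""
    return (1 - p) / (1 + p) * p ** abs(x - y)

def haplotype_prob(haplotype, centres, ps):
    """Single-component approximation: product of per-locus discrete
    Laplace probabilities (Andersen et al. use a mixture of these)."""
    prob = 1.0
    for x, y, p in zip(haplotype, centres, ps):
        prob *= disclap_pmf(x, y, p)
    return prob

# Hypothetical numbers, for illustration only.
h = (13, 13, 30, 24)        # observed haplotype (4 loci)
centres = (13, 13, 29, 24)  # central haplotype of the component
ps = (0.3, 0.3, 0.3, 0.3)   # per-locus dispersion parameters
print(haplotype_prob(h, centres, ps))
```

Because the pmf is positive for every integer x, the model assigns a nonzero probability to haplotypes never seen in the database, which is what makes it usable for the rare type match problem.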
21. Remarks
Two more levels of uncertainty:
• whether or not the model M that we are
assuming for Pr is “correct enough”
• whether or not parameters of Pr in the model
M are “correct enough”
Basic uncertainty:
• whether or not the trace comes from the
suspect
22. Maybe DiscLap was never intended to be used for such
small databases.
Maybe DiscLap does better when used in more clever
ways, targeted to our purpose.
The error in the DiscLap method is given by two levels of
uncertainty:
• Population vs DiscLap
• Parameter estimation (within DiscLap)
The GG is a “model-free” method which thus has only one
level of uncertainty.
23. Conclusions
• The situation is more complex than it appears.
• Using more information gives a less accurate LR.
• Assuming less gives a more reliable LR.