When using Y-chromosome DNA profiles, it often happens that a DNA profile found at a crime scene and matching the suspect’s profile does not appear in the relevant database. This poses a serious challenge to the analyst, who is required to supply a likelihood ratio (LR) or match probability in order to quantify the evidential value of the match. Sensible estimation of the LR seems to rely on sensible estimation of the population frequency of this previously unseen haplotype.
There are three existing proposals of quite different nature: Roewer et al. (2000), based on Bayesian estimation of the haplotype frequency with a Beta prior; Brenner (2010), based on the number of singletons observed in the database; and Andersen et al. (2013), which uses a mixture of independent discrete Laplace distributions as a parametric approximation of the distribution of allelic frequencies.
We add two new methods. One is similar to Brenner’s, and like Brenner’s is strongly related to the Good-Turing estimator. The second is based on Anevski, Gill and Zohren’s (arXiv.org/math.ST:1312.1200) study of a non-parametric maximum-likelihood estimator. It is intermediate between the parametric approach of Andersen et al. and non-parametric methods based on Good-Turing estimators. We believe that it avoids the disadvantages of both while also providing a supplementary means of evaluating their accuracy.
For all methods it is imperative to assess two further levels of uncertainty, beyond the uncertainty about which hypothesis is true given the evidence, which would remain even if we knew everything about the population probability distribution. The LR is a ratio of probabilities which are usually based on a model that is at best only a good approximation to the truth. Moreover, we can only estimate the parameters of that model by fitting it to the data in our database.
1. The fundamental problem of
Forensic Statistics
How to assess the evidential value
of a rare type match
Giulia Cereda, Université de Lausanne
Richard D. Gill, University of Leiden
2. The problem
• A crime
• A piece of evidence found at the crime scene
(DNA, fingerprint, footprint, handwriting, etc.)
• A suspect (identified independently)
• A match between suspect’s characteristic and
evidence’s characteristic.
• A database which counts the frequency of each
characteristic.
• The database frequency of the crime (and the
suspect) characteristic is 0
3. Example
• A DNA stain is found on the victim’s body.
• Y-STR profile of type h.
• A suspect is identified, who is also of Y-STR type
h.
• The reference Y-STR database does not contain
type h.
Small databases
4. Generalized-Good: a non-parametric Good-type
estimator based on Good (1953).
DiscLap method (Andersen et al. 2013)
Explore other methods (Brenner 2010, Roewer et al.
2000, …)
How to evaluate this kind of evidence?
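Good’s (1953) idea underlying the Generalized-Good estimator can be sketched in a few lines: the total probability mass of all as-yet-unseen types is estimated by N1/N, the proportion of singletons in the database. A minimal illustration on toy data (this is the Good-Turing unseen mass, not the full Generalized-Good estimator):

```python
from collections import Counter

def good_turing_unseen_mass(sample):
    """Good-Turing estimate of the total probability mass of
    species not yet observed: N1 / N, where N1 is the number
    of singletons and N the sample size (Good, 1953)."""
    counts = Counter(sample)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(sample)

# Toy database: 5 singletons, one doublet, one quadruplet (N = 11).
db = ["a", "b", "c", "d", "e", "f", "f", "g", "g", "g", "g"]
print(good_turing_unseen_mass(db))  # 5/11 ~ 0.4545
```

With N1 = 5 singletons in a database of size N = 11, the estimated chance that the next draw is a previously unseen type is 5/11.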
5. The Likelihood Ratio
E is the evidence to be evaluated
B is the background information
Hp: the suspect left the stain
Hd: someone else left the stain
Many possible choices:
THE likelihood ratio does not exist
6. Typical choice
• E= the particular haplotype of the suspect
and of the crime stain
• B=the list of haplotypes in the database
e.g. Discrete Laplace Method
7. This frequency is not known; it can only be estimated.
Uncertainty
e.g. DiscLap method
8. A different choice
• E = the number of times the haplotype of the
suspect (hs) and the haplotype of the crime
stain (hc) occur in the database, and whether or
not they are the same haplotype.
• B = the frequencies of the frequencies of the
database.
Ignore information about the particular
haplotype
9. • D database
Gotham City, 12,13,30,24,10,11,13
Gotham City, 12,13,30,24,10,11,14
Gotham City, 13,12,30,24,10,11,13
Gotham City, 13,13,29,23,10,11,13
Gotham City, 13,13,29,24,10,11,14
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
D’ database count
Gotham City, 12,13,30,24,10,11,13
1
Gotham City, 12,13,30,24,10,11,14
1
Gotham City, 13,12,30,24,10,11,13
1
Gotham City, 13,13,29,23,10,11,13
1
Gotham City, 13,13,29,24,10,11,14
1
Gotham City, 13,13,29,24,11,13,13
2
Gotham City, 13,13,30,24,10,11,13
4
The frequencies of frequencies
N1 5
N2 1
N3 0
N4 1
Df frequencies of frequencies
Information is discarded.
N1 is the number of haplotypes which occur
once in D (singletons)
N2 is the number of haplotypes which occur twice (doublets)
Etc.
10. A database D of size N
Gotham City, 12,13,30,24,10,11,13
Gotham City, 12,13,30,24,10,11,14
Gotham City, 13,12,30,24,10,11,13
Gotham City, 13,13,29,23,10,11,13
Gotham City, 13,13,29,24,10,11,14
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
can be considered as an
i.i.d. sample (Y1, Y2, …, YN ) from
species {1,2,…,s} with
probabilities (p1, p2, … ps).
The database count
Gotham City, 12,13,30,24,10,11,13 1
Gotham City, 12,13,30,24,10,11,14 1
Gotham City, 13,12,30,24,10,11,13 1
Gotham City, 13,13,29,23,10,11,13 1
Gotham City, 13,13,29,24,10,11,14 1
Gotham City, 13,13,29,24,11,13,13 2
Gotham City, 13,13,30,24,10,11,13 4
is a realization of the r.v. (X1, X2, …, Xs),
defined by Xj=#{i|Yi=j}.
The frequencies of frequencies
is made of (N1, N2,… )
where Nj=#{i|Xi=j}
N1 5
N2 1
N3 0
N4 1
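The counts Xj and the frequencies of frequencies Nj can be computed mechanically; a short sketch using the Gotham City toy data from the slide:

```python
from collections import Counter

# The eleven Gotham City haplotypes from the slide (N = 11).
db = [
    "12,13,30,24,10,11,13",
    "12,13,30,24,10,11,14",
    "13,12,30,24,10,11,13",
    "13,13,29,23,10,11,13",
    "13,13,29,24,10,11,14",
    "13,13,29,24,11,13,13",
    "13,13,29,24,11,13,13",
    "13,13,30,24,10,11,13",
    "13,13,30,24,10,11,13",
    "13,13,30,24,10,11,13",
    "13,13,30,24,10,11,13",
]

x = Counter(db)            # X_j: how often each haplotype occurs (D')
n = Counter(x.values())    # N_j: frequencies of frequencies (Df)

print(dict(n))  # {1: 5, 2: 1, 4: 1}  ->  N1=5, N2=1, N3=0, N4=1
```

Note how each step discards information: x forgets the order of the database, and n forgets which haplotype had which count.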
11. • E = the number of times the haplotype of the
suspect (hs) and the haplotype of the crime
stain (hc) occur in the database, and whether or
not they are the same haplotype.
• B = the frequencies of the frequencies of the
database (Df)
12.
13. An unbiased estimator exists for the numerator, and
an unbiased estimator for the denominator.
It is more sensible to estimate the reciprocal of the LR instead of the LR itself:
that estimator is approximately unbiased.
This suggests using its inverse
as an estimator for the LR.
14. How well does the estimator approximate the true (unknown) value?
Take a big database of size 12,727 and
consider it as the world population. Set C1=0, C2=0.
Then,
1. Sample a small database of size N=100+1+1.
2. If the 101st type is new relative to the small database, increase
C1=C1+1.
3. Check if the 101st is a new type equal to the 102nd; if so, C2=C2+1.
4. Repeat steps 1-3 M=10,000 times.
P1=C1/M, P2=C2/M.
This gives the distribution of the estimator over many replications of small
databases (size N=100) sampled from a bigger one (size N=12,727)
which we pretend is the population,
and from which we obtain the value 2.603:
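The resampling scheme above can be sketched as follows. The population list here is a simulated stand-in (hypothetical type labels), since the real 12,727-haplotype database is not reproduced on the slide:

```python
import random

def simulate(population, n=100, m=10_000, rng=random):
    """Resampling scheme from the slide: repeatedly draw a small
    database of size n plus two extra observations; count how often
    the (n+1)-th type is new (C1), and how often it is new AND
    equal to the (n+2)-th type (C2)."""
    c1 = c2 = 0
    for _ in range(m):
        draw = rng.sample(population, n + 2)
        small, h1, h2 = draw[:n], draw[n], draw[n + 1]
        if h1 not in small:          # 101st type unseen in the small database
            c1 += 1
            if h1 == h2:             # ... and equal to the 102nd draw
                c2 += 1
    return c1 / m, c2 / m            # P1, P2

# Hypothetical stand-in for the 12,727-haplotype "population" database.
rng = random.Random(0)
population = [f"type{rng.randint(1, 3000)}" for _ in range(12_727)]
p1, p2 = simulate(population, m=1_000, rng=rng)
print(p1, p2)
```

P2 is necessarily no larger than P1, since the second event is a sub-event of the first.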
15. Performance of the GG-method
We know the true value.
We sample 1000 databases of size 100 from the big one, and for
each we calculate the estimate:
17. How well does the estimated frequency approximate the true (unknown) frequency?
Distribution over many replications of small databases (size N=100)
and new haplotype sampled from a bigger one (size N=12,727).
For each database sampled, the true frequency of the new
haplotype h is taken equal to its frequency in the big database.
The estimated frequency is calculated using the Discrete
Laplace method with default options (iterations, init_y, …).
We calculate the distribution of the true and the estimated frequency for each
database and new haplotype sampled.
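For a single locus, the discrete Laplace distribution centred at allele y with dispersion 0 < p < 1 has pmf f(x) = (1-p)/(1+p) * p^|x-y|; Andersen et al. (2013) approximate a haplotype probability by a mixture, over components, of products of such terms across loci. A minimal single-component sketch with hypothetical numbers (not the disclapmix implementation):

```python
def disclap_pmf(x, y, p):
    """Discrete Laplace pmf centred at allele y with dispersion
    0 < p < 1: f(x) = (1 - p) / (1 + p) * p**abs(x - y)."""
    return (1 - p) / (1 + p) * p ** abs(x - y)

def haplotype_prob(haplotype, centres, ps):
    """Single-component approximation: product of per-locus discrete
    Laplace probabilities (Andersen et al. use a mixture of these)."""
    prob = 1.0
    for x, y, p in zip(haplotype, centres, ps):
        prob *= disclap_pmf(x, y, p)
    return prob

# Hypothetical numbers, for illustration only.
h = (13, 13, 30, 24)        # observed haplotype (4 loci)
centres = (13, 13, 29, 24)  # central haplotype of the component
ps = (0.3, 0.3, 0.3, 0.3)   # per-locus dispersion parameters
print(haplotype_prob(h, centres, ps))
```

Because the pmf is positive for every integer x, the model assigns a nonzero probability to haplotypes never seen in the database, which is what makes it usable for the rare type match problem.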
21. Remarks
Two more levels of uncertainty:
• whether or not the model M that we are
assuming for Pr is “correct enough”
• whether or not parameters of Pr in the model
M are “correct enough”
Basic uncertainty:
• whether or not the trace comes from the
suspect
22. Maybe DiscLap was never intended to be used for such
small databases.
Maybe DiscLap does better when used in more clever
ways, targeted to our purpose.
The error in the DiscLap method is given by two levels of
uncertainty:
• Population vs DiscLap
• Parameter estimation (within DiscLap)
The GG is a “model-free” method which thus has only one
level of uncertainty.
23. Conclusions
• The situation is more complex than it appears.
• Using more information gives a less accurate LR.
• Assuming less gives a more reliable LR.