The fundamental problem
of forensic statistics
Richard Gill, Giulia Cereda
ICFIS 2014
Conclusions (1)
• Throwing away the haplotype information reduces “true”
weight of evidence (log10 LR) typically from abou...
Conclusions (2)
• Much more work to do …
• Publications in preparation (Giulia Cereda and me)
• For MLE work on spectrum (...
The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match

Two new estimators of the evidential value of a rare haplotype match for a Y-STR DNA profile are proposed. One is based on the Good-Turing coverage estimator; the other is based on the Orlitsky et al. MLE of the spectrum of a probability distribution (the vector of probabilities ordered from large to small), computed from the observed spectrum (the vector of observed frequencies ordered from large to small). These two are compared with a model-based approach, the "discrete Laplace mixture model". "Less is more": it can pay off to discard some of the information (the actual haplotypes). The true likelihood ratio is then lower, but the precision with which it can be estimated is much, much better.

Published in: Science

  1. The fundamental problem of forensic statistics. Richard Gill, Giulia Cereda. ICFIS 2014. Taking account of three levels of uncertainty. When less can be more.
  2. If you are a cat / dog … (thanks to Paul Roberts). Statisticians/lawyers = cats/dogs/people?
  3. If you are a cat / dog …
     • Google “Ben Geen”
     • http://thejusticegap.com/2014/08/become-convicted-serial-killer-without-killing-anyone/
     • “How to become a convicted serial killer (without killing anyone)”, by yours truly
     • http://bengeen.wordpress.com/
  4. The fundamental problem of forensic statistics. Richard Gill, Giulia Cereda. ICFIS 2014. Taking account of three levels of uncertainty.
  5. First pages of two papers by Charles H. Brenner, labelled (2010) and (2014) on the slide.

     (2010) Charles H. Brenner, “Fundamental problem of forensic mathematics – The evidential value of a rare haplotype”, Forensic Science International: Genetics 4 (2010) 281–291.
     Affiliations: School of Public Health, Forensic Science Group, U.C. Berkeley, Berkeley, CA, United States; DNA·VIEW, 6801 Thornhill Drive, Oakland, CA 94611-1336, United States.
     Article history: received 16 June 2009; received in revised form 20 October 2009; accepted 21 October 2009.
     Keywords: Haplotype; Stain matching; mtDNA; Y-haplotype; Forensic mathematics; Likelihood ratio; Matching probability.
     Abstract: Y-chromosomal and mitochondrial haplotyping offer special advantages for criminal (and other) identification. For different reasons, each of them is sometimes detectable in a crime stain for which autosomal typing fails. But they also present special problems, including a fundamental mathematical one: When a rare haplotype is shared between suspect and crime scene, how strong is the evidence linking the two? Assume a reference population sample is available which contains n − 1 haplotypes. The most interesting situation as well as the most common one is that the crime scene haplotype was never observed in the population sample. The traditional tools of product rule and sample frequency are not useful when there are no components to multiply and the sample frequency is zero. A useful statistic is the fraction κ of the population sample that consists of “singletons” – of once-observed types. A simple argument shows that the probability for a random innocent suspect to match a previously unobserved crime scene type is (1 − κ)/n – distinctly less than 1/n, likely ten times less. The robust validity of this model is confirmed by testing it against a range of population models. This paper hinges above all on one key insight: probability is not frequency. The common but erroneous “frequency” approach adopts population frequency as a surrogate for matching probability and attempts the intractable problem of guessing how many instances exist of the specific haplotype at a certain crime. Probability, by contrast, depends by definition only on the available data. Hence if different haplotypes but with the same data occur in two different crimes, although the frequencies are different (and are hopelessly elusive), the matching probabilities are the same, and are not hard to find. © 2009 Elsevier Ireland Ltd. All rights reserved.
     1. Introduction. 1.1. Mitochondrial DNA and Y-chromosomes. In recent years there has been increasing interest in using Y-chromosomal haplotypes [1–5] or mtDNA for forensic identification. These haplotype systems are also much used for body identification, especially for old graves [6,7]. The advantages for some kinds of problems are considerable. Both methods are … rape sample. However, an mtDNA or a Y-chromosomal haplotype must be treated mathematically as a single indivisible (“atomic”) trait; so unlike those traditional DNA methods which examine several traits that are approximately independent of one another, no multiplication of probabilities is possible. Therefore it is vital to have a sound fundamental understanding of atomic trait matching probabilities in order to make a reasonable assessment of the strength of identification evidence.

     (2014) Charles H. Brenner, “Understanding Y haplotype matching probability”, Forensic Population Genetics - Original Research, Forensic Science International: Genetics 8 (2014) 233–243.
     Affiliations: Human Rights Center, U.C. Berkeley, Berkeley, CA, United States; DNA·VIEW, 6801 Thornhill Drive, Oakland, CA 94611-1336, United States.
     Article history: received 21 March 2013; received in revised form 6 October 2013; accepted 19 October 2013.
     Keywords: Haplotype; Y-haplotype; Likelihood ratio; Weight of evidence calculation; Probability; Model.
     Abstract: The Y haplotype population-genetic terrain is better explored from a fresh perspective rather than by analogy with the more familiar autosomal ideas. For haplotype matching probabilities, versus for autosomal matching probabilities, explicit attention to modelling – such as how evolution got us where we are – is much more important while consideration of population frequency is much less so. This paper explores, extends, and explains some of the concepts of “Fundamental problem of forensic mathematics – the evidential strength of a rare haplotype match” [1]. That earlier paper presented and validated a “kappa method” formula for the evidential strength when a suspect matches a previously unseen haplotype (such as a Y-haplotype) at the crime scene. Mathematical implications of the kappa method are intuitive and reasonable. Suspicions to the contrary raised in [2] rest on elementary errors. Critical to deriving the kappa method or any sensible evidential calculation is understanding that thinking about haplotype population frequency is a red herring; the pivotal question is one of matching …
     Closing points shown on the slide:
     7. Thinking about models is vital for understanding and developing a Y haplotype matching probability approach. Some autosomal practice acknowledges population substructure – good, but a mere academic exercise by comparison with the importance of theory for Y.
     8. Finally there is another distinction outside the ambit of this paper, the common concern about possible geographical clustering of Y haplotypes. To what extent clustering complicates evidential calculation and what to do about it are topics to explore.
     Acknowledgements: I thank David DeGusta for invaluable advice with the manuscript and the referees for prodding me to greater clarity.
     References:
     [1] C. Brenner, Fundamental problem of forensic mathematics – the evidential value of a rare haplotype, Forensic Sci. Int. Genet. 4 (2010) 281–291.
     [2] J. Buckleton, M. Krawczak, B. Weir, The interpretation of lineage markers in forensic DNA testing, Forensic Sci. Int. Genet. 5 (2011) 78–83.
     [3] M.M. Andersen, P.S. Eriksen, N. Morling, The discrete Laplace exponential family and estimation of Y-STR haplotype frequencies, J. Theor. Biol. 329 (2013) 39–51.
     [4] C. Cockerham, Variance of gene frequencies, Evolution 23 (1) (1969) 72–84.
  6. Three levels of uncertainty
     • Who did it? (prosecution vs. defence)
     • The probability model
     • The probabilities in the model
  7. The dogma
     • E = Evidence
     • B = Background
     • H, K = hypotheses of prosecution and defence
     • LR = Pr( E | B, H) / Pr( E | B, K)
     • Pr = Probability
  8. The dogma
     • The court informs the expert what is E, what is B, what are H and K
     • The expert simply computes LR, i.e., the expert “knows” Pr
  9. The practice
     • The expert selects from all the information provided, on the basis of their own expertise (common practice), what to take as E, B, etc.
     • The expert uses background knowledge (convention, genetics) to specify a family of possible probabilities, in other words a model $M = (\Pr(\,\cdot \mid \theta) : \theta \in \Theta)$
     • The expert uses a data-base D to estimate $\theta$, giving the estimate $\hat{\theta}$
     • The expert “computes” $\widehat{LR} = \Pr(E \mid B, H, \hat{\theta}) / \Pr(E \mid B, K, \hat{\theta})$
     Apologies to all Bayesians here for my shamelessly frequentist point of view
  10. Different experts make different choices, get different answers
      • In practice, an expert’s choices are all made simultaneously (convention, convenience, tractability, …)
      • The fact that M is only a model and $\hat{\theta}$ only a “best guess” is swept under the carpet
      • Sometimes that could be reasonable (i.e., not misleading)
  11. My thesis: let’s make this bug into a feature
      • Depending on the size and quality of the data-base, different choices of E and B, different choices of M, and different ways to estimate $\theta$ might give results which are more or less reliable
      • “Less is more”: (sometimes) by only taking account of less of the evidence, the final result may be much more reliable
  12. Extreme example: the rare haplotype problem
      • Database: let’s pretend it is a random sample of Y-STR haplotypes from a given population
      • Evidence: the haplotypes of suspect and perpetrator match
      • Rare haplotype problem: it’s a new haplotype
      • Prosecution: the suspect is the perpetrator
      • Defence: the suspect and the perpetrator are different; the match is purely due to chance
  13. Example (Ex. 1)
      # Supplementary data to the following publication
      # Roewer L, Croucher PJ, Willuweit S, Lu TT, Kayser M, Lessig R,
      # de Knijff P, Jobling MA, Tyler-Smith C, Krawczak M. 'Signature of
      # recent historical events in the European Y-chromosomal STR haplotype
      # distribution.' (2005) Hum Genet. 2005 Jan 20;
      # LINK: http://dx.doi.org/10.1007/s00439-004-1201-z
      # Licensed under the Academic Free License v. 2.1
      # LINK: http://www.opensource.org/licenses/afl-2.1.php

             pop               dys19  dys389i  dys389ii  dys390  dys391  dys392  dys393
      1      Albania              12       13        30      24      10      11      13
      2      Albania              12       13        30      24      10      11      14
      3      Albania              13       12        30      24      10      11      13
      4      Albania              13       13        29      23      10      11      13
      5      Albania              13       13        29      24      10      11      14
      6      Albania              13       13        29      24      11      13      13
      8      Albania              13       13        30      24      10      11      13
      9      Albania              13       13        30      24      10      11      13
      10     Albania              13       13        30      24      10      11      13
      …
      12734  Anatolia, Turkey     16       14        31      24      10      11      13
      12735  Anatolia, Turkey     16       14        32      24       9      11      13
      12736  Anatolia, Turkey     17       12        28      21      10      11      15
      12737  Anatolia, Turkey     17       13        30      21       9      11      13
      12738  Anatolia, Turkey     17       13        30      25      10      11      13
  14. [Figure: “Y-chromosome data-base, N=12727”. Ordered frequency (y-axis, 0 to 600) against index (x-axis, 0 to 2500). Annotations: N=12797; # distinct profiles: 2489; # singletons: 1397; # pairs: 379.]
  15. Dogma & current best practice
      • LR = 1 / Pr(haplotype)
      • Enormous literature with all kinds of ways to estimate Pr(haplotype)
      • Current hi-tech solution: Andersen et al. (2013)
      • Model: mixture of subpopulations; within each subpopulation, all locus repeat numbers are independent and discrete Laplace distributed
      • Given the number of subpopulations, estimate the model parameters by EM
      • Number of subpopulations estimated by BIC
      • LR calculated by plug-in
  16. Less is more
      • Consider the data-base as part of the evidence
      • Forget the haplotypes!!!
      • We are throwing away very relevant data, but we don’t know how to use it precisely (especially when the data-base is small and the case haplotype is new)
      • The discrete Laplace model is “just” a model
      • By increasing the number of subpopulations we can, in the limit, approximate any distribution … finding a good fit to the data in the database is not the same as finding a good estimate of the probability of an unusual new datapoint
  17. Forget the haplotypes
      • The database reduces to an ordered list (from large to small) of frequencies of the different observed haplotypes
      • Evidence: the database plus the fact of a new haplotype observed once (prosecution), twice (defence)
      Database D is reduced to the “spectrum of frequencies”. The unknown parameter $\theta$ is the “spectrum of probabilities”.
  18. Good-Turing type estimators
      • LR = ratio of
        • Pr(on extending the sample from size N to N+1, we see one new species)
        • Pr(on extending the sample from size N to N+2, we see one new species, twice)
  19. Good-Turing type estimators
      • What happens when we increase the database from size N to size N+1 (or to N+2) has almost the same probability as what happens when increasing from N − 1 (or N − 2) to N
      • Ignoring that distinction: all permutations of the original data-base were equally likely to have occurred, so # singletons / N is an unbiased estimator of the prosecution’s probability
      • (2 · # duplets / N) × (1 / (N − 1)) is an unbiased estimator of the defence’s probability
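The two estimators on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `good_turing_lr` and the toy database of letter-labelled haplotypes are hypothetical, and the estimate is undefined when the database contains no duplets.

```python
from collections import Counter

def good_turing_lr(database):
    """Good-Turing type estimates for a new-haplotype match.

    `database` is a list of hashable haplotype labels (hypothetical input format).
    Returns (prosecution probability, defence probability, their ratio).
    """
    n = len(database)
    counts = Counter(database)            # haplotype -> number of occurrences
    spectrum = Counter(counts.values())   # occurrence count -> number of haplotypes
    singletons = spectrum[1]
    duplets = spectrum[2]
    p_prosecution = singletons / n                 # # singletons / N
    p_defence = (2 * duplets / n) * (1 / (n - 1))  # (2 # duplets / N) x (1 / (N - 1))
    if p_defence == 0:
        raise ValueError("no duplets in the database; the ratio is undefined")
    return p_prosecution, p_defence, p_prosecution / p_defence

# Toy database: 10 observations, 3 singletons (d, e, f) and 2 duplets (b, c).
toy = ["a", "a", "a", "b", "b", "c", "c", "d", "e", "f"]
p_pros, p_def, lr = good_turing_lr(toy)
print(p_pros, p_def, lr)
```

For the toy database the ratio equals (N − 1) · # singletons / (2 · # duplets) = 9 · 3 / 4 = 6.75.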
  20. My Brenner-inspired LR
      $$\widehat{LR} = \frac{N \cdot N(1)}{2 \cdot N(2)}, \qquad LR = \frac{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k}{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k^2}$$
      N = size of database, N(1) = # singletons, N(2) = # duplets
  21. Another option
      • Estimate the spectrum of haplotype probabilities from the spectrum of observed haplotype frequencies by MLE: Orlitsky et al. (2004); Anevski, Gill, Zohren (2013)
      • Then estimate the prosecution and defence probabilities by plug-in from the MLE $\hat{p}_k$
      $$\frac{N \cdot N(1)}{2 \cdot N(2)}, \qquad LR = \frac{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k}{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k^2}, \qquad \widehat{LR} = \frac{\sum_{k=1}^{\infty} (1-\hat{p}_k)^N \, \hat{p}_k}{\sum_{k=1}^{\infty} (1-\hat{p}_k)^N \, \hat{p}_k^2}$$
  22. Example (Ex. 2)
      • We (visitors from Mars) go on safari. We see N = 3 mammals. Two are of one species and one of another (e.g., two lions, one tiger)
      • *Naive estimator* of the spectrum of probabilities is (2/3, 1/3, 0, 0, …)
      • MLE is (1/2, 1/2, 0, 0, …)
      $$\mathrm{lik}(p_1, p_2, p_3, \ldots) = p_1^2 p_2 + p_2^2 p_1 + \cdots; \qquad p_1 \ge p_2 \ge p_3 \ge \cdots, \quad \sum_k p_k = 1$$
      $$LR = \frac{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k}{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k^2}, \qquad \widehat{LR} = \frac{\sum_{k=1}^{\infty} (1-\hat{p}_k)^N \, \hat{p}_k}{\sum_{k=1}^{\infty} (1-\hat{p}_k)^N \, \hat{p}_k^2}$$
      Orlitsky et al. (2004); Anevski et al. (2013)
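The safari example can be checked numerically. The sketch below evaluates the likelihood on the slide, the sum of p_i² p_j over ordered pairs of distinct species, for the naive and the MLE spectra. The function name `pattern_likelihood` is hypothetical, and the grid search is restricted to two-species spectra (p, 1 − p), which is an assumption made only to keep the illustration short.

```python
from itertools import permutations

def pattern_likelihood(p):
    """Likelihood (up to a constant) of seeing one species twice and another once.

    `p` is a spectrum of species probabilities; trailing zeros can be omitted.
    permutations(p, 2) yields every ordered pair of distinct positions (i, j).
    """
    return sum(pi ** 2 * pj for pi, pj in permutations(p, 2))

naive = pattern_likelihood((2 / 3, 1 / 3))   # naive spectrum estimate
mle = pattern_likelihood((1 / 2, 1 / 2))     # MLE stated on the slide
uniform3 = pattern_likelihood((1 / 3, 1 / 3, 1 / 3))

# Grid search over two-species spectra (p, 1 - p) with p >= 1/2.
grid = [0.5 + i / 200 for i in range(101)]
best = max(grid, key=lambda p: pattern_likelihood((p, 1 - p)))

print(naive, mle, uniform3, best)
```

With two species the likelihood reduces to p(1 − p), which is why the maximum sits at p = 1/2 rather than at the observed proportions.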
  23. Example (Ex. 3)

      frequency    27  23  16  14  13  12  11  10   9   8   7   6   5   4   3    2     1
      replicates    1   1   2   1   1   1   2   2   1   3   2   6  11  33  71  253  1434

      Source shown on the slide: Chang Xuan Mao and Bruce G. Lindsay, “A Poisson model for the coverage problem with a genomic application”, Biometrika (2002), 89, 3, pp. 669–681. © 2002 Biometrika Trust. Printed in Great Britain.
      Chang Xuan Mao: Interdepartmental Group in Biostatistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, U.S.A., cmao@stat.berkeley.edu. Bruce G. Lindsay: Department of Statistics, Pennsylvania State University, University Park, Pennsylvania 16802-2111, U.S.A., bgl@psu.edu.
      Summary: Suppose a population has infinitely many individuals and is partitioned into unknown N disjoint classes. The sample coverage of a random sample from the population is the total proportion of the classes observed in the sample. This paper uses a nonparametric Poisson mixture model to give new understanding and results for inference on the sample coverage. The Poisson mixture model provides a simplified framework for inferring any general abundance-K coverage, the sum of the proportions of those classes that contribute exactly k individuals in the sample for some k in K, with K being a set of nonnegative integers. A new moment-based derivation of the well-known Turing estimators is presented. As an application, a gene-categorisation problem in genomic research is addressed. Since Turing's approach is a moment-based method, maximum likelihood estimation and minimum distance estimation are indicated as alternatives for the coverage problem. Finally, it will be shown that any Turing estimator is asymptotically fully efficient.
      Some key words: Digital gene expression; Poisson mixture; Sample coverage; Species.
      1. Introduction: Consider a population composed of infinitely many individuals, which can be considered as an approximation of real populations with finitely many individuals under specific situations, in particular when the number of individuals in a target population is very large. The population has been partitioned into N disjoint classes indexed by i = 1, 2, ..., N, with $\pi_i$ being the proportion of the ith class. The identity of each class and the parameter N are assumed to be unknown prior to the experiment. The unknown $\pi_i$'s are subject to the constraint $\sum_{i=1}^{N} \pi_i = 1$. A random sample of individuals is taken from the population. Let $X_i$ be the number of individuals from the ith class, called the frequency in the sample. If $X_i = 0$, then the ith class is not observed in the sample. It will be assumed that these zero frequencies are 'missed' random variables so that N, the 'sample size' of all the $X_i$'s, is unknown. Let $n_k$ be the number of classes with frequency k and s be the number of …

      [Figure, two panels: naive and mle estimates of the spectrum of probabilities. Tomato genes ordered by frequency of expression; $p_k$ = probability gene k is being expressed. x-axis: 0 to 2000; left panel y-axis: 0.000 to 0.010; right panel y-axis logarithmic, 1e-04 to 1e-02.]
      • N = 2586; 2.1 × 10^7 iterations of SA-MH-EM
      • naive: $\widehat{LR}$ = 2200; mle: $\widehat{LR}$ = 7400; Good-Turing = 7300
      • The mle has mass 0.3 spread thinly beyond 2000 (for optimisation, taken infinitely thin, infinitely far)
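Two of the numbers quoted for this example can be reproduced directly from the frequency table: the sample size, the Good-Turing value N · N(1) / (2 · N(2)), and the naive plug-in value obtained by inserting the observed relative frequencies into the LR formula. The following is a sketch under that reading of the slide; `plug_in_lr` is a hypothetical helper name, and the MLE value (7400) is not reproduced because it needs the SA-MH-EM optimisation.

```python
# Frequency table from the slide: `frequency` occurrences, shared by `replicates` genes.
frequency = [27, 23, 16, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
replicates = [1, 1, 2, 1, 1, 1, 2, 2, 1, 3, 2, 6, 11, 33, 71, 253, 1434]
spectrum = dict(zip(frequency, replicates))

# Sample size N = sum over the table of frequency x replicates.
n = sum(f * r for f, r in spectrum.items())

# Good-Turing type estimate: N * N(1) / (2 * N(2)).
good_turing = n * spectrum[1] / (2 * spectrum[2])

def plug_in_lr(probs_with_multiplicity, n):
    """Plug-in LR: sum (1-p)^N p / sum (1-p)^N p^2 over all classes.

    `probs_with_multiplicity` is a list of (probability, number of classes) pairs.
    """
    num = sum(m * (1 - p) ** n * p for p, m in probs_with_multiplicity)
    den = sum(m * (1 - p) ** n * p ** 2 for p, m in probs_with_multiplicity)
    return num / den

# Naive spectrum estimate: observed relative frequencies f / N.
naive = plug_in_lr([(f / n, r) for f, r in spectrum.items()], n)

print(n, round(good_turing), round(naive))
```

Rounded to two significant figures, the Good-Turing and naive values agree with the 7300 and 2200 quoted on the slide.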
  24. An experiment
      • Compare DISCLAP and Good-Turing
      • *Known* population
      • Small database
      • A new rare-haplotype match case
  25. Example (Ex. 1): the Roewer et al. (2005) supplementary data file, repeated from slide 13.
  26. [Figure repeated from slide 14: “Y-chromosome data-base, N=12727”, ordered frequency against index. Annotations: N=12797; # distinct profiles: 2489; # singletons: 1397; # pairs: 379.]
  27. An experiment
      • Truth = the 2005 Roewer et al. 7-locus database, with each of the 12738 haplotypes duplicated the same, very large, number of times
      • Data-base = random sample of size 100
      • Case = a new random haplotype, different from all in the data-base
      • Repeat this 1000 times
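The Good-Turing half of such an experiment can be sketched as follows. This is an illustration under loudly stated assumptions, not the authors' experiment: the population is a synthetic Zipf-like distribution over 2000 types rather than the Roewer et al. data, only 200 replicates are run, DISCLAP is not included, and the “true” LR is the reduced-evidence LR from the earlier slides (which depends only on the population and N, not on which new haplotype turns up). All names are hypothetical.

```python
import math
import random
from collections import Counter

def true_reduced_lr(probs, n):
    """True LR when only the spectrum is used: sum (1-p)^N p / sum (1-p)^N p^2."""
    num = sum((1 - p) ** n * p for p in probs)
    den = sum((1 - p) ** n * p ** 2 for p in probs)
    return num / den

def good_turing_lr(sample):
    """N * N(1) / (2 * N(2)); None when there are no singletons or no duplets."""
    spectrum = Counter(Counter(sample).values())
    if spectrum[1] == 0 or spectrum[2] == 0:
        return None
    return len(sample) * spectrum[1] / (2 * spectrum[2])

rng = random.Random(2014)  # fixed seed so the sketch is repeatable

# Synthetic, known population (assumption): Zipf-like weights over 2000 types.
weights = [1 / (k + 1) for k in range(2000)]
total = sum(weights)
probs = [w / total for w in weights]
types = range(len(probs))

n, replicates = 100, 200
truth = math.log10(true_reduced_lr(probs, n))

errors = []  # estimated minus true log10 LR, one entry per usable replicate
for _ in range(replicates):
    database = rng.choices(types, weights=probs, k=n)
    estimate = good_turing_lr(database)
    if estimate is not None:
        errors.append(math.log10(estimate) - truth)

mean_error = sum(errors) / len(errors)
print(f"true log10 LR = {truth:.2f}, mean error = {mean_error:.2f}")
```

Replicates with no singletons or no duplets are skipped, which is a design choice of this sketch; a real study would need to decide how to handle them.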
  28. • The DISCLAP model is not “true”, except as an ever better approximation with a larger and larger number of subpopulations
      • So it suffers from both further levels of uncertainty: “wrong model”; “unknown model parameters”
      • The Good-Turing model is “true”, so it only suffers from one level of uncertainty: “unknown model parameters”
      • The database is very small (N = 100)
      • We were “dumb users”: we used DISCLAP out of the box, with default parameters
      The experiment is biased against DISCLAP
  29. Performance of Good-Turing. [Figure: y-axis: log10 LR (log10(LRg_est), 2.4 to 3.6); x-axis: replicate (Index, 0 to 1000). Values marked on the slide: 2.6, 2.3, 3.6.]
  30. 30. Performance of DISCLAP [Figure: superimposed histograms. x-axis: x, ticks from 2 to 10; y-axis: frequency, 0 to 250. Labels on the slide: "estimated log10 LR", "true log10 LR", "<< true log10 LR", "Good-Turing".]
  31. 31. Comparison DISCLAP : Good-Turing [Figure: two scatter plots of estimated minus true log10 LR against replicate index. y-axis: ticks 0 to 6; x-axis: 1000 replicates, 0 to 1000. Panels: log10(Ratio_Gill) and log10(Ratio_Andersen).]
  32. 32. Conclusions (1)
• Throwing away the haplotype information reduces the “true” weight of evidence (log10 LR), typically from about 3.5 to about 2.5
• Good-Turing is fairly unbiased and has rather small variance (typical error about 0.2, rarely 0.5)
• DISCLAP “out of the box” is biased in favour of the prosecution by an order of magnitude (error in log10 LR of about 1.0), and the errors are heavily skewed to the right
• The defence could therefore object to the use of DISCLAP (in this way…) and their objection ought to be accepted
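To make the numbers in these conclusions concrete, here is a minimal Python sketch. It is not the estimator used in the simulations on the slides. It only shows the classical Good-Turing estimate of the unseen probability mass (the number of types seen exactly once, divided by the sample size), plus the conversion of an error in log10 LR into a multiplicative factor on the LR. The function names and the toy database are hypothetical and exist only for illustration.

```python
from collections import Counter


def good_turing_unseen_mass(sample):
    """Classical Good-Turing estimate of the total probability of all
    types NOT present in the sample: (number of singletons) / (sample size)."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    # N1 = number of distinct types observed exactly once
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / n


def log10_error_to_factor(err):
    """Turn an additive error on the log10 LR scale into a
    multiplicative factor on the LR itself."""
    return 10.0 ** err


# Toy "database" of haplotype labels (hypothetical data, not from the slides)
database = ["h1", "h1", "h2", "h3", "h3", "h3", "h4", "h5"]
unseen = good_turing_unseen_mass(database)  # 3 singletons out of 8 observations

# Error of about 1.0 in log10 LR (DISCLAP) versus about 0.2 (Good-Turing)
disclap_factor = log10_error_to_factor(1.0)
good_turing_factor = log10_error_to_factor(0.2)
```

On this reading, an error of about 1.0 in log10 LR overstates the LR tenfold, while a typical Good-Turing error of about 0.2 corresponds to a factor of roughly 1.6.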
  33. 33. Conclusions (2)
• Much more work to do …
• Publications in preparation (Giulia Cereda and me)
• For MLE work on spectrum (cf. Mao example…)
• Something completely different, for the cats/dogs here: Google “Ben Geen”
http://arxiv.org/abs/1312.1200 Estimating a probability mass function with unknown labels. Dragi Anevski, Richard D. Gill, Stefan Zohren (building on Orlitsky et al., 2004)
