
- 1. The fundamental problem of forensic statistics: Taking account of three levels of uncertainty. When less can be more. Richard Gill, Giulia Cereda, ICFIS 2014
- 2. If you are a cat / dog … (thanks to Paul Roberts) statisticians/lawyers = cats/dogs/people?
- 3. If you are a cat / dog … • Google “Ben Geen” • http://thejusticegap.com/2014/08/become-convicted-serial-killer-without-killing-anyone/ • How to become a convicted serial killer (without killing anyone), by yours truly • http://bengeen.wordpress.com/
- 4. The fundamental problem of forensic statistics Richard Gill, Giulia Cereda ICFIS 2014 Taking account of three levels of uncertainty
- 5. (2010) Fundamental problem of forensic mathematics – The evidential value of a rare haplotype. Charles H. Brenner (School of Public Health, Forensic Science Group, U.C. Berkeley, Berkeley, CA, United States; DNA·VIEW, 6801 Thornhill Drive, Oakland, CA 94611-1336, United States). Forensic Science International: Genetics 4 (2010) 281–291. Article history: received 16 June 2009; received in revised form 20 October 2009; accepted 21 October 2009. Keywords: Haplotype; Stain matching; mtDNA; Y-haplotype; Forensic mathematics; Likelihood ratio; Matching probability.
Abstract: Y-chromosomal and mitochondrial haplotyping offer special advantages for criminal (and other) identification. For different reasons, each of them is sometimes detectable in a crime stain for which autosomal typing fails. But they also present special problems, including a fundamental mathematical one: When a rare haplotype is shared between suspect and crime scene, how strong is the evidence linking the two? Assume a reference population sample is available which contains n − 1 haplotypes. The most interesting situation as well as the most common one is that the crime scene haplotype was never observed in the population sample. The traditional tools of product rule and sample frequency are not useful when there are no components to multiply and the sample frequency is zero. A useful statistic is the fraction κ of the population sample that consists of “singletons” – of once-observed types. A simple argument shows that the probability for a random innocent suspect to match a previously unobserved crime scene type is (1 − κ)/n – distinctly less than 1/n, likely ten times less. The robust validity of this model is confirmed by testing it against a range of population models. This paper hinges above all on one key insight: probability is not frequency. The common but erroneous “frequency” approach adopts population frequency as a surrogate for matching probability and attempts the intractable problem of guessing how many instances exist of the specific haplotype at a certain crime. Probability, by contrast, depends by definition only on the available data. Hence if different haplotypes but with the same data occur in two different crimes, although the frequencies are different (and are hopelessly elusive), the matching probabilities are the same, and are not hard to find. © 2009 Elsevier Ireland Ltd. All rights reserved.
1. Introduction. 1.1. Mitochondrial DNA and Y-chromosomes. In recent years there has been increasing interest in using Y-chromosomal haplotypes [1–5] or mtDNA for forensic identification. These haplotype systems are also much used for body identification, especially for old graves [6,7]. The advantages for some kinds of problems are considerable. Both methods are … rape sample. However, an mtDNA or a Y-chromosomal haplotype must be treated mathematically as a single indivisible (“atomic”) trait; so unlike those traditional DNA methods which examine several traits that are approximately independent of one another, no multiplication of probabilities is possible. Therefore it is vital to have a sound fundamental understanding of atomic trait matching probabilities in order to make a reasonable assessment of the strength of identification evidence.
(2014) Forensic Population Genetics - Original Research. Understanding Y haplotype matching probability. Charles H. Brenner (Human Rights Center, U.C. Berkeley, Berkeley, CA, United States; DNA·VIEW, 6801 Thornhill Drive, Oakland, CA 94611-1336, United States). Forensic Science International: Genetics 8 (2014) 233–243. Article history: received 21 March 2013; received in revised form 6 October 2013; accepted 19 October 2013. Keywords: Haplotype; Y-haplotype; Likelihood ratio; Weight of evidence calculation; Probability; Model.
Abstract: The Y haplotype population-genetic terrain is better explored from a fresh perspective rather than by analogy with the more familiar autosomal ideas. For haplotype matching probabilities, versus for autosomal matching probabilities, explicit attention to modelling – such as how evolution got us where we are – is much more important while consideration of population frequency is much less so. This paper explores, extends, and explains some of the concepts of “Fundamental problem of forensic mathematics – the evidential strength of a rare haplotype match” [1]. That earlier paper presented and validated a “kappa method” formula for the evidential strength when a suspect matches a previously unseen haplotype (such as a Y-haplotype) at the crime scene. Mathematical implications of the kappa method are intuitive and reasonable. Suspicions to the contrary raised in [2] rest on elementary errors. Critical to deriving the kappa method or any sensible evidential calculation is understanding that thinking about haplotype population frequency is a red herring; the pivotal question is one of matching …
7. Thinking about models is vital for understanding and developing a Y haplotype matching probability approach. Some autosomal practice acknowledges population substructure – good, but a mere academic exercise by comparison with the importance of theory for Y.
8. Finally there is another distinction outside the ambit of this paper, the common concern about possible geographical clustering of Y haplotypes. To what extent clustering complicates evidential calculation and what to do about it are topics to explore.
Acknowledgements: I thank David DeGusta for invaluable advice with the manuscript and the referees for prodding me to greater clarity.
References:
[1] C. Brenner, Fundamental problem of forensic mathematics – the evidential value of a rare haplotype, Forensic Sci. Int. Genet. 4 (2010) 281–291.
[2] J. Buckleton, M. Krawczak, B. Weir, The interpretation of lineage markers in forensic DNA testing, Forensic Sci. Int. Genet. 5 (2011) 78–83.
[3] M.M. Andersen, P.S. Eriksen, N. Morling, The discrete Laplace exponential family and estimation of Y-STR haplotype frequencies, J. Theor. Biol. 329 (2013) 39–51.
[4] C. Cockerham, Variance of gene frequencies, Evolution 23 (1) (1969) 72–84.
- 6. Three levels of uncertainty • Who did it? (prosecution vs. defence) • The probability model • The probabilities in the model
- 7. The dogma • E = Evidence • B = Background • H, K = hypotheses of prosecution and defence • LR = Pr( E | B, H) / Pr( E | B, K) • Pr = Probability
- 8. The dogma • The court informs the expert what is E, what is B, what are H and K • The expert simply computes LR, ie the expert “knows” Pr
- 9. The practice • The expert selects from all the information given to him, on the basis of his own expertise (common practice), what he is going to take as E, B etc • She uses background knowledge (convention, genetics) to specify a family of possible probabilities, in other words a model M = ( Pr( … | 𝜽 ) : 𝜽 in 𝚹 ) • She uses a data-base D to estimate 𝜽, giving estimate $\hat{\theta}$ • She “computes” $\widehat{LR} = \Pr(E \mid B, H, \hat{\theta}) \,/\, \Pr(E \mid B, K, \hat{\theta})$ • Apologies to all Bayesians here for my shamelessly frequentist point of view
- 10. Different experts make different choices, get different answers • In practice, an expert’s choices are all made simultaneously (convention, convenience, tractability, … ) • The fact that M is only a model and $\hat{\theta}$ only a “best guess” is swept under the carpet • Sometimes that could be reasonable (i.e., not misleading)
- 11. My thesis: let’s make this bug into a feature • Depending on the size and quality of the data-base, different choices of E and B, different choices of M, and different ways to estimate 𝜽 might give results which are more or less reliable • “Less is more”: (sometimes) by only taking account of less of the evidence, the final result may be much more reliable
- 12. Extreme example: the rare haplotype problem • Database - let’s pretend it is a random sample of Y-STR haplotypes from a given population • Evidence - the haplotypes of suspect and perpetrator match • Rare haplotype problem: it’s a new haplotype • Prosecution - the suspect is the perpetrator • Defence - the suspect and the perpetrator are different; the match is due purely to chance
- 13. Example (Ex. 1)
      # Supplementary data to the following publication
      # Roewer L, Croucher PJ, Willuweit S, Lu TT, Kayser M, Lessig R,
      # de Knijff P, Jobling MA, Tyler-Smith C, Krawczak M. 'Signature of
      # recent historical events in the European Y-chromosomal STR haplotype
      # distribution.' (2005) Hum Genet. 2005 Jan 20;
      # LINK: http://dx.doi.org/10.1007/s00439-004-1201-z
      # Licensed under the Academic Free License v. 2.1
      # LINK: http://www.opensource.org/licenses/afl-2.1.php
             pop               dys19 dys389i dys389ii dys390 dys391 dys392 dys393
      1      Albania              12      13       30     24     10     11     13
      2      Albania              12      13       30     24     10     11     14
      3      Albania              13      12       30     24     10     11     13
      4      Albania              13      13       29     23     10     11     13
      5      Albania              13      13       29     24     10     11     14
      6      Albania              13      13       29     24     11     13     13
      8      Albania              13      13       30     24     10     11     13
      9      Albania              13      13       30     24     10     11     13
      10     Albania              13      13       30     24     10     11     13
      ...
      12734  Anatolia, Turkey     16      14       31     24     10     11     13
      12735  Anatolia, Turkey     16      14       32     24      9     11     13
      12736  Anatolia, Turkey     17      12       28     21     10     11     15
      12737  Anatolia, Turkey     17      13       30     21      9     11     13
      12738  Anatolia, Turkey     17      13       30     25     10     11     13
- 14. [Figure: “Y-chromosome data-base, N=12727”. x-axis: index (0 to 2500); y-axis: ordered frequency (0 to 600).] N=12797 • # distinct profiles: 2489 • # singletons: 1397 • # pairs: 379
- 15. Dogma & current best practice • LR = 1 / Pr(haplotype) • Enormous literature with all kinds of ways to estimate Pr(haplotype) • Current hi-tech solution: Andersen et al. (2013) • Model: mixture of subpopulations; within each subpopulation, all locus repeat numbers independent discrete Laplace distributed • Given # subpopulations, estimate model parameters by EM • Number of subpopulations estimated by BIC • LR calculated by plug-in
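To make the plug-in step on this slide concrete, here is a minimal sketch of how a haplotype probability could be evaluated under a mixture of subpopulations with independent discrete Laplace loci, and how LR = 1 / Pr(haplotype) follows. The discrete Laplace probability mass function is the standard one; every parameter value below (mixture weights, central haplotypes, dispersions) is made up for illustration, and the EM fitting and BIC selection are not shown. The function names are hypothetical and are not the API of any DISCLAP software.

```python
from math import prod


def discrete_laplace_pmf(x, center, p):
    """P(X = x) = (1 - p) / (1 + p) * p**|x - center| for integer x, 0 < p < 1."""
    return (1 - p) / (1 + p) * p ** abs(x - center)


def haplotype_probability(haplotype, weights, centers, dispersions):
    """Mixture over subpopulations; loci are independent within a subpopulation."""
    total = 0.0
    for w, center, disp in zip(weights, centers, dispersions):
        total += w * prod(
            discrete_laplace_pmf(x, c, p)
            for x, c, p in zip(haplotype, center, disp)
        )
    return total


# Hypothetical fitted parameters for two subpopulations and seven loci.
weights = [0.6, 0.4]
centers = [(13, 13, 30, 24, 10, 11, 13), (16, 14, 31, 24, 10, 11, 13)]
dispersions = [(0.3,) * 7, (0.2,) * 7]

case_haplotype = (13, 13, 29, 24, 10, 11, 13)
p_case = haplotype_probability(case_haplotype, weights, centers, dispersions)
lr_plugin = 1 / p_case  # LR = 1 / Pr(haplotype), evaluated at the fitted parameters
print(f"Pr(haplotype) = {p_case:.3e}, plug-in LR = {lr_plugin:.1f}")
```

The sketch shows why the answer depends on the model: the probability of a haplotype never seen in the database is read entirely off the fitted parameters.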
- 16. Less is more • Consider the data-base as part of the evidence • Forget the haplotypes !!! • We are throwing away very relevant data but we don’t know how to use it precisely (esp. when data-base is small, case haplotype is new) • Discrete Laplace model is “just” a model • By increasing the number of subpopulations we can in the limit approximate any distribution … finding a good fit to the data in the database is not the same as finding a good estimate of the probability of an unusual new datapoint
- 17. Forget the haplotypes • Database reduces to ordered list (from large to small) of frequencies of different observed haplotypes • Evidence: database plus fact of a new haplotype observed once (prosecution), twice (defence) Database D is reduced to “spectrum of frequencies” Unknown parameter 𝜽 = “spectrum of probabilities”
- 18. Good-Turing type estimators • LR = ratio of • Pr( on extending sample from size N to N+1, we see one new species) • Pr( on extending sample from size N to N+2, we see one new species, twice)
- 19. Good-Turing type estimators • What happens when we increase the database from size N to size N+1 (or to N+2) has almost the same probability as what happens when increasing it from N–1 (or N–2) to N • Ignoring that distinction: • All permutations of the original data-base were equally likely to have occurred, so # singletons / N is an unbiased estimator of the prosecution’s probability • (2 # duplets / N) x (1 / (N–1)) is an unbiased estimator of the defence’s probability
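As a short worked step, dividing the two estimators from this slide gives the likelihood-ratio estimate. Here $N_{(1)}$ denotes the number of singletons and $N_{(2)}$ the number of duplets; the subscripted estimator names are introduced only for this derivation.

```latex
\[
\widehat{\Pr}_{\text{prosecution}} = \frac{N_{(1)}}{N},
\qquad
\widehat{\Pr}_{\text{defence}} = \frac{2\,N_{(2)}}{N}\cdot\frac{1}{N-1},
\]
\[
\frac{\widehat{\Pr}_{\text{prosecution}}}{\widehat{\Pr}_{\text{defence}}}
= \frac{N_{(1)}\,(N-1)}{2\,N_{(2)}}
\approx \frac{N \cdot N_{(1)}}{2 \cdot N_{(2)}} .
\]
```

The final approximation replaces $N-1$ by $N$, which is harmless for databases of realistic size.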
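- 20. My Brenner-inspired LR: $\widehat{LR} = \dfrac{N \cdot N_{(1)}}{2 \cdot N_{(2)}}$, an estimate of $LR = \dfrac{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k}{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k^2}$ • N = size of database, $N_{(1)}$ = # singletons, $N_{(2)}$ = # duplets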
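A minimal sketch of the estimate N · N(1) / (2 · N(2)), computed directly from a list of observed types. The function name and the toy database are illustrative assumptions; the case of a database with no duplets, where the estimate is undefined, is rejected explicitly.

```python
from collections import Counter


def good_turing_lr(database):
    """Estimate LR as N * N1 / (2 * N2) from a list of observed types."""
    type_counts = Counter(database)          # how often each type occurs
    spectrum = Counter(type_counts.values()) # how many types occur j times
    n = len(database)
    n1, n2 = spectrum[1], spectrum[2]
    if n2 == 0:
        raise ValueError("no duplets in the database: N*N1/(2*N2) is undefined")
    return n * n1 / (2 * n2)


# Toy database: N = 10, singletons {c, d, e}, duplets {a, b}, one triplet {f}.
toy_database = ["a", "a", "b", "b", "c", "d", "e", "f", "f", "f"]
toy_lr = good_turing_lr(toy_database)
print(toy_lr)  # 10 * 3 / (2 * 2) = 7.5
```

Only the frequency spectrum enters the calculation; the labels of the types are never used.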
- 21. Another option • Estimate spectrum of haplotype probabilities from spectrum of observed haplotype frequencies by MLE – Orlitsky et al. (2004); Anevski, Gill, Zohren (2013) • Then estimate prosecution, defence probabilities by plug-in from MLE • $\dfrac{N \cdot N_{(1)}}{2 \cdot N_{(2)}}$; $LR = \dfrac{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k}{\sum_{k=1}^{\infty} (1-p_k)^N \, p_k^2}$; $\widehat{LR} = \dfrac{\sum_{k=1}^{\infty} (1-\hat{p}_k)^N \, \hat{p}_k}{\sum_{k=1}^{\infty} (1-\hat{p}_k)^N \, \hat{p}_k^2}$
- 22. Example (Ex. 2) • We (visitors from Mars) go on safari. We see N = 3 mammals. Two are of one species and one of another (e.g., two lions, one tiger) • *Naive estimator* of spectrum of probabilities is (2/3, 1/3, 0, 0, …) • MLE is (1/2, 1/2, 0, 0, …) • $\mathrm{lik}(p_1, p_2, p_3, \dots) = p_1^2 p_2 + p_2^2 p_1 + \dots$; $p_1 \ge p_2 \ge p_3 \ge \dots$, $\sum_k p_k = 1$ • Orlitsky et al. (2004), Anevski et al. (2013)
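The safari likelihood can be checked by hand or with a few lines of code. The sketch below evaluates the sum of p_i² p_j over all ordered pairs of distinct species, using exact fractions, and compares the two candidate spectra from this slide. The function name is an assumption, and the sketch compares candidates only; it does not search for the maximum.

```python
from fractions import Fraction
from itertools import permutations


def safari_likelihood(p):
    """Sum over ordered pairs i != j of p_i**2 * p_j (pattern: one species twice, another once)."""
    return sum(p[i] ** 2 * p[j] for i, j in permutations(range(len(p)), 2))


naive = [Fraction(2, 3), Fraction(1, 3)]
mle = [Fraction(1, 2), Fraction(1, 2)]

lik_naive = safari_likelihood(naive)
lik_mle = safari_likelihood(mle)
print(lik_naive, lik_mle)  # 2/9 and 1/4
```

The spectrum (1/2, 1/2) gives likelihood 1/4, which exceeds the 2/9 obtained from the naive spectrum (2/3, 1/3).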
- 23. Example (Ex. 3)
      frequency    27  23  16  14  13  12  11  10   9   8   7   6   5   4   3    2     1
      replicates    1   1   2   1   1   1   2   2   1   3   2   6  11  33  71  253  1434
Source: Chang Xuan Mao (Interdepartmental Group in Biostatistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, U.S.A., cmao@stat.berkeley.edu) and Bruce G. Lindsay (Department of Statistics, Pennsylvania State University, University Park, Pennsylvania 16802-2111, U.S.A., bgl@psu.edu), “A Poisson model for the coverage problem with a genomic application”, Biometrika (2002), 89, 3, pp. 669–681. © 2002 Biometrika Trust. Printed in Great Britain.
Summary: Suppose a population has infinitely many individuals and is partitioned into unknown N disjoint classes. The sample coverage of a random sample from the population is the total proportion of the classes observed in the sample. This paper uses a nonparametric Poisson mixture model to give new understanding and results for inference on the sample coverage. The Poisson mixture model provides a simplified framework for inferring any general abundance-$\mathcal{K}$ coverage, the sum of the proportions of those classes that contribute exactly k individuals in the sample for some k in $\mathcal{K}$, with $\mathcal{K}$ being a set of nonnegative integers. A new moment-based derivation of the well-known Turing estimators is presented. As an application, a gene-categorisation problem in genomic research is addressed. Since Turing's approach is a moment-based method, maximum likelihood estimation and minimum distance estimation are indicated as alternatives for the coverage problem. Finally, it will be shown that any Turing estimator is asymptotically fully efficient. Some key words: Digital gene expression; Poisson mixture; Sample coverage; Species.
1. Introduction: Consider a population composed of infinitely many individuals, which can be considered as an approximation of real populations with finitely many individuals under specific situations, in particular when the number of individuals in a target population is very large. The population has been partitioned into N disjoint classes indexed by i = 1, 2, ..., N, with $\pi_i$ being the proportion of the ith class. The identity of each class and the parameter N are assumed to be unknown prior to the experiment. The unknown $\pi_i$'s are subject to the constraint $\sum_{i=1}^{N} \pi_i = 1$. A random sample of individuals is taken from the population. Let $X_i$ be the number of individuals from the ith class, called the frequency in the sample. If $X_i = 0$, then the ith class is not observed in the sample. It will be assumed that these zero frequencies are 'missed' random variables so that N, the 'sample size' of all the $X_i$'s, is unknown. Let $n_k$ be the number of classes with frequency k and s be the number of …
[Figure: two panels showing the naive and mle estimates of the spectrum; x-axis: tomato genes ordered by frequency of expression (0 to 2000); y-axis: $p_k$ = probability gene k is being expressed (left panel 0.000 to 0.010; right panel: y-axis logarithmic, 1e−04 to 1e−02).] • naive: $\widehat{LR} = 2200$ • mle: $\widehat{LR} = 7400$ • Good-Turing = 7300 • N = 2586; $2.1 \times 10^7$ iterations of SA-MH-EM • (mle has mass 0.3 spread thinly beyond 2000 – for optimisation, taken infinitely thin, inf. far)
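The Good-Turing and naive figures quoted on this slide can be reproduced from the frequency table alone. The sketch below is an illustration under the assumption that the naive estimate plugs the observed relative frequencies j/N into the ratio of sums of (1 − p_k)^N p_k and (1 − p_k)^N p_k²; variable and function names are hypothetical. The MLE figure is not reproduced here, since it requires the SA-MH-EM optimisation.

```python
# Frequency spectrum from the slide: {frequency j: number of genes seen j times}.
spectrum = {27: 1, 23: 1, 16: 2, 14: 1, 13: 1, 12: 1, 11: 2, 10: 2, 9: 1,
            8: 3, 7: 2, 6: 6, 5: 11, 4: 33, 3: 71, 2: 253, 1: 1434}

n = sum(j * count for j, count in spectrum.items())  # total sample size N

# Good-Turing type estimate: N * N1 / (2 * N2).
good_turing = n * spectrum[1] / (2 * spectrum[2])


def plugin_lr(probabilities, n):
    """Ratio of sum (1-p)^N p to sum (1-p)^N p^2 over (p, multiplicity) pairs."""
    numerator = sum(m * (1 - p) ** n * p for p, m in probabilities)
    denominator = sum(m * (1 - p) ** n * p ** 2 for p, m in probabilities)
    return numerator / denominator


# Naive spectrum of probabilities: each gene seen j times gets probability j / N.
naive_probabilities = [(j / n, count) for j, count in spectrum.items()]
naive_lr = plugin_lr(naive_probabilities, n)

print(n, round(good_turing), round(naive_lr))
```

With these inputs N comes out as 2586, the Good-Turing estimate rounds to about 7300, and the naive plug-in to about 2200, in line with the values on the slide.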
- 24. An experiment • Compare DISCLAP and Good-Turing • *Known* population • Small database • A new rare-haplotype match case
- 25. Example (Ex. 1): the Roewer et al. (2005) European Y-chromosomal STR haplotype data (columns pop, dys19, dys389i, dys389ii, dys390, dys391, dys392, dys393), as on slide 13
- 26. [Figure: “Y-chromosome data-base, N=12727”. x-axis: index (0 to 2500); y-axis: ordered frequency (0 to 600).] N=12797 • # distinct profiles: 2489 • # singletons: 1397 • # pairs: 379
- 27. An experiment • Truth = the 2005 Roewer et al. 7-locus database, with each of the 12738 profiles duplicated the same, very large, number of times • Data-base = random sample of size 100 • Case = a new random haplotype different from all in data-base • Repeat this 1000 times
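The design of the experiment can be sketched in a few lines. This is not the experiment reported in the slides: the Roewer database is replaced by a synthetic population with Zipf-like probabilities (an assumption made only so that the example is self-contained), DISCLAP is not run, and the benchmark is the likelihood ratio for the reduced evidence, that is, the ratio of sums of (1 − p_k)^N p_k and (1 − p_k)^N p_k². All names and parameter values are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def true_reduced_lr(probs, n):
    """LR for the reduced evidence when the population probabilities are known."""
    numerator = sum((1 - p) ** n * p for p in probs)
    denominator = sum((1 - p) ** n * p ** 2 for p in probs)
    return numerator / denominator


def good_turing_lr(sample):
    """N * N1 / (2 * N2), or None when the sample has no singletons or no duplets."""
    spectrum = Counter(Counter(sample).values())
    if spectrum[1] == 0 or spectrum[2] == 0:
        return None
    return len(sample) * spectrum[1] / (2 * spectrum[2])


def run_experiment(probs, n=100, replicates=200, seed=1):
    """Draw databases of size n and record log10(estimate) - log10(truth)."""
    rng = random.Random(seed)
    truth = math.log10(true_reduced_lr(probs, n))
    errors = []
    for _ in range(replicates):
        sample = rng.choices(range(len(probs)), weights=probs, k=n)
        estimate = good_turing_lr(sample)
        if estimate is not None:
            errors.append(math.log10(estimate) - truth)
    return truth, errors


# Synthetic stand-in for the "known" population: 5000 types, p_k proportional to 1/k.
raw = [1 / k for k in range(1, 5001)]
total = sum(raw)
population = [w / total for w in raw]

truth, errors = run_experiment(population)
print(f"true log10 LR = {truth:.2f}, median error = {statistics.median(errors):+.2f}")
```

Replicates in which the database contains no duplets are skipped, because the estimate is undefined there; with a database of size 100 this can happen occasionally.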
- 28. The experiment is biased against DISCLAP • The DISCLAP model is not “true”, except as an ever better approximation with larger and larger number of subpopulations • So it suffers from both further levels of uncertainty: “wrong model”; “unknown model parameters” • The Good-Turing model is “true”, so only suffers from one level of uncertainty: “unknown model parameters” • The database is very small (N = 100) • We were “dumb users” - used DISCLAP out of the box, default parameters
- 29. Performance of Good-Turing [Figure: log10(LRg_est) plotted against Index. y-axis: log10 LR (2.4 to 3.6); x-axis: replicate (0 to 1000). Annotated values: 2.6, 2.3, 3.6.]
- 30. Performance of DISCLAP [Figure: “Superimposed Histograms”. x-axis: x (2 to 10); y-axis: Frequency (0 to 250). Labels: estimated log10 LR; true log10 LR; << true log10 LR; Good-Turing.]
- 31. Comparison DISCLAP : Good-Turing • y-axis: estimated minus true log10 LR • x-axis: 1000 replicates [Figure: two panels, log10(Ratio_Gill) and log10(Ratio_Andersen), each plotted against Index (0 to 1000) with y-axis from 0 to 6.]
- 32. Conclusions (1) • Throwing away the haplotype information reduces “true” weight of evidence (log10 LR) typically from about 3.5 to about 2.5 • Good-Turing is fairly unbiased and has rather small variance (typical error about 0.2, rarely 0.5) • DISCLAP “out of the box” is biased in favour of the prosecution by an order of magnitude (error in log10 LR of about 1.0) and the errors are heavily skewed to the right • The defence could therefore object to use of DISCLAP (in this way…) and their objection ought to be accepted
- 33. Conclusions (2) • Much more work to do … • Publications in preparation (Giulia Cereda and me) • For MLE work on spectrum (cf. Mao example…): http://arxiv.org/abs/1312.1200, “Estimating a probability mass function with unknown labels”, Dragi Anevski, Richard D. Gill, Stefan Zohren (building on Orlitsky et al., 2004) • Something completely different, for the cats/dogs here: Google “Ben Geen”
