
Chenoweth os bridge 2015 pp


DNA testing has become the "gold standard" of forensics, but linking an item of evidence to a person of interest isn't always clear cut. New open source tools allow DNA analysts to give statistical weight to evidentiary profiles that were previously unusable, letting juries weigh the evidence for themselves. This talk will discuss the Lab Retriever software package for probabilistic genotyping.

Published in: Science

  1. 1. Calculating Guilt: Using open-source software in forensic DNA testing Sarah Chenoweth @sarahquaint
  2. 2. Disclaimer • All opinions are my own. • Dammit, Jim, I’m a chemist, not a programmer. • …or a statistician.
  3. 3. Gameplan • Forensic DNA 101 • What sort of profiles do I obtain? • Statistics: giving weight to those profiles • Open-or-not software for calculating these statistics
  4. 4. Anne Arundel County Police Crime Lab DNA Technical Leader source: Wikimedia commons
  5. 5. Rosalind was robbed. • 23 pairs of chromosomes • >3 billion base pairs • ~2% is coding DNA (genes) • ~20-40% is regulatory • ~50% is highly repetitive
  6. 6. AATGAATGAATGAATGAATGAATGAATG <-- 7 repeats AATGAATGAATGAATGAATGAATGAATGAATGAATGAAT <-- 9.3 repeats STR = Short Tandem Repeat On chromosome 11, there is an area called TH01, where the STR “AATG” repeats over and over again. On the chromosome from my mother, it repeats 7 times, and on the one from my father, it repeats 9.3 times. Source: National Human Genome Research Institute
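The repeat notation on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the slide's simplified picture in which an allele is a run of whole AATG repeats followed by an optional partial repeat. The function name `str_allele` is made up for this example and is not part of any forensic software.

```python
def str_allele(sequence: str, motif: str = "AATG") -> str:
    """Return an allele label such as '7' or '9.3' for a run of motif repeats."""
    full_repeats = 0
    position = 0
    # Count whole copies of the motif from the start of the sequence.
    while sequence.startswith(motif, position):
        full_repeats += 1
        position += len(motif)
    # Whatever is left after the last whole repeat counts as a partial repeat.
    leftover = len(sequence) - position
    # A partial repeat is written after a dot: 9 repeats + 3 bases -> "9.3".
    return f"{full_repeats}.{leftover}" if leftover else str(full_repeats)


# Alleles built to match the slide: 7 repeats, and 9 repeats plus "AAT".
maternal = "AATG" * 7
paternal = "AATG" * 9 + "AAT"
print(str_allele(maternal), str_allele(paternal))
```

The ".3" in "9.3" is therefore not a decimal fraction: it records three extra bases beyond nine whole repeats.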
  7. 7. STR = Short Tandem Repeat
  8. 8. You are not a special snowflake. • Most of your DNA, including your genes, is “highly conserved” • All humans are 99.9% identical • Of course, 0.1% of 3 billion = 3 million base pairs of variation
  9. 9. This is me.
  10. 10. It’s like an EAN on the back of a book… • A forensic DNA profile is the lengths of 23 STRs, each between 100 and 500 base pairs long • <3% of 1% of 1% of your genome • Unique “barcode”, except for identical siblings.
  11. 11. Receive evidence.
  12. 12. Sample. Extract.
  13. 13. Quantitate.
  14. 14. Amplify.
  15. 15. Measure.
  16. 16. What we receive.
  17. 17. What gives useful results.
  18. 18. Included or excluded? • Single-source profiles are simple. But we mostly see mixtures. • DNA is the gold standard, carries a lot of weight. • Must characterize all inclusions with a statistic. • Make the qualitative statement (excluded, or matches), characterize it with a quantitative statistic, and let the trier of fact evaluate.
  19. 19. Nice, 2 person mixture
  20. 20. Nice, 2 person mixture
  21. 21. Likelihood Ratio (LHR) • 4 alleles at Penta E: 5,7,9,13 • Say this is an assault. We can assume that the victim is present, and we know the victim is 7,9. • So: what are the odds that a random person in the population is a 5,13?
  22. 22. Likelihood Ratio (LHR) • How frequently do we see the 5 allele? About 4% • How frequently do we see the 13 allele? About 5% • At this one locus: 360 times more likely it’s Sarah & Robert than Sarah & someone picked at random from the population. • Calculate this at all 22 loci, and multiply together: 1.6 x 10^23 (160,000,000,000,000,000,000,000)
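The arithmetic behind this slide can be sketched as follows. This assumes the simplest textbook form of the calculation, where the chance that a random person is a 5,13 heterozygote is 2pq under Hardy-Weinberg proportions and the likelihood ratio is its reciprocal; the slide does not say which formula was actually used. With the rounded 4% and 5% figures this simple form gives about 250 rather than the slide's 360, so the slide presumably used unrounded frequencies or a correction it does not show. The function names and the example per-locus values are hypothetical.

```python
import math


def single_locus_lr(p: float, q: float) -> float:
    """LR for a heterozygous unknown contributor: 1 / (2pq) under Hardy-Weinberg."""
    return 1.0 / (2.0 * p * q)


def combined_lr(per_locus_lrs):
    """Multiply per-locus LRs, assuming the loci are independent."""
    return math.prod(per_locus_lrs)


# Rounded frequencies from the slide: 5 allele ~4%, 13 allele ~5%.
lr_penta_e = single_locus_lr(0.04, 0.05)
print(f"Penta E LR with rounded frequencies: {lr_penta_e:.0f}")

# Hypothetical per-locus LRs, only to show the multiplication step.
example_lrs = [250.0, 12.0, 8.5, 40.0]
print(f"Combined LR over {len(example_lrs)} loci: {combined_lr(example_lrs):.3g}")
```

Because the per-locus values are multiplied, even modest ratios at each of 22 loci compound into the very large totals quoted on the slide.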
  23. 23. The world is a dirty place.
  24. 24. The world is a dirty place.
  25. 25. A wretched hive of scum and villainy.
  26. 26. A wretched hive of scum and villainy.
  27. 27. “A reasonable degree of scientific certainty.” • DNA is a living, biological substance = messy • Our testing procedure is super-sensitive. <10 cells • The law wants a clear line between guilty and not guilty; science is full of, “Well, maybe; it depends.” • Our classic statistical tools can’t handle these incomplete mixtures.
  28. 28. Nice, 2 person mixture…
  29. 29. Same… except.
  30. 30. That one little allele. The 9 allele is just below the threshold.
  31. 31. …now what? • Only use the loci where the suspect is present? That’s horribly biased. • Throw up our hands and refuse to draw conclusions on partial data? Also biased! • The least awful solution is to only use the loci that we know have complete info: the ones with two minor alleles. source: my sister, who is the biological mother of this pouty kid.
  32. 32. The loci with 2 minor alleles: 4 out of 22, or 18% of the data. LHR: 1,400,000
  33. 33. Understating is just as bad as overstating. • Well, almost. The justice system is designed to err on the side of caution, and benefit the defendant. • Take a conservative approach. • But not using all the data isn’t always conservative: what if that was exculpatory information?
  34. 34. Probabilistic genotyping: semi-continuous • Considers the probability of drop out when calculating the LHR. • Open source. Fast. • Still doesn’t use all the data (peak height ratios, stutter). Scenario: The victim is: 20,20 The suspect is: 19,22 What is the probability the suspect is a contributor, but the 19 dropped out?
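The scenario on this slide can be worked through with a toy drop-out model. This is a sketch under loudly stated assumptions, not Lab Retriever's actual algorithm: the evidence is assumed to show alleles 20 and 22, the victim never drops out, there is no drop-in, a homozygote drops out with probability `dropout**2`, and the allele frequencies are invented for illustration. All function names are hypothetical.

```python
from itertools import combinations_with_replacement


def prob_evidence(observed, known, genotype, dropout):
    """P(observed alleles | known contributor + one contributor with `genotype`).

    Simplifying assumptions: the known contributor never drops out, there is
    no drop-in, and a homozygote drops out with probability dropout**2.
    """
    a, b = genotype
    # Without drop-in, every observed allele needs a donor.
    if not observed <= (known | {a, b}):
        return 0.0
    if a == b:
        if a in known:
            return 1.0  # masked by the known contributor either way
        return 1.0 - dropout**2 if a in observed else dropout**2
    prob = 1.0
    for allele in (a, b):
        if allele in known:
            continue  # masked by the known contributor
        prob *= (1.0 - dropout) if allele in observed else dropout
    return prob


def semi_continuous_lr(observed, known, suspect, freqs, dropout):
    """LR of 'known + suspect' versus 'known + random unknown person'."""
    numerator = prob_evidence(observed, known, suspect, dropout)
    denominator = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Hardy-Weinberg genotype frequency for the unknown contributor.
        geno_freq = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
        denominator += geno_freq * prob_evidence(observed, known, (a, b), dropout)
    return numerator / denominator


# Hypothetical allele frequencies; "other" lumps every remaining allele.
freqs = {"19": 0.08, "20": 0.15, "22": 0.19, "other": 0.58}
observed = {"20", "22"}  # assumed evidence: the suspect's 19 is missing
victim = {"20"}          # victim is 20,20
suspect = ("19", "22")

for d in (0.0, 0.1, 0.3):
    print(d, semi_continuous_lr(observed, victim, suspect, freqs, d))
```

With a drop-out probability of zero the ratio is zero, which is the binary answer: the suspect's 19 is missing, so the suspect is excluded. Any drop-out probability strictly between 0 and 1 gives a nonzero ratio, which is how a semi-continuous method keeps a locus like this in the calculation instead of discarding it.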
  35. 35. Lab Retriever
  36. 36. Lab Retriever
  37. 37. Lab Retriever
  38. 38. Lab Retriever • if we had a complete mixture, LHR = 1.6 x 10^23 = 160,000,000,000,000,000,000,000 • partial mixture, so we only use 4 loci for LHR = 1.4 x 10^6 = 1,400,000 • same partial mixture, semi-continuous LHR = 7.3 x 10^20 = 730,000,000,000,000,000,000
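Numbers this large are easier to compare as powers of ten. The short sketch below only restates the three figures from this slide on a log scale; the labels are shorthand chosen for this example.

```python
import math

# The three likelihood ratios quoted on the slide.
lrs = {
    "complete mixture": 1.6e23,
    "partial mixture, 4 loci only": 1.4e6,
    "partial mixture, semi-continuous": 7.3e20,
}

for label, lr in lrs.items():
    print(f"{label}: 10^{math.log10(lr):.1f}")

# Gap between the semi-continuous result and the 4-locus result.
recovered = math.log10(
    lrs["partial mixture, semi-continuous"] / lrs["partial mixture, 4 loci only"]
)
print(f"Semi-continuous vs 4 loci: about {recovered:.1f} orders of magnitude")
```

On this scale the semi-continuous result sits much closer to the complete-mixture figure than to the 4-locus figure.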
  39. 39. Probabilistic genotyping: continuous • Markov-chain Monte Carlo (MCMC) simulations. • Uses all of the data, with fewer assumptions. • Doesn’t just give you the best estimate: gives you a range. Probable genotype of the minor contributor: AC: 40% BC: 25% CC: 20% CQ: 15%
  40. 40. STRmix • Developed by the ESR (Environmental Science and Research, NZ) and FSSA (Forensic Science South Australia) • Increasingly becoming the standard • 20K USD initially, 5K/yr support contract
  41. 41. The justice system does not embrace open source. • The data is reliable: but is my interpretation? • But I don’t tell “the whole truth, and nothing but the truth.” I can only answer the questions I’m asked. • Prosecutor misstatement: “That means there’s a one in a quadrillion chance it’s someone else!” • Defense misstatement: “She didn’t test the DNA of a quadrillion people, so there’s no way that’s true!”
  42. 42. Currently, in forensic DNA: • Binary statistics: yes • Semi-continuous: yes • Continuous: no • Frequency databases: yes • Data analysis: no • CODIS: hell to the no source: Wikimedia commons
  43. 43. Statistics are hard. source: Bill Gacey @Flickr
  44. 44. There is too much. Let me sum up: • Transparency is the key to credibility. • I need to document all my observations, results, and calculations so they are reproducible. • Open-source software is necessary for independent verification.
  45. 45. Thank you! twitter: @sarahquaint