DNA testing has become the "gold standard" of forensics, but linking an item of evidence to a person of interest isn't always clear-cut. New open source tools allow DNA analysts to give statistical weight to evidentiary profiles that were previously unusable, letting juries weigh the evidence for themselves. This talk will discuss the Lab Retriever software package for probabilistic genotyping.
2. Disclaimer
• All opinions are my own.
• Dammit, Jim, I’m a chemist, not a programmer.
• …or a statistician.
• slideshare.net/dreamwidth
3. Gameplan
• Forensic DNA 101
• What sort of profiles do I obtain?
• Statistics: giving weight to those profiles
• Open (and not-so-open) software for calculating these statistics
5. Rosalind was robbed.
• 23 pairs of chromosomes
• >3 billion base pairs
• ~2% is coding DNA (genes)
• ~20-40% is regulatory
• ~50% is highly repetitive
6. AATGAATGAATGAATGAATGAATGAATG <— 7 repeats
AATGAATGAATGAATGAATGAATGAATGAATGAATGAAT <— 9.3 repeats
STR = Short Tandem Repeat
On chromosome 11, there is an area called TH01,
where the STR “AATG” repeats over and over again.
On the chromosome from my mother, it repeats 7 times,
and on the one from my father, it repeats 9.3 times.
Source: National Human
Genome Research Institute
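The repeat-counting convention above can be sketched in a few lines of Python. This is an illustrative sketch; the function and its name are mine, not from any forensic package:

```python
def str_allele(seq, motif="AATG"):
    """Count tandem repeats of `motif` the way forensic allele
    names do: whole repeats, plus '.n' for n leftover bases."""
    count, i = 0, 0
    while seq[i:i + len(motif)] == motif:
        count += 1
        i += len(motif)
    leftover = len(seq) - i
    return f"{count}.{leftover}" if leftover else str(count)

print(str_allele("AATG" * 7))          # 7   (the maternal TH01 allele)
print(str_allele("AATG" * 9 + "AAT"))  # 9.3 (the paternal TH01 allele)
```

A partial repeat like 9.3 is nine full copies of AATG plus three extra bases, which is exactly what the decimal notation records.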
8. You are not a special
snowflake.
• Most of your DNA, including your genes, is “highly
conserved”
• All humans are 99.9% identical
• Of course, 0.1% of 3 billion = 3 million base pairs of
variation
10. It’s like an EAN on the back
of a book…
• A forensic DNA profile is the lengths of 23 STRs,
each between 100 and 500 base pairs long
• <3% of 1% of 1% of your genome
• Unique “barcode”, except for identical siblings.
18. Included or excluded?
• Single-source profiles are simple. But we mostly see
mixtures.
• DNA is the gold standard, carries a lot of weight.
• Must characterize all inclusions with a statistic.
• Make the qualitative statement (excluded, or
matches), characterize it with a quantitative
statistic, and let the trier of fact evaluate.
21. • 4 alleles at Penta E: 5,7,9,13
• Say this is an assault. We can
assume that the victim is present,
and we know the victim is 7,9.
• So: what are the odds that a random person in the
population is a 5,13?
Likelihood Ratio (LHR)
22. Likelihood Ratio (LHR)
• How frequently do we see the 5 allele? About 4%
• How frequently do we see the 13 allele? About 5%
• At this one locus: 360 times more likely it’s Sarah &
Robert than Sarah & someone picked at random
from the population.
• Calculate this at all 22 loci, and multiply together:
1.6 x 10^23 (160,000,000,000,000,000,000,000)
27. “A reasonable degree of
scientific certainty.”
• DNA is a living, biological substance = messy
• Our testing procedure is super-sensitive. <10 cells
• The law wants a clear line between guilty and not
guilty; science is full of, “Well, maybe; it depends.”
• Our classic statistical tools can’t handle these
incomplete mixtures.
30. That one little allele.
The 9 allele is just
below the threshold.
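Threshold-based allele calling, which is what makes that 9 “disappear,” can be shown with a toy sketch. The threshold value and peak heights are assumed numbers, not the lab’s real ones; the alleles echo the TPOX example discussed in the talk:

```python
# Toy allele calling against an analytical threshold.
# Threshold and peak heights are assumed values; real labs
# validate their own thresholds per instrument.
ANALYTICAL_THRESHOLD = 150  # RFU (relative fluorescence units)

peaks = {"8": 1200, "9": 110, "11": 1450}  # allele -> peak height in RFU

called = [allele for allele, rfu in peaks.items()
          if rfu >= ANALYTICAL_THRESHOLD]
print(called)  # ['8', '11'] -- the 9 sits below threshold and "drops out"
```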
31. …now what?
• Only use the loci where the
suspect is present? That’s
horribly biased.
• Throw up our hands and refuse
to draw conclusions on partial
data? Also biased!
• The least awful solution is to
only use the loci that we know
have complete info: the ones
with two minor alleles.
source: my sister, who is the biological mother of this pouty kid.
32. The loci with 2 minor alleles:
4 out of 22: 18% of the data.
LHR: 1,400,000
33. Understating is just as bad
as overstating.
• Well, almost. The justice system is designed to err
on the side of caution, and benefit the defendant.
• Take a conservative approach.
• But not using all the data isn’t always conservative:
what if that was exculpatory information?
34. Probabilistic genotyping
Semi-continuous
• Considers the probability of drop out when calculating the LHR.
• Open source. Fast.
• Still doesn’t use all the data (peak height ratios, stutter).
Scenario:
The victim is: 20,20
The suspect is: 19,22
What is the probability the suspect is
a contributor, but the 19 dropped out?
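A minimal sketch of the semi-continuous idea at a single locus, under heavy simplifying assumptions: no drop-in or stutter, the victim’s peaks cancel between hypotheses, and the frequencies and drop-out rate are made up. Real tools like Lab Retriever handle all of this properly; this only shows the shape of the calculation.

```python
def lr_semicontinuous(freqs, d):
    """LR that the suspect (19,22) contributed, given that only the 22
    was seen above threshold. d = per-allele drop-out probability."""
    # Hp: victim + suspect. His 22 was seen (1-d); his 19 dropped out (d).
    numerator = (1 - d) * d
    # Hd: victim + random person, who must still explain the 22 peak.
    p22 = freqs[22]
    denominator = p22 ** 2 * (1 - d) ** 2   # unknown is 22,22; both seen
    for allele, p in freqs.items():
        if allele != 22:
            # unknown is a 22,x heterozygote whose x dropped out
            denominator += 2 * p22 * p * (1 - d) * d
    return numerator / denominator

toy_freqs = {19: 0.08, 20: 0.10, 22: 0.12, 23: 0.70}  # made-up frequencies
print(lr_semicontinuous(toy_freqs, d=0.2))  # ~3.7 with these toy numbers
```

The point is that drop-out enters the math explicitly, so a locus with a missing allele still contributes a defensible number instead of being thrown away.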
38. Lab Retriever
• if we had a complete mixture, LHR = 1.6 x 10^23
(160,000,000,000,000,000,000,000)
• partial mixture, so we only use 4 loci: LHR =
1.4 x 10^6 = 1,400,000
• same partial mixture, semi-continuous LHR = 7.3
x 10^20 = 730,000,000,000,000,000,000
39. Probabilistic genotyping
Continuous
• Markov-chain Monte Carlo (MCMC) simulations.
• Uses all of the data, with fewer assumptions.
• Doesn’t just give you the best estimate: gives you a range.
Probable genotype of
the minor contributor:
AC: 40%
BC: 25%
CC: 20%
CQ: 15%
40. STRMix
• Developed by the ESR (Environmental Science and
Research, NZ) and FSSA (Forensic Science South
Australia)
• Increasingly becoming the standard
• 20K USD initially, 5K/yr support contract
41. The justice system does not
embrace open source.
• The data is reliable: but is my interpretation?
• But I don’t tell “the whole truth, and nothing but the
truth.” I can only answer the questions I’m asked.
• Prosecutor misstatement: “That means there’s a one
in a quadrillion chance it’s someone else!”
• Defense misstatement: “She didn’t test the DNA of a
quadrillion people, so there’s no way that’s true!”
42. Currently, in forensic DNA:
• Binary statistics: yes
• Semi-continuous: yes
• Continuous: no
• Frequency databases: yes
• Data analysis: no
• CODIS: hell to the no
source: Wikimedia commons
44. There is too much.
Let me sum up:
• Transparency is the key to credibility.
• I need to document all my observations, results, and
calculations so they are reproducible.
• Open software is necessary for independent
verification.
Good afternoon! Okay, I should say off the bat: DNA cannot be used to calculate guilt. What you can calculate are statistics to give weight whenever you include an individual in a forensic DNA mixture.
This talk is not a tutorial, or a how-to. What I’d like to do is familiarize you with a pretty complex topic — forensic DNA testing — and show you how it relates to open source software in ways you might not be familiar with.
All opinions are my own: the police department I work for doesn’t know I’m here. Also, all photos are my own unless otherwise cited, and everything is licensed for noncommercial use.
This is a lot of data crammed into 40 minutes, and some of these slides are rather dense. They are available on SlideShare. Please feel free to raise your hand during the talk and let me know if I’ve lost you, but for longer questions, wait and grab me afterwards.
I will not be speaking about any specific cases or details, but I will make general mention of some violent crimes, including sexual assaults. Also, when I use the terms ‘male’ and ‘female,’ I’m referring to chromosomal makeup, not gender identity.
So: I’m going to explain, as briefly as I can, what a forensic DNA profile is.
I’ll show you examples of forensic DNA profiles, including the good, the bad, and the ugly.
And finally, I’ll talk about what current and emerging options I have for calculating statistics, and why open source software is important to forensics.
This is the county I work for, with a population of about 500,000. I’m a native and current Baltimorean; Baltimore is just to the north, and D.C. is the diamond-shaped void just to the west.
I’m not speaking in an official capacity. But I do have fifteen years’ experience, which means I’ve had time to develop some strong opinions.
You have, in nearly every cell of your body, a complete copy of your genome, all 23 pairs of chromosomes of it. That’s a lot of information: over 3 billion base pairs, made up of only 4 nucleotides. Surprisingly, only about 2% of the genome is genes, which are actual recipes for proteins. About half consists of repetitive sections with no known function. These repetitive sections are a few nucleotides repeated over and over: AATG, AATG, AATG, for long stretches.
In fact, we have an acronym for those repetitive bits: STRs. Please note that I’m a scientist, and I work for the police, so I’m doubly obliged to have acronyms for everything.
So: you can have complete repeats, or partial repeats, which are noted with a decimal point.
Everyone has two copies, usually different lengths, sometimes the same length.
This is an electropherogram. This is the actual output I see in the lab. I don’t see the A’s, T’s, C’s, and G’s, because the actual nucleotides don’t really matter. But I can measure how many times they repeat. This is the length of the STR, and that’s the top number in the box.
Most of your DNA is “highly conserved”. Highly conserved means mostly the same from generation to generation. This indicates it has an important function, so there’s less variation. And that’s not useful for forensic identification. We want to measure highly variable areas, so we can distinguish between people. Because the STRs have no known function, mutations have no detrimental effect, so they tend to be highly variable.
Now, I mentioned one STR, named TH01…
I use a commercially available DNA kit that looks at 24 specific areas of the genome. It includes 22 STRs (like I just described), a sex-marker gene, and one STR that’s only found on the Y chromosome.
These peaks are called alleles. Where you see one tall peak, I have two copies that are the same length. Homozygous, vs heterozygous.
Now, these STRs represent a very small proportion of your genome, and because they aren’t genes, they don’t provide information about your physical appearance. Just like a UPC or EAN barcode, it’s not useful in and of itself. You need to compare it to known DNA profiles. To do this, we also test oral swabs from individuals related to the case.
So: how do I actually get from an item of evidence — like, say, an empty water bottle left in a stolen vehicle — to a forensic DNA profile? Well, we receive over 400 cases a year, with an incredible variety of items. This represents nearly all the items tested over a two year period (does not include bloodstains or semen).
The first step is to open each item, one at a time, on a sterile surface, and take a cutting or a swabbing. Then we purify the DNA by breaking open the cells with a detergent, and washing away all the membranes, proteins, dirt, etc.
Then, because the concentration of cells on different items is highly variable, we use up a bit of the purified DNA to measure how concentrated it is.
This is a thermal cycler, which is essentially a chemical photocopier. It unzips the double strand of DNA, then uses each strand to make a copy. Repeat this over and over, and you exponentially increase the amount of DNA in about two hours. We also tag each copy with a fluorescent dye.
Like a regular, paper photocopier, GIGO.
Our analysis instrument uses capillary electrophoresis: the capillaries are those thin copper wires, with a platinum cathode at one end and a platinum anode at the other, and a high voltage runs through them. It separates the pieces by size, with shorter pieces moving faster than longer ones. Behind the black door is a laser, which excites the fluorescent tags on the STR copies, which are measured by a CCD camera.
So: to recap, this is what I receive…
…and this is what actually produces interpretable profiles.
Okay! Congratulations, we’ve reached the end of the Science portion! You are all welcome to visit my lab if you’re ever in town. Now: onto the Math-y portion of the talk. I’m not going to hit you with formulas, or anything, just give you a high level overview of the concepts.
Okay, great. When I obtain a clean, single source profile (like my own profile, which I showed you a few slides back), it’s a very simple matter to determine if it matches or doesn’t match the reference standards.
But most evidence yields a mixture. Mixtures are always more complicated, because it’s usually impossible to tell with certainty which alleles belong to which person. With mixtures, it’s particularly important to provide a statistic, to give people an idea of what percentage of the population could fit into that mixture.
In order to calculate any statistics, you need to know the approximate frequency of each allele in the population. The frequency database that I use was tabulated by a team at NIST (the National Institute of Standards and Technology), based on testing of several thousand unrelated individuals.
You saw a single source profile when I showed you my DNA profile. It looks the same, whether it’s from an oral swab, or from a water bottle, or from a bloodstain.
This is an example of a mixture of 1 part male DNA and 2 parts female DNA. I know this because I made this mixture for a validation study. This is actually me and my coworker Robert.
Here’s the second half. Now, these peaks are all of nice, even height, all quite a bit above the interpretation threshold. This is lovely, very easy to interpret. When I get this kind of profile from actual evidence, I calculate a statistic called a likelihood ratio.
I’m not going to go into the math, here, but the likelihood ratio compares two probabilities of the same event under different hypotheses. In forensics, this means I’m weighing the prosecutor’s hypothesis to the defense attorney’s hypothesis. The prosecutor is theorizing that this is a mixture of the victim and the defendant. The defense postulates this is the victim, and some other person selected at random who happens to have a very similar profile to the defendant. The LHR expresses those odds based on how common each allele is in the population.
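The arithmetic behind that comparison can be sketched as follows. This is a sketch of the textbook product-rule form, with no population-substructure correction; with the slide’s rounded 4% and 5% frequencies it gives about 250, and the talk’s figure of roughly 360 presumably reflects the exact database frequencies and corrections the lab actually uses. The multi-locus values below are illustrative, not real casework numbers.

```python
from math import prod

def lr_het(p, q):
    """LR for a heterozygous 'foreign' genotype under the basic
    product rule: the random-match probability is 2pq."""
    return 1.0 / (2 * p * q)

print(lr_het(0.04, 0.05))  # ~250 with these rounded frequencies

# Per-locus LRs multiply across independent loci (illustrative values):
per_locus = [lr_het(0.04, 0.05), 18.0, 42.0]
print(prod(per_locus))  # ~189,000
```

Multiplying modest per-locus ratios across 22 loci is what produces the astronomically large combined LHRs quoted later in the talk.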
Where do these frequencies come from? Studies of large population groups. Specifically, I use a database compiled by NIST — that’s the National Institute of Standards and Technology.
(160 sextillion)
The important point here is that the statistics are very strong when there’s a strong, complete mixture. Unfortunately, I don’t usually get pretty, full mixtures like this.
This is a partial mixture, probably from two people. But there’s a lot of drop out. Take a look at D10 (middle of the blue row). You can see four peaks, but only one is labeled: only one is above the detection threshold for this instrument. The other three “dropped out”. Look right next to it, at D13. All the alleles dropped out. What’s more, though you can kind of see at least two peaks, there are likely others that dropped out so completely that they’re not registering at all.
This is the other half of that profile.
However, this really isn’t even all that bad! …
…this is a mixture of at least five people. [Look in the middle of the green.] This is a seatbelt from a car, from a homicide case I’m working on.
Again, look at D13: only one allele. Does that mean it’s five homozygous people? No, just a lot of dropout.
“A reasonable degree of scientific certainty.” This is the phrase that’s always echoing through my mind when evaluating data. A lot of the time, I don’t have complete profiles. Where’s the line between reasonably certain and standing on shaky ground? Especially with new, ultra-sensitive techniques, we can’t use the same old statistical models.
This is that same 1:2 mixture of Robert and myself that I showed you…
This is the same mixture, same two people, but a different ratio, of 1 part Robert and 9 parts me. So his peaks are going to be very short compared to mine. In fact, they might be so short that some could drop out. Specifically, look at TPOX.
I know I’m an 8,11 at TPOX, and Robert’s a 9,11. But his 9 allele isn’t present. Well, it kind of is, but not above the threshold, so I can’t say it’s there.
Well, that’s just great, when I know this is Robert. But what if it’s an assault case, and the minor component matches the suspect except for one little allele?
Well, I could choose which loci to use for the statistic based on which ones match the suspect. Just leave TPOX out of the calculation. But now I’m interpreting the evidence based on the suspect’s profile. That’s awful.
I could call it inconclusive, but again, how do I actually know if anything’s missing? I have to evaluate the evidence independently of the suspect. The least awful solution…
Now, I still have to make one assumption: that this is a mixture of two people. And I can also assume that I’m one of the two people. But I don’t need to make any assumptions about the suspect. I don’t even need the suspect’s profile in order to decide which loci to use for stats.
Unfortunately, that’s only 4 loci, out of the 22 I tested. I’m using less than a fifth of the data.
That sucks, because…
The solution: better, more sophisticated statistical models. We call these probabilistic genotype models. These are just emerging, just being adopted by forensic labs. There are two classes…
This is the best of the semi-continuous models currently available.
Developed by professors from UC Berkeley, UCLA, and California State University.
Lovely, simple UI, and I figured out how to use it in about four hours last Friday, when I suddenly realized I’d better write this talk. The hardest part was crunching my validation data to determine the probability of drop out.
Here’s that 1:9 mixture of Robert and me…
It calculated the LHR at each locus, for each of the three population groups that it has frequency data for; the product across all loci is the final LHR in bold, at the bottom.
160 sextillion, 1.4 million, & 730 quintillion
Even better: I did not decide which loci to use. I gave it the full profile, the drop out frequency determined previously, from an earlier validation study, and it went from there.
This is so much better and less biased. Also, it’s open source. Anyone can reproduce what I did, including the defense, another DNA expert, or you (if I gave you the profile).
There is another class of probabilistic genotyping models: the continuous models.
Not going to try and explain this math, because I only vaguely grasp it: it’s a sampling algorithm.
Monte Carlo: a method of estimating the probabilities of future events by repeated random sampling.
Markov chain: the most common way to build future states from the present state (we know the frequency of alleles, so we can propose genotypes).
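A toy illustration of the sampling idea, with no claim to match STRMix’s actual model: candidate genotypes for the minor contributor get made-up likelihood weights (echoing the 40/25/20/15 split on the earlier slide), and a simple Metropolis sampler tallies how often each genotype is visited; those tallies converge to the posterior proportions.

```python
import random

# Toy Metropolis sampler over candidate minor-contributor genotypes.
# The weights are made up; real continuous software models peak heights,
# stutter, and mixture proportions instead of handing the sampler the answer.
weights = {"AC": 0.40, "BC": 0.25, "CC": 0.20, "CQ": 0.15}

def sample_genotypes(weights, steps=100_000, seed=1):
    rng = random.Random(seed)
    genotypes = list(weights)
    state = genotypes[0]
    counts = dict.fromkeys(genotypes, 0)
    for _ in range(steps):
        proposal = rng.choice(genotypes)
        # Accept with probability min(1, w(proposal) / w(state)).
        if rng.random() < weights[proposal] / weights[state]:
            state = proposal
        counts[state] += 1
    return {g: n / steps for g, n in counts.items()}

posterior = sample_genotypes(weights)
print(posterior)  # AC comes out near 0.40, CQ near 0.15
```

The payoff is the final bullet on the slide above: instead of a single best answer, you get a distribution over genotypes, with a probability attached to each.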
This is the best of the continuous models available for forensic DNA labs.
Why do I care so much about bias, impartiality, and transparency? Because in this adversarial system, I’ve seen both sides misstate, overstate, and flat-out lie about the significance of DNA. So I want to present and explain the information as clearly as possible to all the stakeholders. (I don’t know why I’m never picked for jury duty: I don’t believe either side, because they both try to twist my words.)
So, what’s the current state of open source in forensic DNA testing?
[just before next slide] And in conclusion…
If people understood statistics, Vegas would be a sleepy spot in the desert.
In my opinion, which is an expert opinion, which means I’m allowed to opine in courtrooms: