DNA testing has become the "gold standard" of forensics, but linking an item of evidence to a person of interest isn't always clear-cut. New open source tools allow DNA analysts to give statistical weight to evidentiary profiles that were previously unusable, letting juries weigh the evidence for themselves. This talk will discuss the Lab Retriever software package for probabilistic genotyping.
2. Disclaimer
• All opinions are my own.
• Dammit, Jim, I’m a chemist, not a programmer.
• …or a statistician.
• slideshare.net/dreamwidth
3. Gameplan
• Forensic DNA 101
• What sort of profiles do I obtain?
• Statistics: giving weight to those profiles
• Open (and not-so-open) software for calculating these statistics
5. Rosalind was robbed.
• 23 pairs of chromosomes
• >3 billion base pairs
• ~2% is coding DNA (genes)
• ~20-40% is regulatory
• ~50% is highly repetitive
6. AATGAATGAATGAATGAATGAATGAATG <— 7 repeats
AATGAATGAATGAATGAATGAATGAATGAATGAATGAAT <— 9.3 repeats
STR = Short Tandem Repeat
On chromosome 11, there is an area called TH01,
where the STR “AATG” repeats over and over again.
On the chromosome from my mother, it repeats 7 times,
and on the one from my father, it repeats 9.3 times.
Source: National Human
Genome Research Institute
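The repeat-counting convention above can be sketched in a few lines of Python. This is an illustrative sketch; the function and its name are mine, not from any forensic package:

```python
def str_allele(seq, motif="AATG"):
    """Count tandem repeats of `motif` the way forensic allele
    names do: whole repeats, plus '.n' for n leftover bases."""
    count, i = 0, 0
    while seq[i:i + len(motif)] == motif:
        count += 1
        i += len(motif)
    leftover = len(seq) - i
    return f"{count}.{leftover}" if leftover else str(count)

print(str_allele("AATG" * 7))          # 7   (the maternal TH01 allele)
print(str_allele("AATG" * 9 + "AAT"))  # 9.3 (the paternal TH01 allele)
```

A partial repeat like 9.3 is nine full copies of AATG plus three extra bases, which is exactly what the decimal notation records.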
8. You are not a special
snowflake.
• Most of your DNA, including your genes, is “highly
conserved”
• All humans are 99.9% identical
• Of course, 0.1% of 3 billion = 3 million base pairs of
variation
10. It’s like an EAN on the back
of a book…
• A forensic DNA profile is the lengths of 23 STRs,
each between 100 and 500 base pairs long
• <3% of 1% of 1% of your genome
• Unique “barcode”, except for identical siblings.
18. Included or excluded?
• Single-source profiles are simple. But we mostly see
mixtures.
• DNA is the gold standard, carries a lot of weight.
• Must characterize all inclusions with a statistic.
• Make the qualitative statement (excluded, or
matches), characterize it with a quantitative
statistic, and let the trier of fact evaluate.
21. • 4 alleles at Penta E: 5,7,9,13
• Say this is an assault. We can
assume that the victim is present,
and we know the victim is 7,9.
• So: what are the odds that a random person in the
population is a 5,13?
Likelihood Ratio (LHR)
22. Likelihood Ratio (LHR)
• How frequently do we see the 5 allele? About 4%
• How frequently do we see the 13 allele? About 5%
• At this one locus: 360 times more likely it’s Sarah &
Robert than Sarah & someone picked at random
from the population.
• Calculate this at all 22 loci, and multiply together:
1.6 x 10^23 (160,000,000,000,000,000,000,000)
27. “A reasonable degree of
scientific certainty.”
• DNA is a living, biological substance = messy
• Our testing procedure is super-sensitive. <10 cells
• The law wants a clear line between guilty and not
guilty; science is full of, “Well, maybe; it depends.”
• Our classic statistical tools can’t handle these
incomplete mixtures.
30. That one little allele.
The 9 allele is just
below the threshold.
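Threshold-based allele calling, which is what makes that 9 “disappear,” can be shown with a toy sketch. The threshold value and peak heights are assumed numbers, not the lab’s real ones; the alleles echo the TPOX example discussed in the talk:

```python
# Toy allele calling against an analytical threshold.
# Threshold and peak heights are assumed values; real labs
# validate their own thresholds per instrument.
ANALYTICAL_THRESHOLD = 150  # RFU (relative fluorescence units)

peaks = {"8": 1200, "9": 110, "11": 1450}  # allele -> peak height in RFU

called = [allele for allele, rfu in peaks.items()
          if rfu >= ANALYTICAL_THRESHOLD]
print(called)  # ['8', '11'] -- the 9 sits below threshold and "drops out"
```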
31. …now what?
• Only use the loci where the
suspect is present? That’s
horribly biased.
• Throw up our hands and refuse
to draw conclusions on partial
data? Also biased!
• The least awful solution is to
only use the loci that we know
have complete info: the ones
with two minor alleles.
source: my sister, who is the biological mother of this pouty kid.
32. The loci with 2 minor alleles:
4 out of 22: 18% of the data.
LHR: 1,400,000
33. Understating is just as bad
as overstating.
• Well, almost. The justice system is designed to err
on the side of caution, and benefit the defendant.
• Take a conservative approach.
• But not using all the data isn’t always conservative:
what if that was exculpatory information?
34. Probabilistic genotyping
Semi-continuous
• Considers the probability of drop out when calculating the LHR.
• Open source. Fast.
• Still doesn’t use all the data (peak height ratios, stutter).
Scenario:
The victim is: 20,20
The suspect is: 19,22
What is the probability the suspect is
a contributor, but the 19 dropped out?
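A minimal sketch of the semi-continuous idea at a single locus, under heavy simplifying assumptions: no drop-in or stutter, the victim’s peaks cancel between hypotheses, and the frequencies and drop-out rate are made up. Real tools like Lab Retriever handle all of this properly; this only shows the shape of the calculation.

```python
def lr_semicontinuous(freqs, d):
    """LR that the suspect (19,22) contributed, given that only the 22
    was seen above threshold. d = per-allele drop-out probability."""
    # Hp: victim + suspect. His 22 was seen (1-d); his 19 dropped out (d).
    numerator = (1 - d) * d
    # Hd: victim + random person, who must still explain the 22 peak.
    p22 = freqs[22]
    denominator = p22 ** 2 * (1 - d) ** 2   # unknown is 22,22; both seen
    for allele, p in freqs.items():
        if allele != 22:
            # unknown is a 22,x heterozygote whose x dropped out
            denominator += 2 * p22 * p * (1 - d) * d
    return numerator / denominator

toy_freqs = {19: 0.08, 20: 0.10, 22: 0.12, 23: 0.70}  # made-up frequencies
print(lr_semicontinuous(toy_freqs, d=0.2))  # ~3.7 with these toy numbers
```

The point is that drop-out enters the math explicitly, so a locus with a missing allele still contributes a defensible number instead of being thrown away.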
38. Lab Retriever
• if we had a complete mixture, LHR = 1.6 x 10^23
(160,000,000,000,000,000,000,000)
• partial mixture, so we only use 4 loci: LHR =
1.4 x 10^6 = 1,400,000
• same partial mixture, semi-continuous LHR = 7.3
x 10^20 = 730,000,000,000,000,000,000
39. Probabilistic genotyping
Continuous
• Markov-chain Monte Carlo (MCMC) simulations.
• Uses all of the data, with fewer assumptions.
• Doesn’t just give you the best estimate: gives you a range.
Probable genotype of
the minor contributor:
AC: 40%
BC: 25%
CC: 20%
CQ: 15%
40. STRMix
• Developed by the ESR (Environmental Science and
Research, NZ) and FSSA (Forensic Science South
Australia)
• Increasingly becoming the standard
• 20K USD initially, 5K/yr support contract
41. The justice system does not
embrace open source.
• The data is reliable: but is my interpretation?
• But I don’t tell “the whole truth, and nothing but the
truth.” I can only answer the questions I’m asked.
• Prosecutor misstatement: “That means there’s a one
in a quadrillion chance it’s someone else!”
• Defense misstatement: “She didn’t test the DNA of a
quadrillion people, so there’s no way that’s true!”
42. Currently, in forensic DNA:
• Binary statistics: yes
• Semi-continuous: yes
• Continuous: no
• Frequency databases: yes
• Data analysis: no
• CODIS: hell to the no
source: Wikimedia commons
44. There is too much.
Let me sum up:
• Transparency is the key to credibility.
• I need to document all my observations, results, and
calculations so they are reproducible.
• Open software is necessary for independent
verification.
Good afternoon! Okay, I should say off the bat: DNA cannot be used to calculate guilt. What you can calculate are statistics to give weight whenever you include an individual in a forensic DNA mixture.
This talk is not a tutorial, or a how-to. What I’d like to do is familiarize you with a pretty complex topic — forensic DNA testing — and show you how it relates to open source software in ways you might not be familiar with.
All opinions are my own: the police department I work for doesn’t know I’m here. Also, all photos are my own unless otherwise cited, and everything is licensed for noncommercial use.
This is a lot of data crammed into 40 minutes, and some of these slides are rather dense. They are available on SlideShare. Please feel free to raise your hand during the talk and let me know if I’ve lost you, but for longer questions, wait and grab me afterwards.
I will not be speaking about any specific cases or details, but I will make general mention of some violent crimes, including sexual assaults. Also, when I use the terms ‘male’ and ‘female,’ I’m referring to chromosomal makeup, not gender identity.
So: I’m going to explain, as briefly as I can, what a forensic DNA profile is.
I’ll show you examples of forensic DNA profiles, including the good, the bad, and the ugly.
And finally, I’ll talk about what current and emerging options I have for calculating statistics, and why open source software is important to forensics.
This is the county I work for, with a population of about 500,000. I’m a native and current Baltimorean; Baltimore is just to the north, and D.C. is the diamond-shaped void just to the west.
I’m not speaking in an official capacity. But I do have fifteen years’ experience, which means I’ve had time to develop some strong opinions.
You have, in nearly every cell of your body, a complete copy of your genome, all 23 pairs of chromosomes of it. That’s a lot of information: over 3 billion base pairs, made up of only 4 nucleotides. Surprisingly, only about 2% of the genome is genes, which are actual recipes for proteins. About half consists of repetitive sections with no known function. These repetitive sections are a few nucleotides repeated over and over: AATG, AATG, AATG, for long stretches.
In fact, we have an acronym for those repetitive bits: STRs. Please note that I’m a scientist, and I work for the police, so I’m doubly obliged to have acronyms for everything.
So: you can have complete repeats, or partial repeats, which are noted with a decimal point.
Everyone has two copies, usually different lengths, sometimes the same length.
This is an electropherogram. This is the actual output I see in the lab. I don’t see the A’s, T’s, C’s, and G’s, because the actual nucleotides don’t really matter. But I can measure how many times they repeat. This is the length of the STR, and that’s the top number in the box.
Most of your DNA is “highly conserved”. Highly conserved means mostly the same from generation to generation. This indicates it has an important function, so there’s less variation. And that’s not useful for forensic identification. We want to measure highly variable areas, so we can distinguish between people. Because the STRs have no known function, mutations have no detrimental effect, so they tend to be highly variable.
Now, I mentioned one STR, named TH01…
I use a commercially available DNA kit that looks at 24 specific areas of the genome. It includes 22 STRs (like I just described), a sex-marker gene, and one STR that’s only found on the Y chromosome.
These peaks are called alleles. Where you see one tall peak, I have two copies that are the same length. Homozygous, vs heterozygous.
Now, these STRs represent a very small proportion of your genome, and because they aren’t genes, they don’t provide information about your physical appearance. Just like a UPC or EAN barcode, it’s not useful in and of itself. You need to compare it to known DNA profiles. To do this, we also test oral swabs from individuals related to the case.
So: how do I actually get from an item of evidence — like, say, an empty water bottle left in a stolen vehicle — to a forensic DNA profile? Well, we receive over 400 cases a year, with an incredible variety of items. This represents nearly all the items tested over a two year period (does not include bloodstains or semen).
The first step is to open each item, one at a time, on a sterile surface, and take a cutting or a swabbing. Then we purify the DNA by breaking open the cells with a detergent, and washing away all the membranes, proteins, dirt, etc.
Then, because the concentration of cells on different items is highly variable, we use up a bit of the purified DNA to measure how concentrated it is.
This is a thermal cycler, which is essentially a chemical photocopier. It unzips the double strand of DNA, then uses each strand to make a copy. Repeat this over and over, and you exponentially increase the amount of DNA in about two hours. We also tag each copy with a fluorescent dye.
Like a regular, paper photocopier, GIGO.
Our analysis instrument uses capillary electrophoresis: the capillaries are those thin copper wires, with a platinum cathode at one end and a platinum anode at the other, and a high voltage runs through them. It separates the pieces by size, with shorter pieces moving faster than longer ones. Behind the black door is a laser, which excites the fluorescent tags on the STR copies, which are measured by a CCD camera.
So: to recap, this is what I receive…
…and this is what actually produces interpretable profiles.
Okay! Congratulations, we’ve reached the end of the Science portion! You are all welcome to visit my lab if you’re ever in town. Now: onto the Math-y portion of the talk. I’m not going to hit you with formulas, or anything, just give you a high level overview of the concepts.
Okay, great. When I obtain a clean, single source profile (like my own profile, which I showed you a few slides back), it’s a very simple matter to determine if it matches or doesn’t match the reference standards.
But most evidence yields a mixture. Mixtures are always more complicated, because it’s usually impossible to tell with certainty which alleles belong to which person. With mixtures, it’s particularly important to provide a statistic, to give people an idea of what percentage of the population could fit into that mixture.
In order to calculate any statistics, you need to know the approximate frequency of each allele in the population. The frequency database that I use was tabulated by a team at NIST (the National Institute of Standards and Technology), based on testing of several thousand unrelated individuals.
You saw a single source profile when I showed you my DNA profile. It looks the same, whether it’s from an oral swab, or from a water bottle, or from a bloodstain.
This is an example of a mixture of 1 part male DNA and 2 parts female DNA. I know this because I made this mixture for a validation study. This is actually me and my coworker Robert.
Here’s the second half. Now, these peaks are all of nice, even height, all quite a bit above the interpretation threshold. This is lovely, very easy to interpret. When I get this kind of profile from actual evidence, I calculate a statistic called a likelihood ratio.
I’m not going to go into the math, here, but the likelihood ratio compares two probabilities of the same event under different hypotheses. In forensics, this means I’m weighing the prosecutor’s hypothesis to the defense attorney’s hypothesis. The prosecutor is theorizing that this is a mixture of the victim and the defendant. The defense postulates this is the victim, and some other person selected at random who happens to have a very similar profile to the defendant. The LHR expresses those odds based on how common each allele is in the population.
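The arithmetic behind that comparison can be sketched as follows. This is a sketch of the textbook product-rule form, with no population-substructure correction; with the slide’s rounded 4% and 5% frequencies it gives about 250, and the talk’s figure of roughly 360 presumably reflects the exact database frequencies and corrections the lab actually uses. The multi-locus values below are illustrative, not real casework numbers.

```python
from math import prod

def lr_het(p, q):
    """LR for a heterozygous 'foreign' genotype under the basic
    product rule: the random-match probability is 2pq."""
    return 1.0 / (2 * p * q)

print(lr_het(0.04, 0.05))  # ~250 with these rounded frequencies

# Per-locus LRs multiply across independent loci (illustrative values):
per_locus = [lr_het(0.04, 0.05), 18.0, 42.0]
print(prod(per_locus))  # ~189,000
```

Multiplying modest per-locus ratios across 22 loci is what produces the astronomically large combined LHRs quoted later in the talk.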
Where do these frequencies come from? Studies of large population groups. Specifically, I use a database compiled by NIST — that’s the National Institute of Standards and Technology.
(160 sextillion)
The important point here is that the statistics are very strong when there’s a strong, complete mixture. Unfortunately, I don’t usually get pretty, full mixtures like this.
This is a partial mixture, probably from two people. But there’s a lot of drop out. Take a look at D10 (middle of the blue row). You can see four peaks, but only one is labeled: only one is above the detection threshold for this instrument. The other three “dropped out”. Look right next to it, at D13. All the alleles dropped out. What’s more, though you can kind of see at least two peaks, there are likely others that dropped out so completely that they’re not registering at all.
This is the other half of that profile.
However, this really isn’t even all that bad! …
…this is a mixture of at least five people. [Look in the middle of the green.] This is a seatbelt from a car, from a homicide case I’m working on.
Again, look at D13: only one allele. Does that mean it’s five homozygous people? No, just a lot of dropout.
“A reasonable degree of scientific certainty.” This is the phrase that’s always echoing through my mind when evaluating data. A lot of the time, I don’t have complete profiles. Where’s the line between reasonably certain and standing on shaky ground? Especially with new, ultra-sensitive techniques, we can’t use the same old statistical models.
This is that same 1:2 mixture of Robert and myself that I showed you…
This is the same mixture, same two people, but a different ratio, of 1 part Robert and 9 parts me. So his peaks are going to be very short compared to mine. In fact, they might be so short that some could drop out. Specifically, look at TPOX.
I know I’m an 8,11 at TPOX, and Robert’s a 9,11. But his 9 allele isn’t present. Well, it kind of is, but not above the threshold, so I can’t say it’s there.
Well, that’s just great, when I know this is Robert. But what if it’s an assault case, and the minor component matches the suspect except for one little allele?
Well, I could choose which loci to use for the statistic based on which ones match the suspect. Just leave TPOX out of the calculation. But now I’m interpreting the evidence based on the suspect’s profile. That’s awful.
I could call it inconclusive, but again, how do I actually know if anything’s missing? I have to evaluate the evidence independently of the suspect. The least awful solution…
Now, I still have to make one assumption: that this is a mixture of two people. And I can also assume that I’m one of the two people. But I don’t need to make any assumptions about the suspect. I don’t even need the suspect’s profile in order to decide which loci to use for stats.
Unfortunately, that’s only 4 loci, out of the 22 I tested. I’m using less than a fifth of the data.
That sucks, because…
The solution: better, more sophisticated statistical models. We call these probabilistic genotype models. These are just emerging, just being adopted by forensic labs. There are two classes…
This is the best of the semi-continuous models currently available.
Developed by professors from UC Berkeley, UCLA, and California State University.
Lovely, simple UI, and I figured out how to use it in about four hours last Friday, when I suddenly realized I’d better write this talk. The hardest part was crunching my validation data to determine the probability of drop out.
Here’s that 1:9 mixture of Robert and me…
It calculated the LHR at each locus, for each of the three population groups that it has frequency data for; the product across all loci is the final LHR in bold, at the bottom.
160 sextillion, 1.4 million, & 730 quintillion
Even better: I did not decide which loci to use. I gave it the full profile, the drop out frequency determined previously, from an earlier validation study, and it went from there.
This is so much better and less biased. Also, it’s open source. Anyone can reproduce what I did, including the defense, another DNA expert, or you (if I gave you the profile).
There is another class of probabilistic genotyping models: the continuous models.
Not going to try and explain this math, because I only vaguely grasp it: it’s a sampling algorithm.
Monte Carlo: a method of estimating the probabilities of future events by repeated random sampling.
Markov chain: the most common way to build future states from the present state (we know the frequency of alleles, so we can propose genotypes).
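A toy illustration of the sampling idea, with no claim to match STRMix’s actual model: candidate genotypes for the minor contributor get made-up likelihood weights (echoing the 40/25/20/15 split on the earlier slide), and a simple Metropolis sampler tallies how often each genotype is visited; those tallies converge to the posterior proportions.

```python
import random

# Toy Metropolis sampler over candidate minor-contributor genotypes.
# The weights are made up; real continuous software models peak heights,
# stutter, and mixture proportions instead of handing the sampler the answer.
weights = {"AC": 0.40, "BC": 0.25, "CC": 0.20, "CQ": 0.15}

def sample_genotypes(weights, steps=100_000, seed=1):
    rng = random.Random(seed)
    genotypes = list(weights)
    state = genotypes[0]
    counts = dict.fromkeys(genotypes, 0)
    for _ in range(steps):
        proposal = rng.choice(genotypes)
        # Accept with probability min(1, w(proposal) / w(state)).
        if rng.random() < weights[proposal] / weights[state]:
            state = proposal
        counts[state] += 1
    return {g: n / steps for g, n in counts.items()}

posterior = sample_genotypes(weights)
print(posterior)  # AC comes out near 0.40, CQ near 0.15
```

The payoff is the final bullet on the slide above: instead of a single best answer, you get a distribution over genotypes, with a probability attached to each.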
This is the best of the continuous models available for forensic DNA labs.
Why do I care so much about bias, impartiality, and transparency? Because in this adversarial system, I’ve seen both sides misstate, overstate, and flat-out lie about the significance of DNA. So I want to present and explain the information as clearly as possible to all the stakeholders. (I don’t know why I’m never picked for jury duty: I don’t believe either side, because they both try to twist my words.)
So, what’s the current state of open source in forensic DNA testing?
[just before next slide] And in conclusion…
If people understood statistics, Vegas would be a sleepy spot in the desert.
In my opinion, which is an expert opinion, which means I’m allowed to opine in courtrooms: