3. Simple Question?
Is a variant present within a genome?
• sequence at a location:
– allele
– haplotype
– genotype
• expressed as something like a
– VCF genotype
– HGVS string
– dbSNP/ClinVar/HGMD entry
4. Simple Question?
Is a variant present within a genome?
• Collection of variants (and reference) in
– VCF/gVCF/BCF file
– var/MasterVar file
– dbSNP/ClinVar/HGMD/etc.
– your fancy new file format
– your fancy new database
5. Problem Statement
Is a variant present within a genome?
Surely this must be a solved problem?
7. Is this a simple question?
• It also depends on how we define…
– variant, genome, location, genotype, present
• Can we answer this question?
– Is the location well defined?
– Did we observe reads at that location?
– Could we infer a single most-likely genotype at that
location?
– Are we asking about “simple” variation in a “nice”
region of the genome?
• If yes to all of these, then we can almost always
answer our question correctly.
9. Why is this so hard?
• Consider c.2_4delCTAinsGC
– REF: ACTAC
– H1: =G-C=
• It can also be spelled
– c.[2C>G; 3del; 4A>C]
– c.[2C>G; 3T>C; 4del]
– c.[2del; 3T>G; 4A>C]
– …
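To make the equivalence concrete, the sketch below applies each spelling to REF and checks that they all produce the same haplotype. The `apply_edits` helper and its tuple encoding of edits are illustrative assumptions, not an HGVS parser.

```python
def apply_edits(ref, edits):
    """Apply non-overlapping edits to ref and return the resulting sequence.

    Each edit is (start, stop, replacement) with 1-based inclusive
    coordinates, mirroring the c. notation above (an assumed encoding).
    """
    out = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for start, stop, replacement in sorted(edits):
        out.append(ref[cursor:start - 1])  # untouched bases before the edit
        out.append(replacement)            # "" encodes a deletion
        cursor = stop
    out.append(ref[cursor:])               # untouched tail
    return "".join(out)


REF = "ACTAC"
spellings = {
    "c.2_4delCTAinsGC":     [(2, 4, "GC")],
    "c.[2C>G; 3del; 4A>C]": [(2, 2, "G"), (3, 3, ""), (4, 4, "C")],
    "c.[2C>G; 3T>C; 4del]": [(2, 2, "G"), (3, 3, "C"), (4, 4, "")],
    "c.[2del; 3T>G; 4A>C]": [(2, 2, ""), (3, 3, "G"), (4, 4, "C")],
}
haplotypes = {name: apply_edits(REF, edits) for name, edits in spellings.items()}
for name, hap in haplotypes.items():
    print(f"{name:24s} -> {hap}")
```

Every spelling yields the same observed sequence, so a comparison on the spelling alone would wrongly report them as different variants.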
10. Assumptions and notation
• We have an accurate reference genome
sequence
• Queries are relative to well-defined, non-ambiguous
regions of the reference sequence
• Simple sequence query / assertion:
– VCF: (chrom, pos, ref, alts, geno)
• E.g. (chrZ, 55, A, G, 1/1)
– Generic: (chrom, start, stop, alleles)
• E.g. (chrZ, 54, 55, G, G)
– These representations are equivalent modulo some
strange encoding rules for VCF relating to null alleles
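The mapping between the two notations can be sketched as follows. This assumes VCF positions are 1-based and that the generic form uses 0-based half-open coordinates, which is what the chrZ example implies; `vcf_to_generic` is a hypothetical helper, and the null-allele encodings mentioned above are deliberately not handled.

```python
def vcf_to_generic(chrom, pos, ref, alts, geno):
    """Convert (chrom, pos, ref, alts, geno) to (chrom, start, stop, alleles).

    Assumptions: pos is 1-based, start/stop are 0-based half-open, and geno
    is a string such as "0/1" or "1|1". Null alleles ('.') are not handled.
    """
    candidates = [ref] + list(alts)          # index 0 is the reference allele
    indices = [int(i) for i in geno.replace("|", "/").split("/")]
    alleles = tuple(candidates[i] for i in indices)
    start = pos - 1
    stop = start + len(ref)
    return (chrom, start, stop, alleles)


print(vcf_to_generic("chrZ", 55, "A", ["G"], "1/1"))
```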
11. Most basic model
• A “genome” G is a set of sequence assertions
• A “query” is a proposition q∈G where q is a
sequence assertion
• E.g.
– G = { (chrZ, 55, G, G) }
– Q1 : (chrZ, 55, A, G) ∈ G = False
– Q2 : (chrZ, 55, G, G) ∈ G = True
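In code, the basic model is nothing more than set membership. The sketch below keeps the four-element tuples exactly as written above; the `query` function name is an assumption for illustration.

```python
# A "genome" is a set of sequence assertions.
G = {("chrZ", 55, "G", "G")}


def query(genome, assertion):
    """Return True if the exact assertion is present in the genome."""
    return assertion in genome


q1 = query(G, ("chrZ", 55, "A", "G"))
q2 = query(G, ("chrZ", 55, "G", "G"))
print(q1, q2)
```

The comparison is exact tuple equality, so the model only works when every assertion has a single agreed spelling.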
13. Limitations of the basic model
• Sequence assertions do not have unique
representations
– Alignments are not unique
– Alignment models differ
– Nearby variants / phase information
– Missing data and uncertainty
• Sometimes we aren’t asking the right question
14. Alignments are not unique
• Precedence of insertions, deletions and
mismatches:
– REF: ACAC
– H1: =-G= (AGC)
– H2: =G-= (AGC)
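A small sketch shows that both alignments describe the same observed sequence. The `expand` helper is hypothetical and interprets the alignment strings as used above: `=` keeps the reference base, `-` deletes it, and any other character substitutes it. Insertions are not supported in this encoding.

```python
def expand(ref, alignment):
    """Return the sequence spelled by an alignment string against ref."""
    if len(ref) != len(alignment):
        raise ValueError("alignment must have one symbol per reference base")
    out = []
    for base, op in zip(ref, alignment):
        if op == "=":
            out.append(base)   # match: keep the reference base
        elif op == "-":
            continue           # deletion: drop the reference base
        else:
            out.append(op)     # mismatch: substitute the given base
    return "".join(out)


h1 = expand("ACAC", "=-G=")
h2 = expand("ACAC", "=G-=")
print(h1, h2)
```

Which alignment a caller reports depends only on whether it prefers to place the deletion before or after the mismatch.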
21. Bottom Line
• Is there a canonical form for sequence
assertions?
– If so, then we can normalize our data into that
form and rely on simple set-existential queries
– If not, then we need a better model
– In the meantime, we rely on heuristics to perform
comparisons and understand that they are
imperfect
22. Better models
• Two basic approaches
1. Standardize alignment and representations so
that we can always derive a unique canonical
representation
2. Make the comparison model “spelling agnostic”
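As one illustration of the first approach, the sketch below trims and left-aligns a single ref/alt pair, a widely used normalization for indels in repeats. It is an assumption about what "canonical" could mean here, not the method proposed in these slides, and `normalize` is a hypothetical helper using 0-based positions.

```python
def normalize(ref_seq, pos, ref, alt):
    """Trim and left-align one variant; pos is 0-based into ref_seq."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base leftwards.
        if (not ref or not alt) and pos > 0:
            pos -= 1
            base = ref_seq[pos]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


ref_seq = "GGGCACACACAGGG"
# Two spellings of "delete one CA unit" from the CA repeat.
right = normalize(ref_seq, 8, "ACA", "A")
left = normalize(ref_seq, 2, "GCA", "G")
print(right, left)
```

Left-alignment reconciles shifted indels like these, but it does not by itself reconcile spellings that split a complex variant into different pieces, which is the motivation for the second approach.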
23. Reference graph model
• Convert (g)VCF and other file formats into a graph
representation
• Compute whether graph can “generate” the query
haplotype or genotype
– Supporting multiple forms of ambiguity that are inherent
in the biological questions we ask.
(Figure label: Phase constraint)
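The "can the graph generate the query" test can be sketched with a deliberately simplified graph: a linear chain of non-overlapping sites, each offering a set of allowed alleles. The `haplotypes` and `can_generate` names and the site encoding are assumptions for illustration. This toy version ignores phase, so it accepts every combination of alleles across sites.

```python
from itertools import product


def haplotypes(ref, sites):
    """Enumerate every sequence a linear chain of variant sites can spell.

    sites: sorted, non-overlapping (start, stop, alleles) triples with
    0-based half-open coordinates; alleles lists every sequence allowed
    in place of ref[start:stop], including the reference allele if allowed.
    """
    pieces = []
    cursor = 0
    for start, stop, alleles in sites:
        pieces.append([ref[cursor:start]])  # fixed reference segment
        pieces.append(sorted(alleles))      # branching point in the graph
        cursor = stop
    pieces.append([ref[cursor:]])
    return {"".join(path) for path in product(*pieces)}


def can_generate(ref, sites, query):
    """Return True if some path through the graph spells the query."""
    return query in haplotypes(ref, sites)


REF = "ACTAC"
sites = [(1, 2, {"C", "G"}), (2, 3, {"T", ""}), (3, 4, {"A", "C"})]
print(can_generate(REF, sites, "AGCC"))
print(can_generate(REF, sites, "AGGC"))
```

Because the test compares generated sequences rather than spellings, it does not matter how the variants were written in the input file. Enumerating all paths is exponential in the number of sites, so a real implementation would need to restrict the search to a local window and honor phase constraints.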
24. Related Problems
• What are all of the differences between two
genomes?
• Collect all alleles observed across multiple
genomes
• Merge genomes into a single coherent
representation
• Efficiently store and query a large number of
genomes
25. Implementation plan
• Build a reference implementation
– Open source, free, and hosted by GA4GH
– Built in Python + Cython
– Include an extensive test suite
• Not inventing any new file formats
• Implementation underway
– VCF processor built on htslib
– Rest of the engine in progress
– Accounting and testing coming soon after
26. Thanks to:
• Justin Zook and the other GIAB organizers
• Geneticists, who have been doing this right all along
• Complete Genomics for their calldiff algorithm
• Great discussions and debates with friends and
colleagues at NCI, NCBI, Invitae, 23andMe, 1000
Genomes, GA4GH, etc.