Jan 2015 GA4GH variant comparison

GA4GH work towards a standardized
variant comparison tool
Kevin Jacobs
1/30/2015

Simple Question
Is a variant present within a genome?

Simple Question?
• sequence at a location:
– allele
– haplotype
– genotype
• something like a
– VCF genotype
– HGVS string
– dbSNP/ClinVar/HGMD entry

Simple Question?
• Collection of variants (and reference) in
– VCF/gVCF/BCF file
– var/MasterVar file
– dbSNP/ClinVar/HGMD/etc.
– your fancy new file format
– your fancy new database

Problem Statement
Surely this must be a solved problem?

Dr. Seuss
• Sometimes the questions
are complicated and the
answers are simple

Is this a simple question?
• It also depends on how we define…
– variant, genome, location, genotype, present
• Can we answer this question?
– Is the location well defined?
– Did we observe reads at that location?
– Could we infer a single most-likely genotype at that
location?
– Are we asking about “simple” variation in a “nice”
region of the genome?
• If yes to all of these, then we can almost always
answer our question correctly.

Why is this so hard?
• Consider c.2_4delCTAinsGC
– REF: ACTAC
– H1: =G-C=
• It can also be spelled
– c.[2C>G; 3del; 4A>C]
– c.[2C>G; 3T>C; 4del]
– c.[2del; 3T>G; 4A>C]
– …
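The equivalence of these spellings can be checked by applying each one to the reference and comparing the resulting haplotypes. This is a minimal sketch: the helper apply_edits and the (start, stop, alt) edit tuples are illustrative names, not part of HGVS or any library, and coordinates are converted by hand from the 1-based positions above to 0-based, half-open intervals.

```python
def apply_edits(ref, edits):
    """Apply non-overlapping (start, stop, alt) edits to ref.

    Coordinates are 0-based and half-open; alt == "" is a deletion.
    """
    out, pos = [], 0
    for start, stop, alt in sorted(edits):
        out.append(ref[pos:start])  # unchanged reference before the edit
        out.append(alt)             # replacement sequence
        pos = stop
    out.append(ref[pos:])           # unchanged reference after the last edit
    return "".join(out)


REF = "ACTAC"

# Hand-translated from the HGVS spellings on this slide (assumption:
# HGVS position 1 corresponds to REF[0]).
spellings = {
    "c.2_4delCTAinsGC":     [(1, 4, "GC")],
    "c.[2C>G; 3del; 4A>C]": [(1, 2, "G"), (2, 3, ""), (3, 4, "C")],
    "c.[2C>G; 3T>C; 4del]": [(1, 2, "G"), (2, 3, "C"), (3, 4, "")],
    "c.[2del; 3T>G; 4A>C]": [(1, 2, ""), (2, 3, "G"), (3, 4, "C")],
}

haplotypes = {name: apply_edits(REF, edits) for name, edits in spellings.items()}
for name, hap in haplotypes.items():
    print(name, "->", hap)
```

All four spellings produce the same haplotype, AGCC, even though no two of them share a single edit.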

Assumptions and notation
• We have an accurate reference genome
sequence
• Queries are relative to well-defined,
non-ambiguous regions of the reference sequence
• Simple sequence query / assertion:
– VCF: (chrom, pos, ref, alts, geno)
• E.g. (chrZ, 55, A, G, 1/1)
– Generic: (chrom, start, stop, alleles)
• E.g. (chrZ, 54, 55, G, G)
– These representations are equivalent modulo some
strange encoding rules for VCF relating to null alleles
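The mapping between the two notations can be sketched as below for the simple case. The function name vcf_to_generic is hypothetical, and the sketch deliberately ignores the VCF special cases mentioned above (padding bases on indels, missing "." alleles); it only converts the 1-based VCF position to a 0-based, half-open interval and resolves genotype indices to allele strings.

```python
import re


def vcf_to_generic(chrom, pos, ref, alts, geno):
    """Convert (chrom, pos, ref, alts, geno) to (chrom, start, stop, *alleles).

    Assumes fully called genotypes; "." (no-call) is not handled here.
    """
    start = pos - 1              # VCF positions are 1-based
    stop = start + len(ref)      # half-open end of the reference span
    choices = [ref] + list(alts)  # genotype index 0 is REF, k is alts[k - 1]
    alleles = tuple(choices[int(i)] for i in re.split(r"[/|]", geno))
    return (chrom, start, stop) + alleles


assertion = vcf_to_generic("chrZ", 55, "A", ["G"], "1/1")
print(assertion)
```

For the example on this slide, the VCF tuple (chrZ, 55, A, G, 1/1) becomes the generic tuple (chrZ, 54, 55, G, G).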

Most basic model
• A “genome” G is a set of sequence assertions
• A “query” is a proposition q∈G where q is a
sequence assertion
• E.g.
– G = { (chrZ, 55, G, G) }
– Q1 : (chrZ, 55, A, G) ∈ G = False
– Q2 : (chrZ, 55, G, G) ∈ G = True
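In code, this model is nothing more than set membership over tuples. The sketch below copies the tuples from this slide as written; representing an assertion as a plain Python tuple is an assumption made for illustration.

```python
# A "genome" is a set of sequence assertions (tuples as written on the slide).
G = {("chrZ", 55, "G", "G")}

Q1 = ("chrZ", 55, "A", "G")
Q2 = ("chrZ", 55, "G", "G")

# A "query" is a membership test: exact tuple equality, nothing more.
q1 = Q1 in G
q2 = Q2 in G
print(q1, q2)
```

Because the test is exact equality, any difference in how an assertion is spelled makes the query fail, even when the underlying sequence is identical.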

Basic model extensions
• Simple extensions
– Indels / MNVs
– Reference calls (like gVCF)
– No calls, partial calls
– Arbitrary ploidy
– Phase, quality, filters, etc. (not shown)
G = {(chrZ, 0, 24, =, =), (chrZ, 24, 25, G, G),
(chrZ, 25, 53, =, =), (chrZ, 53, 55, NN, NN),
(chrZ, 55, 88, =, =), (chrZ, 88, 92, ATAT, NNNN),
(chrZ, 92, 96, =, =), (chrZ, 96, 98, A, ☐),
(chrZ, 98, 100, =, =)}

Limitations of the basic model
• Sequence assertions do not have unique
representations
– Alignments are not unique
– Alignment models differ
– Nearby variants / phase information
– Missing data and uncertainty
• Sometimes we aren’t asking the right question

Alignments are not unique
• Precedence of insertions, deletions and
mismatches:
– REF: ACAC
– H1: =-G= (AGC)
– H2: =G-= (AGC)

Limitations of the basic model
• Sequence assertions do not have unique
representations
– REF: TCACACACAG
– H1: T--CACACAG (REF, 1, 3, ☐)
– H2: TC--ACACAG (REF, 2, 4, ☐)
– H3: TCA--CACAG (REF, 3, 5, ☐)
– H4: TCAC--ACAG (REF, 4, 6, ☐)
– H5: TCACA--CAG (REF, 5, 7, ☐)
– H6: TCACAC--AG (REF, 6, 8, ☐)
– H7: TCACACA--G (REF, 7, 9, ☐)
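One common way to pick a single representative among these spellings is to shift the deletion as far left as it will go. The sketch below handles pure deletions only; the function name left_align_deletion is illustrative, and general-purpose normalizers must also deal with insertions and padded VCF alleles.

```python
def left_align_deletion(ref, start, stop):
    """Shift a deletion of ref[start:stop] as far left as possible.

    Moving the deleted window one base left leaves the resulting sequence
    unchanged exactly when the base entering the window on the left equals
    the base leaving it on the right.
    """
    while start > 0 and ref[start - 1] == ref[stop - 1]:
        start -= 1
        stop -= 1
    return start, stop


REF = "TCACACACAG"
spellings = [(1, 3), (2, 4), (3, 5), (4, 6), (5, 7), (6, 8), (7, 9)]  # H1..H7

canonical = {left_align_deletion(REF, s, e) for s, e in spellings}
print(canonical)
```

All seven spellings collapse to the H1 form, (REF, 1, 3, ☐).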

Alignment models differ
• Different alignment scoring:
– REF: A--CAC
– H1: =GG--= (REF, 1, 1, ☐, GG)
(REF, 1, 3, CA, ☐)
– H2: =--GG= (REF, 1, 3, CA, GG)
• Base-quality-aware alignment algorithms are
even more susceptible to non-unique
alignments

Ignoring phase or phase uncertainty
introduces ambiguity
– REF: ACGT
– H1: =A== (REF, 1, 2, C, A)
– H2: ==C= (REF, 2, 3, G, C)
• Vs
– REF: ACGT
– H1: =AC= (REF, 1, 2, C, A)
(REF, 2, 3, G, C)
– H2: ====
• Vs
– REF: ACGT
– H1: =AC= (REF, 1, 3, CG, AC)
– H2: ====
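The first two cases above can be compared directly. In this sketch the haplotypes are written out as full strings taken from the slide, and snv_calls and unphased are hypothetical helpers that reduce a haplotype pair to the set of per-position assertions an unphased call set would contain.

```python
REF = "ACGT"


def snv_calls(hap):
    """Per-position SNV assertions (start, stop, ref, alt) for one haplotype."""
    return {(i, i + 1, r, a) for i, (r, a) in enumerate(zip(REF, hap)) if r != a}


def unphased(h1, h2):
    """Union of assertions over both haplotypes, discarding phase."""
    return snv_calls(h1) | snv_calls(h2)


trans = ("AAGT", "ACCT")  # first case: one variant on each haplotype
cis = ("AACT", "ACGT")    # second case: both variants on H1, H2 is reference

same_calls = unphased(*trans) == unphased(*cis)
same_genome = sorted(trans) == sorted(cis)
print(same_calls, same_genome)
```

Both cases yield the same two heterozygous assertions, yet the genomes differ. Conversely, the second and third cases use different assertions for the same pair of haplotypes.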

Missing data
G = {(chrZ, 0, 24, =, =), (chrZ, 24, 25, G, G),
(chrZ, 25, 53, =, =), (chrZ, 53, 55, NN, NN),
(chrZ, 55, 88, =, =), (chrZ, 88, 92, ATAT, NNNN),
(chrZ, 92, 96, =, =), (chrZ, 96, 98, A, ☐),
(chrZ, 98, 100, =, =)}
• Q: (chrZ, 54, 55, A, T) ∈ G → False
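Plain set membership answers False here even though the query falls inside a no-call region, where the honest answer is "unknown". One possible refinement, sketched below, is a three-valued query. The encoding is an assumption made for illustration: "N" marks uncalled bases, "=" a reference call, the empty string stands in for the ☐ (deleted) allele, and the function name query is hypothetical.

```python
NO_CALL = "N"

G = {
    ("chrZ", 0, 24, "=", "="), ("chrZ", 24, 25, "G", "G"),
    ("chrZ", 25, 53, "=", "="), ("chrZ", 53, 55, "NN", "NN"),
    ("chrZ", 55, 88, "=", "="), ("chrZ", 88, 92, "ATAT", "NNNN"),
    ("chrZ", 92, 96, "=", "="), ("chrZ", 96, 98, "A", ""),
    ("chrZ", 98, 100, "=", "="),
}


def query(genome, q):
    """Return True, False, or None (unknown) for a sequence assertion q."""
    if q in genome:
        return True
    chrom, start, stop = q[:3]
    for c, s, e, *alleles in genome:
        overlaps = c == chrom and s < stop and start < e
        # Any overlapping assertion with uncalled bases makes the answer unknown.
        if overlaps and any(NO_CALL in a for a in alleles):
            return None
    return False


Q = ("chrZ", 54, 55, "A", "T")
naive = Q in G
answer = query(G, Q)
print(naive, answer)
```

The naive test reports False, while the three-valued query reports None because the query overlaps the no-call at (chrZ, 53, 55).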

Multiple alleles/samples
• Remember our friend:
– REF: TCACACACAG
– H1: T--CACACAG (REF, 1, 3, ☐)
– H2: TCACACA--G (REF, 7, 9, ☐)

Multiple alleles/samples
• What will left-normalizing H2 look like in VCF?
– REF: TCACACACAG
– H2: TCACACA--G (REF, 7, 9, ☐)
– H3: TCACACACTG (REF, 8, 9, T)
– H4: TCACTCACAG (REF, 4, 5, T)
– H5: TTACACACAG (REF, 1, 2, T)
– H1: T--CACACAG (REF, 1, 3, ☐)

Bottom Line
• Is there a canonical form for sequence
assertions?
– If so, then we can normalize our data into that
form and rely on simple set-existential queries
– If not, then we need a better model
– In the meantime, we rely on heuristics to perform
comparisons and understand that they are
imperfect

Better models
• Two basic approaches
1. Standardize alignment and representations so
that we can always derive a unique canonical
representation
2. Make the comparison model “spelling agnostic”

Reference graph model
• Convert (g)VCF and other file formats into a graph
representation
• Compute whether graph can “generate” the query
haplotype or genotype
– Supporting multiple forms of ambiguity that are inherent
in the biological questions we ask.
Phase constraint
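The "generate" idea can be illustrated with a toy graph in which each variant site is a bubble of alternative alleles and a haplotype is any path through the bubbles. This is a rough sketch under strong assumptions, not the planned implementation: sites do not overlap, no phase constraints are applied, and paths are enumerated exhaustively, which is only feasible over small windows. The name generated_haplotypes is hypothetical.

```python
from itertools import product


def generated_haplotypes(ref, sites):
    """Enumerate every haplotype a simple bubble graph can generate.

    sites: non-overlapping (start, stop, alleles) entries, 0-based half-open,
    where alleles lists every sequence allowed at that site.
    """
    segments, pos = [], 0
    for start, stop, alleles in sorted(sites, key=lambda s: (s[0], s[1])):
        segments.append([ref[pos:start]])       # invariant reference segment
        segments.append(sorted(set(alleles)))   # bubble: one branch per allele
        pos = stop
    segments.append([ref[pos:]])
    return {"".join(path) for path in product(*segments)}


REF = "ACGT"
# Two unphased heterozygous sites, as in the phase example earlier.
sites = [(1, 2, ("C", "A")), (2, 3, ("G", "C"))]
haps = generated_haplotypes(REF, sites)

# The MNV spelling (REF, 1, 3, CG, AC) corresponds to haplotype AACT.
print("AACT" in haps, "ATGT" in haps)
```

Because the query is compared as a generated sequence rather than as a tuple, the SNV-pair and MNV spellings of the same haplotype give the same answer. A phase constraint would remove some of the four paths this unphased graph allows.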

Related Problems
• What are all of the differences between two
genomes?
• Collect all alleles observed across multiple
genomes
• Merge genomes into a single coherent
representation
• Efficiently store and query a large number of
genomes

Implementation plan
• Build a reference implementation
– Open source, free, and hosted by GA4GH
– Built in Python + Cython
– Include an extensive test suite
• Not inventing any new file formats
• Implementation underway
– VCF processor built on htslib
– Rest of the engine in progress
– Accounting and testing coming soon after

Thanks to:
• Justin Zook and the other GIAB organizers
• Geneticists, who have been doing this right all along
• Complete Genomics for their calldiff algorithm
• Great discussions and debates with friends and
colleagues at NCI, NCBI, Invitae, 23andMe, 1000
Genomes, GA4GH, etc.
