Variation reference graphs and the variation graph toolkit vg
1. Variation reference graphs and
the variation graph toolkit vg
Erik Garrison, Jouni Sirén, Eric Dawson, Richard
Durbin
Wellcome Trust Sanger Institute
Adam Novak, Benedict Paten et al., UCSC
and many others
2. Variation Reference
• Go beyond a linear reference
– Why a (quasi)-linear reference and a catalog
of variants which we keep finding again?
• Local variation: graph reference
– Map to a structure including known variation
– >99% variants per person already seen
• Long range variation: haplotype structure
– Exploit variation sharing – support phasing
– Recombination rate ~ mutation rate
– >99% recombination breakpoints per person
4. Variation graphs: “Pan Genome”
A variation graph represents many genomes in
one non-redundant structure.
Nodes contain sequence, and edges between the ends
of nodes represent potential links between successive
sequences.
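A minimal sketch of this idea, not vg's actual data model: the node IDs, the `nodes` and `edges` names, and the toy sequences are all assumptions made for illustration. A single-base variant is stored as a small bubble, so the shared flanking sequence appears only once while both haplotypes remain recoverable as walks.

```python
# Toy variation graph: a SNP (A/G) flanked by shared sequence.
# Node IDs and sequences are hypothetical.
nodes = {1: "GATT", 2: "A", 3: "G", 4: "CAT"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def walks(node, target):
    """Yield every node-ID walk from `node` to `target` (acyclic graphs only)."""
    if node == target:
        yield [node]
        return
    for nxt in edges[node]:
        for rest in walks(nxt, target):
            yield [node] + rest


def spell(walk):
    """Concatenate the node sequences along a walk."""
    return "".join(nodes[n] for n in walk)


haplotypes = sorted(spell(w) for w in walks(1, 4))
stored_bases = sum(len(seq) for seq in nodes.values())
print(haplotypes, stored_bases)
```

Two 8-base haplotypes are represented with 9 stored bases rather than 16, which is the sense in which the structure is non-redundant.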
5. Variation graphs and train
tracks
The links in a variation graph are bidirectional.
They behave in many ways like train tracks.
Nodes have positive and negative strands, allowing them to be
traversed in either direction, and can be connected to form loops
(repeats), inversions and translocations.
NB: There are other ways to do this. One can put
sequence on edges, or use unidirectional graphs that are
(nearly) twice as big.
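The bidirectional behaviour can be sketched as follows. This is an illustrative model only: the `(node_id, is_reverse)` step encoding, the `edges` set and the helper names are assumptions, not vg's API. Each step of a walk visits a node on one strand, and every edge can also be followed in the opposite direction with both strands flipped.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical nodes; a step is (node_id, is_reverse).
nodes = {1: "AAC", 2: "GGT", 3: "TTA"}
edges = {
    ((1, False), (2, False)),
    ((2, False), (3, False)),
    ((1, False), (2, True)),   # enter node 2 on its reverse strand (inversion)
    ((2, True), (3, False)),
}


def flip(step):
    node_id, is_reverse = step
    return (node_id, not is_reverse)


def connected(a, b):
    """An edge a->b can also be walked as flip(b)->flip(a)."""
    return (a, b) in edges or (flip(b), flip(a)) in edges


def spell(walk):
    """Spell the sequence of an oriented walk, checking each link exists."""
    for a, b in zip(walk, walk[1:]):
        if not connected(a, b):
            raise ValueError(f"no edge between {a} and {b}")
    return "".join(
        reverse_complement(nodes[n]) if rev else nodes[n] for n, rev in walk
    )


forward = spell([(1, False), (2, False), (3, False)])
inverted = spell([(1, False), (2, True), (3, False)])
backward = spell([(3, True), (2, True), (1, True)])
print(forward, inverted, backward)
```

Walking the forward path backwards on the negative strands spells the reverse complement of the forward sequence, so one set of nodes serves both strands.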
6. “Computational Pan-Genomics:
Status, Promises and
Challenges.”
Computational Pan-Genomics
Consortium. Briefings in
Bioinformatics (2016) in press
Essential
operations on
pan-genomes
8. Implementation in vg
• Nodes with sequence, Edges, Paths,
Mappings
• Alignment tools and .gam format
• Serialisation to disk via protobuf, succinct
representation xg, graph building/editing,
extraction, unrolling and DAGification of local
graphs etc.
https://github.com/vgteam/vg
10. Alternative index: GCSA2
• Generalised Compressed Suffix Array
– Jouni Sirén, Niko Välimäki, Veli Mäkinen
• Natural extension of BWT to graphs
– Essentially set of minimal unique k-mers
with one base prefix extension
– Supports compression, FM-index style
search etc.
• Now implemented for vg graph search
– <20GB index, fast SMEM (super-maximal exact match) seed and extend
Jouni Sirén talk tomorrow
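To illustrate what seeding against a graph involves, here is a deliberately simple stand-in. It is not GCSA2 and uses no BWT or compression: it is a plain Python dictionary of fixed-length k-mers enumerated over all walks in a toy graph. The graph, `k`, and every function name are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical toy graph: a SNP bubble (A/G) between shared flanks.
nodes = {1: "GATT", 2: "A", 3: "G", 4: "CAT"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def kmers_from(node, offset, k):
    """Yield every k-mer starting at (node, offset), following all edges."""
    seq = nodes[node][offset:offset + k]
    if len(seq) == k:
        yield seq
        return
    for nxt in edges[node]:
        for tail in kmers_from(nxt, 0, k - len(seq)):
            yield seq + tail


def build_index(k):
    """Map each k-mer to the set of graph positions where it begins."""
    index = defaultdict(set)
    for node, seq in nodes.items():
        for offset in range(len(seq)):
            for kmer in kmers_from(node, offset, k):
                index[kmer].add((node, offset))
    return index


def seeds(read, index, k):
    """Return (read_offset, node, node_offset) for each exact k-mer hit."""
    hits = []
    for i in range(len(read) - k + 1):
        for node, offset in sorted(index.get(read[i:i + k], ())):
            hits.append((i, node, offset))
    return hits


index = build_index(4)
alt_hits = seeds("TTGCA", index, 4)
print(alt_hits)
```

A read carrying the G allele still finds exact seeds because k-mers spanning the bubble are indexed for both branches. The seeds would then be extended into full alignments, a step this sketch leaves out.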
14. Genotyper output
The genotyper considers support for every bubble based on
embedded paths and emits genotypes as Locus records that
are each a set of alleles represented as paths relative to the
base graph.
Most variants are within the reference.
Also consider new variants by (temporarily)
augmenting the graph to include repeatedly seen
alignment alternatives.
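The idea of weighing support for each allele path through a bubble can be sketched with a toy counting scheme. This is not vg's genotyper or its statistical model: the allele names, the read paths, the `min_fraction` threshold and the diploid call rule are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical bubble between nodes 1 and 4, with two allele paths.
alleles = {"ref": (1, 2, 4), "alt": (1, 3, 4)}

# Hypothetical read alignments, reduced to the node IDs each read traverses.
read_paths = [(1, 2, 4), (1, 2, 4), (1, 3, 4), (2, 4), (1, 3), (1, 3, 4), (1, 4)]


def contains(path, sub):
    """True if `sub` occurs as a contiguous run inside `path`."""
    n = len(sub)
    return any(path[i:i + n] == sub for i in range(len(path) - n + 1))


def allele_support(alleles, read_paths):
    """Count reads consistent with exactly one allele path."""
    support = Counter({name: 0 for name in alleles})
    for read in read_paths:
        matches = [name for name, path in alleles.items() if contains(path, read)]
        if len(matches) == 1:
            support[matches[0]] += 1
    return support


def call_genotype(support, min_fraction=0.2):
    """Toy diploid call: keep alleles holding at least `min_fraction` of support."""
    total = sum(support.values())
    if total == 0:
        return None
    ranked = [name for name, n in support.most_common() if n / total >= min_fraction]
    if len(ranked) == 1:
        return (ranked[0], ranked[0])
    return tuple(sorted(ranked[:2]))


support = allele_support(alleles, read_paths)
genotype = call_genotype(support)
print(dict(support), genotype)
```

The read that jumps directly from node 1 to node 4 matches neither known allele, so it is not counted. An alignment alternative like this, if seen repeatedly, is the kind that could prompt augmenting the graph with a new candidate path.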
16. An architecture supporting stable coordinates
[Figure labels: Reference Graph; Aligned Reads; Augmented Graph & Alignments; Translation; Alignments, Paths, Genotypes, and Annotations Relative to the Augmented Graph]
Coordinates in vg are not stable across graph edits.
But we can retain a mapping from new to old coordinates when editing.
This translation provides a stable coordinate system for VGs, solves the
surjection problem, and enables a virtuous feedback loop!
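A minimal sketch of keeping a translation while editing, under stated assumptions: the `split_node` and `to_old` helpers, the dictionary layout and the node IDs are hypothetical and do not reflect vg's own translation format. When an edit splits a node, each new node records which old node and offset it came from, so positions on the edited graph can be projected back.

```python
# Original graph: one hypothetical node holding a reference sequence.
old_nodes = {1: "GATTACA"}


def split_node(nodes, node_id, offset, new_ids):
    """Split `node_id` at `offset`; return the new graph and a translation.

    The translation maps each new node ID to (old node ID, offset of the
    new node's first base within the old node).
    """
    left_id, right_id = new_ids
    seq = nodes[node_id]
    new_nodes = {k: v for k, v in nodes.items() if k != node_id}
    new_nodes[left_id] = seq[:offset]
    new_nodes[right_id] = seq[offset:]
    translation = {left_id: (node_id, 0), right_id: (node_id, offset)}
    return new_nodes, translation


def to_old(position, translation):
    """Project a (node, offset) position on the edited graph back to the old graph."""
    node_id, offset = position
    if node_id not in translation:  # node untouched by the edit
        return position
    old_id, base = translation[node_id]
    return (old_id, base + offset)


new_nodes, translation = split_node(old_nodes, 1, 4, new_ids=(2, 3))
old_position = to_old((3, 1), translation)
print(new_nodes, old_position)
```

Alignments made against the edited graph can be reported in the original coordinates this way. Sequence that is new to the graph has no old position, which this sketch does not handle.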
17. Thank you
Erik Garrison, Jouni Sirén, Eric Dawson,
Jerven Bolleman, Adam Novak, Glen
Hickey, Benedict Paten, Will Jones, Jordan
Eizenga, Toshiaki Katayama, Orion Buske,
Raoul Bonnal, Mike Lin, and many others
who have helped us understand, design,
implement and evaluate vg.