Presentation by Richard Durbin at the GRC workshop held at the 2016 Genome Informatics meeting at Hinxton, UK

- 1. Variation reference graphs and the variation graph toolkit vg. Erik Garrison, Jouni Sirén, Eric Dawson, Richard Durbin (Wellcome Trust Sanger Institute); Adam Novak, Benedict Paten et al. (UCSC); and many others
- 2. Variation Reference • Go beyond a linear reference – Why a (quasi)-linear reference and a catalog of variants which we keep finding again? • Local variation: graph reference – Map to a structure including known variation – >99% variants per person already seen • Long range variation: haplotype structure – Exploit variation sharing – support phasing – Recombination rate ~ mutation rate – >99% recombination breakpoints per person
- 4. Variation graphs: “Pan Genome” A variation graph represents many genomes in one non-redundant structure. Nodes contain sequence and edges between the ends of nodes represent potential links between successive sequences
- 5. Variation graphs and train tracks. The links in a variation graph are bidirectional. They behave in many ways like train tracks. Nodes have positive and negative strands, allowing them to be traversed in either direction, and can be connected to form loops (repeats), inversions and translocations. NB: There are other ways to do this. One can put sequence on edges, or use unidirectional graphs that are (nearly) twice as big.
- 6. Essential operations on pan-genomes. “Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016), in press
- 8. Implementation in vg • Nodes with sequence, Edges, Paths, Mappings • Alignment tools and .gam format • Serialisation to disk via protobuf, succinct representation xg, graph building/editing, extraction, unrolling and DAGification of local graphs etc. https://github.com/vgteam/vg
- 9. Alignment: k-mer based alignment of short reads to a variation graph; store results in Graph Alignment Map (GAM) format. Figure labels: example read AGCTCTCCTTGTCCCTCCTACGATCTCTTCACTGGCCTCTTATCTTTACTGTTACCAAATCTTTCCGGAAGCTGCTCTTTC; find k-mer subgraphs; read k-mers; node ids; hit clusters; cluster ids; target subgraph; partial order alignment
- 10. Alternative index: GCSA2 • Generalised Compressed Suffix Array – Jouni Sirén, Niko Välimäki, Veli Mäkinen • Natural extension of BWT to graphs – Essentially set of minimal unique k-mers with one base prefix extension – Supports compression, FM-index style search etc. • Now implemented for vg graph search – <20GB index, fast SMEM (Maximal Exact Match) seed and extend – Jouni Sirén talk tomorrow
- 11. Pilot alignment and variant calling evaluation Slides from Benedict Paten and collaborators
- 14. Genotyper output The genotyper considers support for every bubble based on embedded paths and emits genotypes as Locus records that are each a set of alleles represented as paths relative to the base graph. Most variants are within the reference. Also consider new variants by (temporarily) augmenting the graph to include repeatedly seen alignment alternatives.
- 15. Genotype evaluation: mix CHM1/13 Illumina reads – truth from PacBio. Figure panels: MHC, BRCA1
- 16. An architecture supporting stable coordinates. Coordinates in vg are not stable across graph edits, but we can retain a mapping from new to old coordinates when editing. This translation provides a stable coordinate system for VGs, solves the surjection problem, and enables a virtuous feedback loop! Figure labels: Reference Graph; Augmented Graph & Alignments; Alignments, Paths, Genotypes, and Annotations Relative to the Augmented Graph; Aligned Reads; Translation
- 17. Thank you Erik Garrison, Jouni Sirén, Eric Dawson, Jerven Bolleman, Adam Novak, Glen Hickey, Benedict Paten, Will Jones, Jordan Eizenga, Toshiaki Katayama, Orion Buske, Raoul Bonnal, Mike Lin, and many others who have helped us understand, design, implement and evaluate vg.
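
The bidirected graph model described in slides 4 and 5 can be illustrated with a small sketch. This is not vg's implementation or API: the class `VariationGraph`, the helper `flip`, and the toy sequences are hypothetical names and data chosen for illustration. It assumes nodes carry forward-strand sequence, a visit is a `(node_id, is_reverse)` pair, and every edge can also be walked in the opposite direction on the opposite strand (the "train track" behaviour).

```python
from dataclasses import dataclass, field

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip(visit):
    """Flip the strand of an oriented visit (node_id, is_reverse)."""
    node_id, is_reverse = visit
    return (node_id, not is_reverse)


@dataclass
class VariationGraph:
    """Toy bidirected sequence graph (illustrative only, not vg's data model)."""
    nodes: dict = field(default_factory=dict)  # node id -> forward sequence
    edges: set = field(default_factory=set)    # pairs of oriented visits

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, frm, to):
        # An edge frm -> to can equally be walked as flip(to) -> flip(frm),
        # so both traversal directions are stored.
        self.edges.add((frm, to))
        self.edges.add((flip(to), flip(frm)))

    def path_sequence(self, path):
        """Spell out the sequence of a path given as a list of oriented visits."""
        for a, b in zip(path, path[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge between {a} and {b}")
        parts = []
        for node_id, is_reverse in path:
            seq = self.nodes[node_id]
            parts.append(reverse_complement(seq) if is_reverse else seq)
        return "".join(parts)


# A single SNP "bubble": node 1, then either node 2 (T) or node 3 (G), then node 4.
graph = VariationGraph()
for node_id, seq in [(1, "AGC"), (2, "T"), (3, "G"), (4, "TTA")]:
    graph.add_node(node_id, seq)
for a, b in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    graph.add_edge((a, False), (b, False))

ref_path = [(1, False), (2, False), (4, False)]
alt_path = [(1, False), (3, False), (4, False)]
reverse_ref_path = [flip(v) for v in reversed(ref_path)]

print(graph.path_sequence(ref_path))          # AGCTTTA
print(graph.path_sequence(alt_path))          # AGCGTTA
print(graph.path_sequence(reverse_ref_path))  # TAAAGCT
```

Both alleles share nodes 1 and 4, so the structure is non-redundant, and walking the reference path backwards on the negative strand yields its reverse complement without storing a second copy of the sequence.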