Variation reference graphs and the variation graph toolkit vg

Genome Reference Consortium
Erik Garrison, Jouni Sirén, Eric Dawson, Richard Durbin
Wellcome Trust Sanger Institute
Adam Novak, Benedict Paten et al., UCSC
and many others
Variation Reference
• Go beyond a linear reference
– Why a (quasi-)linear reference and a catalog of variants which we keep finding again?
• Local variation: graph reference
– Map to a structure including known variation
– >99% of variants per person already seen
• Long-range variation: haplotype structure
– Exploit variation sharing – support phasing
– Recombination rate ~ mutation rate
– >99% recombination breakpoints per person
Variation graphs: “Pan Genome”
A variation graph represents many genomes in one non-redundant structure.
Nodes contain sequence, and edges between the ends of nodes represent potential links between successive sequences.
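The node, edge and path idea can be sketched in a few lines of Python. This is a toy model for illustration only, not vg's data model or API: the class name `VariationGraph`, its methods and the example sequences are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph (hypothetical, not the vg implementation)."""
    nodes: dict = field(default_factory=dict)  # node id -> sequence
    edges: set = field(default_factory=set)    # (from id, to id) links
    paths: dict = field(default_factory=dict)  # genome name -> list of node ids

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, from_id, to_id):
        self.edges.add((from_id, to_id))

    def add_path(self, name, node_ids):
        # A path may only step along links that exist in the graph.
        for a, b in zip(node_ids, node_ids[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge {a} -> {b}")
        self.paths[name] = list(node_ids)

    def path_sequence(self, name):
        # Spell a genome by concatenating the node sequences along its path.
        return "".join(self.nodes[n] for n in self.paths[name])


def build_snp_graph():
    """Two genomes that differ by one SNP share nodes 1 and 4."""
    g = VariationGraph()
    for node_id, seq in [(1, "AGCT"), (2, "C"), (3, "T"), (4, "GGA")]:
        g.add_node(node_id, seq)
    for edge in [(1, 2), (1, 3), (2, 4), (3, 4)]:
        g.add_edge(*edge)
    g.add_path("ref", [1, 2, 4])
    g.add_path("alt", [1, 3, 4])
    return g
```

In this example the two 8-base genomes are stored in 9 bases of node sequence, which is the non-redundancy the slide refers to.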
Variation graphs and train tracks
The links in a variation graph are bidirectional. They behave in many ways like train tracks.
Nodes have positive and negative strands, allowing them to be traversed in either direction, and can be connected to form loops (repeats), inversions and translocations.
NB: There are other ways to do this. One can have sequence on edges, or use unidirectional graphs that are (nearly) twice as big.
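A minimal Python sketch of strand-aware traversal follows. It assumes a walk is a list of (node id, is_reverse) steps and an edge is a tuple (from id, from_reverse, to id, to_reverse); these encodings and the function names are illustrative choices, not vg's own representation.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the negative strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(edge):
    """An edge read backwards joins the same node ends with both strands
    flipped, so both spellings are mapped to one canonical form."""
    a, a_rev, b, b_rev = edge
    flipped = (b, not b_rev, a, not a_rev)
    return min(edge, flipped)


def spell(nodes, edges, walk):
    """Spell the sequence of a walk, checking that every step follows a link."""
    known = {canonical(e) for e in edges}
    for (a, a_rev), (b, b_rev) in zip(walk, walk[1:]):
        if canonical((a, a_rev, b, b_rev)) not in known:
            raise ValueError(f"no link between {a} and {b} in that orientation")
    return "".join(
        reverse_complement(nodes[n]) if rev else nodes[n] for n, rev in walk
    )


# Toy graph: node 2 can be entered forwards or, via the inversion edges, backwards.
NODES = {1: "AAC", 2: "GGT", 3: "TTA"}
EDGES = [
    (1, False, 2, False), (2, False, 3, False),  # plain forward links
    (1, False, 2, True), (2, True, 3, False),    # inversion of node 2
]
FORWARD = [(1, False), (2, False), (3, False)]
INVERTED = [(1, False), (2, True), (3, False)]
BACKWARD = [(3, True), (2, True), (1, True)]
```

Walking `BACKWARD` spells the reverse complement of `FORWARD` without needing any extra edges, which is why a unidirectional encoding of the same graph needs roughly twice as many nodes.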
Essential operations on pan-genomes
“Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016), in press.
Operations implemented in vg
github.com/vgteam/vg
Implementation in vg
• Nodes with sequence, Edges, Paths, Mappings
• Alignment tools and .gam format
• Serialisation to disk via protobuf, succinct representation xg, graph building/editing, extraction, unrolling and DAGification of local graphs etc.
https://github.com/vgteam/vg
Alignment
k-mer based alignment of short reads to a variation graph; results are stored in Graph Alignment Map (GAM) format.
[Figure: example read AGCTCTCCTTGTCCCTCCTACGATCTCTTCACTGGCCTCTTATCTTTACTGTTACCAAATCTTTCCGGAAGCTGCTCTTTC. Labels: read, k-mers, find k-mer subgraphs, node ids, hit clusters, cluster ids, target subgraph, partial order alignment.]
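The seeding and clustering stages named on this slide can be sketched as below. This is a simplified illustration rather than vg's mapper: it indexes only k-mers that lie inside a single node, clusters hits by closeness of node ids (assuming ids roughly follow graph order), and stops before the partial order alignment step. All function names, the toy nodes and the parameters `k` and `max_gap` are assumptions.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of a sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_index(nodes, k):
    """Map each k-mer to the ids of the nodes that contain it."""
    index = defaultdict(set)
    for node_id, seq in nodes.items():
        for kmer in kmers(seq, k):
            index[kmer].add(node_id)
    return index


def seed_hits(read, index, k):
    """Count, per node, how many read k-mers hit that node."""
    hits = defaultdict(int)
    for kmer in kmers(read, k):
        for node_id in index.get(kmer, ()):
            hits[node_id] += 1
    return dict(hits)


def cluster_hits(hits, max_gap=1):
    """Group hit nodes whose ids are within max_gap of the previous hit."""
    clusters = []
    for node_id in sorted(hits):
        if clusters and node_id - clusters[-1][-1] <= max_gap:
            clusters[-1].append(node_id)
        else:
            clusters.append([node_id])
    return clusters


def best_cluster(hits, clusters):
    """Pick the cluster with the most k-mer support as the target subgraph."""
    return max(clusters, key=lambda c: sum(hits[n] for n in c))


# Toy nodes: 1-3 are cut from the start of the slide's example read,
# 10 and 11 are unrelated decoys.
NODES = {1: "AGCTCTCC", 2: "TTGTCCCT", 3: "CCTACGAT", 10: "GGGGGGGG", 11: "TTTTCCCC"}
READ = "AGCTCTCCTTGTCCCT"
INDEX = build_kmer_index(NODES, 4)
HITS = seed_hits(READ, INDEX, 4)
CLUSTERS = cluster_hits(HITS)
```

Here the read's k-mers hit nodes 1 and 2 strongly and node 11 once by chance, so clustering separates the true locus from the spurious hit before any alignment is attempted.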
Alternative index: GCSA2
• Generalised Compressed Suffix Array
– Jouni Sirén, Niko Välimäki, Veli Mäkinen
• Natural extension of BWT to graphs
– Essentially a set of minimal unique k-mers with one-base prefix extension
– Supports compression, FM-index style search etc.
• Now implemented for vg graph search
– <20GB index, fast SMEM (super-maximal exact match) seed and extend
Jouni Sirén talk tomorrow
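For readers unfamiliar with FM-index style search, here is a small Python sketch of backward search on an ordinary linear string. It is the classical single-sequence case that GCSA2 generalises to graphs, not GCSA2 itself, and the naive suffix sorting and uncompressed tables are for clarity only. The function names are assumptions.

```python
from collections import Counter


def build_fm_index(text):
    """Build BWT-based tables for a plain string (naive construction)."""
    text += "$"  # unique terminator that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first[c]: number of characters in the text that are smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ, len(bwt)


def count_occurrences(index, pattern):
    """Backward search: extend the match one base to the left per step."""
    first, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Each step costs two table lookups regardless of text length, which is what makes seed finding with this family of indexes fast.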
Pilot alignment and variant calling evaluation
Slides from Benedict Paten and collaborators
Genotyper output
The genotyper considers support for every bubble based on embedded paths and emits genotypes as Locus records, each of which is a set of alleles represented as paths relative to the base graph.
Most variants are within the reference.
Also consider new variants by (temporarily) augmenting the graph to include repeatedly seen alignment alternatives.
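To make "support for every bubble" concrete, the following toy Python function calls a diploid genotype at one bubble from per-read allele assignments. It is a deliberately naive illustration and not the vg genotyper's model: the fraction threshold, the function name `genotype_bubble` and the allele labels are all assumptions.

```python
from collections import Counter


def genotype_bubble(read_alleles, min_fraction=0.2, ploidy=2):
    """Call a genotype at one bubble.

    read_alleles: one allele label per read that crosses the bubble,
    i.e. the path through the bubble that the read supports.
    Returns a tuple of allele labels, or None if nothing is supported.
    """
    support = Counter(read_alleles)
    total = sum(support.values())
    if total == 0:
        return None
    # Keep alleles backed by a large enough share of the reads,
    # strongest first.
    kept = [a for a, n in support.most_common() if n / total >= min_fraction]
    if not kept:
        return None
    if len(kept) == 1:
        return (kept[0],) * ploidy       # homozygous call
    return tuple(sorted(kept[:ploidy]))  # heterozygous call
```

For example, ten reads on the reference path give a homozygous call, while a 6/5 split between two paths gives a heterozygous one; a single stray read on another path falls below the threshold and is ignored.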
Genotype evaluation
Mix of CHM1/13 Illumina reads – truth from PacBio
[Figure panels: MHC, BRCA1]
An architecture supporting stable coordinates
Coordinates in vg are not stable across graph edits. But we can retain a mapping from new to old coordinates when editing.
This translation provides a stable coordinate system for VGs, solves the surjection problem, and enables a virtuous feedback loop!
[Figure labels: Reference Graph; Aligned Reads; Augmented Graph & Alignments; Translation; Alignments, Paths, Genotypes, and Annotations Relative to the Augmented Graph]
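The idea of retaining a new-to-old mapping during edits can be sketched as follows. This is an assumption-laden toy, not vg's translation format: the only edit supported is splitting a node, and `split_node`, `to_base` and the dictionary layout are hypothetical names chosen for the example.

```python
def split_node(nodes, translation, node_id, offset, new_ids):
    """Split one node in two and record where each piece came from.

    translation maps a new node id to (base node id, base offset), so
    positions in the edited graph can always be expressed in the
    coordinates of the original (base) graph.
    """
    seq = nodes[node_id]
    if not 0 < offset < len(seq):
        raise ValueError("split offset must fall inside the node")
    left_id, right_id = new_ids
    del nodes[node_id]
    nodes[left_id] = seq[:offset]
    nodes[right_id] = seq[offset:]

    # If the node was itself created by an earlier edit, chain through
    # its entry so the mapping always points at the base graph.
    base_id, base_off = translation.get(node_id, (node_id, 0))
    translation[left_id] = (base_id, base_off)
    translation[right_id] = (base_id, base_off + offset)


def to_base(translation, node_id, offset):
    """Translate a position in the edited graph to base-graph coordinates."""
    base_id, base_off = translation.get(node_id, (node_id, 0))
    return base_id, base_off + offset


# Toy edit history: node 5 is split, then its right half is split again.
BASE = {5: "ACGTAC"}
NODES = dict(BASE)
TRANSLATION = {}
split_node(NODES, TRANSLATION, 5, 2, (7, 8))   # 7 = "AC", 8 = "GTAC"
split_node(NODES, TRANSLATION, 8, 1, (9, 10))  # 9 = "G", 10 = "TAC"
```

After both edits, a position on node 10 still resolves to an offset on base node 5, so annotations made against the edited graph can be reported in coordinates that do not change as the graph is augmented.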
Thank you
Erik Garrison, Jouni Sirén, Eric Dawson, Jerven Bolleman, Adam Novak, Glen Hickey, Benedict Paten, Will Jones, Jordan Eizenga, Toshiaki Katayama, Orion Buske, Raoul Bonnal, Mike Lin, and many others who have helped us understand, design, implement and evaluate vg.