SlideShare a Scribd company logo
Pan-genomics: theory & practice
Michael Schatz
Sept 20, 2014
GRC Assembly Workshop
#gi2014 / @mike_schatz
Part 1:Theory
Theory and practice of graphical population analysis
Advances in Assembly!
First PacBio RS
@ CSHL
First Hybrid
Assembly
“Perfect”
Microbes
“Perfect”
Fungi
“Perfect”
Model Orgs.
“Perfect”
Simple Ag. Genomes
“Perfect”
Human Assembly
“Perfect”
Higher Euk.
Error correction and assembly complexity of single molecule sequencing reads.
Lee, H*, Gurtowski, J*,Yoo, S, Marcus, S, McCombie,WR, Schatz, MC
http://www.biorxiv.org/content/early/2014/06/18/006395
Pan-Genome Alignment & Assembly
SplitMEM: Graphical pan-genome analysis with suffix skips
Marcus, S, Lee, H, Schatz, MC
http://biorxiv.org/content/early/2014/04/06/003954
Pan-genome colored de Bruijn graph!
•  Encodes all the sequence
relationships between the genomes!
•  How well conserved is a given
sequence? !
•  What are the pan-genome
network properties?!
Time to start considering problems
for which N complete genomes are
the input to study the “pan-genome”!
•  Available today for many microbial
species, near future for higher
eukaryotes!
A"
B"
C"
D"
Graphical pan-genome analysis
Colored de Bruijn graph
•  Node for each distinct kmer
•  Directed edge connects consecutive kmers
•  Nodes overlap by k-1 bp
AGA
GAA
AAG
TCC
GTC
AGT
TAA
ATA
GTT
TTA
AGAAGTCC!
!
ATAAGTTA!
Graphical pan-genome analysis
Colored de Bruijn graph
•  Node for each distinct kmer
•  Directed edge connects consecutive kmers
•  Nodes overlap by k-1 bp
AGAA
AAGT
GTCC
ATAA GTTA
More specifically:!
•  We aim to build the compressed de Bruijn graph as quickly as possible without
considering every distinct kmer!
AGAAGTCC!
!
ATAAGTTA!
Maximal Exact Matches (MEMs)"
to de Bruijn Graphs!
GCA!
T G C AC … G G C A A
Overlapping MEMs
T G C C AT C G C C A AC C AT
T G C C AT C G C C A AC C AT
Suffix Trees & de Bruijn Graphs
Key concepts:
•  Shared sequences form repeats called “maximal exact matches” (MEM)
•  Easy to identify MEMs in a suffix tree, but may be nested within other MEMs
•  Use “suffix skips” to quickly decompose MEMs, add in the missing nodes and edges
B. anthracis pan-genome
(9 strains)
k=25 k=1000
Microbial Pan-Genomes
E. coli (62) and B. anthracis (9) pan-genome analysis!
•  Analyzed all available strains in Genbank!
•  Space and time are effectively linear in the number of genomes!
•  O(n log g) where g is the length of the longest genome!
!
Many possible applications: !
•  Identifying “core” genes present in all strains!
•  Characterizing highly variable regions (+ flanking shared) !
•  Cataloging sequences shared by pathogenic varieties!
62 strain E. coli Pan-Genome Node Sharing!
Part 2: Practice
Genetics of Autism Spectrum Disorders
1.  Constructed database of >1M transmitted and de novo indels in ~1000 families
2.  For practical reasons, analysis is computed relative to the (unpatched) reference genome
•  We use population statistics to “clean” problematic regions
•  We believe we are missing and/or misinterpreting some interesting variants
Accurate de novo and transmitted indel detection in exome-capture data using microassembly.
Narzisi et al. (2014) Nature Methods. doi:10.1038/nmeth.3069
Indica
Total Span: 344.3 Mbp
Contig N50: 22.2kbp
Aus
Total Span: 344.9Mbp
Contig N50: 25.5kbp
Whole genome de novo assemblies of three divergent strains of rice (O. sativa)
documents novel gene space of aus and indica
Schatz, Maron, Stein et al (2014) http://biorxiv.org/content/early/2014/04/02/003764
Nipponbare
Total Span: 354.9Mbp
Contig N50: 21.9kbp
Population structure of Oryza sativa
Pan-genomics of draft assemblies
Strategy:!
1.  Align the genomes to each other
(MUMmer)!
2.  Identify segments of genome A that
do not align anywhere to genome B
(BEDTools)!
! Megabases specific to each genome!!!!!
3.  Screen regions that fail to align with
their k-mer frequencies (jellyfish)!
•  “Genome specific regions”
averaged over 10,000x kmer
coverage while unique regions
were ~50x!
! 100s of KB specific to each genome!!!!
Indica
Aus
Genome-specific Regions
Successfully able to identify many regions specific to each genome (30/30 PCR validation)!
Enriched for genes for disease resistance & other interesting phenotypes!
Pan-genomics Summary
•  Now is the time to study pan-genomes
–  Perfect assemblies of microbes and many smaller eukaryotic genomes are
now routine
–  Expect to rapidly scale up these results to larger genomes soon
•  Algorithms must scale to large collections, be robust to
errors, gaps, and ambiguity
–  Large body of assembly and alignment theory can be repurposed
–  Simple refinements, like k-mer screening, can be very effective even if the
sequence is lacking
•  The “right approach” will depend on the questions you ask
–  We all agree we need to work from a graph, but there is not a clear
consensus of what the graph should represent or how it should be encoded.
–  Ultimately the needs will be driven by applications
•  graph-BLAST, -BWA, -SAMTools, -TopHat/Cufflinks, -IGV, -UCSC, -MAKER, …
Acknowledgements
CSHL
Hannon Lab
Gingeras Lab
Jackson Lab
Hicks Lab
Iossifov Lab
Levy Lab
Lippman Lab
Lyon Lab
Martienssen Lab
McCombie Lab
Tuveson Lab
Ware Lab
Wigler Lab
Pacific Biosciences
Oxford Nanopore
Schatz Lab
Rahul Amin
Tyler Gavin
James Gurtowski
Han Fang
Hayan Lee
Maria Nattestad
Aspyn Palatnick
Srividya
Ramakrishnan
Eric Biggers
Ke Jiang
Shoshana Marcus
Giuseppe Narzisi
Rachel Sherman
Greg Vurture
Alejandro Wences
Thank you
http://schatzlab.cshl.edu
@mike_schatz
Biological Data Sciences
Anne Carpenter, Michael Schatz, Matt Wood
Nov 5 - 8, 2014

More Related Content

What's hot

Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
Genome Reference Consortium
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 am
Genome Reference Consortium
 
Mane v2 final
Mane v2 finalMane v2 final
hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)
Shaojun Xie
 
101717.kh miga ashg_grc
101717.kh miga ashg_grc101717.kh miga ashg_grc
101717.kh miga ashg_grc
Genome Reference Consortium
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
Genome Reference Consortium
 
Schneider grc workshop_final
Schneider grc workshop_finalSchneider grc workshop_final
Schneider grc workshop_final
Genome Reference Consortium
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)
Genome Reference Consortium
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
Genome Reference Consortium
 
2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final
Genome Reference Consortium
 
Ashg grc workshop2014_tg
Ashg grc workshop2014_tgAshg grc workshop2014_tg
Ashg grc workshop2014_tg
Genome Reference Consortium
 
Ashg2017 workshop schneider
Ashg2017 workshop schneiderAshg2017 workshop schneider
Ashg2017 workshop schneider
Genome Reference Consortium
 
Grc workshop agbt2015_tg
Grc workshop agbt2015_tgGrc workshop agbt2015_tg
Grc workshop agbt2015_tg
Genome Reference Consortium
 
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesThe NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
Genome Reference Consortium
 
GRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slidesGRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slides
Genome Reference Consortium
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL Hackathon
Genome Reference Consortium
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
Genome Reference Consortium
 
Ashg sedlazeck grc_share
Ashg sedlazeck grc_shareAshg sedlazeck grc_share
Ashg sedlazeck grc_share
Genome Reference Consortium
 
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Saul Kravitz
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
Genome Reference Consortium
 

What's hot (20)

Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 am
 
Mane v2 final
Mane v2 finalMane v2 final
Mane v2 final
 
hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)
 
101717.kh miga ashg_grc
101717.kh miga ashg_grc101717.kh miga ashg_grc
101717.kh miga ashg_grc
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
Schneider grc workshop_final
Schneider grc workshop_finalSchneider grc workshop_final
Schneider grc workshop_final
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
 
2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final
 
Ashg grc workshop2014_tg
Ashg grc workshop2014_tgAshg grc workshop2014_tg
Ashg grc workshop2014_tg
 
Ashg2017 workshop schneider
Ashg2017 workshop schneiderAshg2017 workshop schneider
Ashg2017 workshop schneider
 
Grc workshop agbt2015_tg
Grc workshop agbt2015_tgGrc workshop agbt2015_tg
Grc workshop agbt2015_tg
 
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesThe NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
 
GRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slidesGRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slides
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL Hackathon
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
Ashg sedlazeck grc_share
Ashg sedlazeck grc_shareAshg sedlazeck grc_share
Ashg sedlazeck grc_share
 
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
 

Similar to Theory and practice of graphical population analysis

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
c.titus.brown
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
c.titus.brown
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
c.titus.brown
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
c.titus.brown
 
Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAM
fnothaft
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
c.titus.brown
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Hands
fnothaft
 
Clase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfClase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdf
NoraCRuizGuevara
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
c.titus.brown
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
c.titus.brown
 
Microbiology Assignment Help
Microbiology Assignment HelpMicrobiology Assignment Help
Microbiology Assignment Help
Nursing Assignment Help
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
c.titus.brown
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
c.titus.brown
 
2014 naples
2014 naples2014 naples
2014 naples
c.titus.brown
 
Human genome project
Human genome projectHuman genome project
Human genome project
Shital Pal
 
Introduction to 16S Microbiome Analysis
Introduction to 16S Microbiome AnalysisIntroduction to 16S Microbiome Analysis
Introduction to 16S Microbiome Analysis
Bioinformatics and Computational Biosciences Branch
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
c.titus.brown
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Jan Aerts
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
c.titus.brown
 
Comparative genomics and proteomics
Comparative genomics and proteomicsComparative genomics and proteomics
Comparative genomics and proteomics
Nikhil Aggarwal
 

Similar to Theory and practice of graphical population analysis (20)

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAM
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Hands
 
Clase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfClase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdf
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
 
Microbiology Assignment Help
Microbiology Assignment HelpMicrobiology Assignment Help
Microbiology Assignment Help
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
2014 naples
2014 naples2014 naples
2014 naples
 
Human genome project
Human genome projectHuman genome project
Human genome project
 
Introduction to 16S Microbiome Analysis
Introduction to 16S Microbiome AnalysisIntroduction to 16S Microbiome Analysis
Introduction to 16S Microbiome Analysis
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
Comparative genomics and proteomics
Comparative genomics and proteomicsComparative genomics and proteomics
Comparative genomics and proteomics
 

More from Genome Reference Consortium

What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?
Genome Reference Consortium
 
Genome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkitGenome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkit
Genome Reference Consortium
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
Genome Reference Consortium
 
Lrg and mane 16 oct 2018
Lrg and mane   16 oct 2018Lrg and mane   16 oct 2018
Lrg and mane 16 oct 2018
Genome Reference Consortium
 
20181016 grc presentation-pa
20181016 grc presentation-pa20181016 grc presentation-pa
20181016 grc presentation-pa
Genome Reference Consortium
 
Ashg2017 workshop tg
Ashg2017 workshop tgAshg2017 workshop tg
Ashg2017 workshop tg
Genome Reference Consortium
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
Genome Reference Consortium
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
Genome Reference Consortium
 
AGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: LindsayAGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: Lindsay
Genome Reference Consortium
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
Genome Reference Consortium
 
Everyday de novo diploid assembly
Everyday de novo diploid assemblyEveryday de novo diploid assembly
Everyday de novo diploid assembly
Genome Reference Consortium
 
Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
Genome Reference Consortium
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
Genome Reference Consortium
 
Genome in a Bottle
Genome in a BottleGenome in a Bottle
Genome in a Bottle
Genome Reference Consortium
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materials
Genome Reference Consortium
 
Graph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGraph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regions
Genome Reference Consortium
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
Genome Reference Consortium
 

More from Genome Reference Consortium (17)

What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?
 
Genome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkitGenome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkit
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
 
Lrg and mane 16 oct 2018
Lrg and mane   16 oct 2018Lrg and mane   16 oct 2018
Lrg and mane 16 oct 2018
 
20181016 grc presentation-pa
20181016 grc presentation-pa20181016 grc presentation-pa
20181016 grc presentation-pa
 
Ashg2017 workshop tg
Ashg2017 workshop tgAshg2017 workshop tg
Ashg2017 workshop tg
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
 
AGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: LindsayAGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: Lindsay
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
Everyday de novo diploid assembly
Everyday de novo diploid assemblyEveryday de novo diploid assembly
Everyday de novo diploid assembly
 
Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 
Genome in a Bottle
Genome in a BottleGenome in a Bottle
Genome in a Bottle
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materials
 
Graph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGraph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regions
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 

Recently uploaded

largeintestinepathologiesconditions-240627071428-3c936a47 (2).pptx
largeintestinepathologiesconditions-240627071428-3c936a47 (2).pptxlargeintestinepathologiesconditions-240627071428-3c936a47 (2).pptx
largeintestinepathologiesconditions-240627071428-3c936a47 (2).pptx
muralinath2
 
Possible Anthropogenic Contributions to the LAMP-observed Surficial Icy Regol...
Possible Anthropogenic Contributions to the LAMP-observed Surficial Icy Regol...Possible Anthropogenic Contributions to the LAMP-observed Surficial Icy Regol...
Possible Anthropogenic Contributions to the LAMP-observed Surficial Icy Regol...
Sérgio Sacani
 
Summer program introduction in Yunnan university
Summer program introduction in Yunnan universitySummer program introduction in Yunnan university
Summer program introduction in Yunnan university
Hayato Shimabukuro
 
MCQ in Electrostatics. for class XII pptx
MCQ in Electrostatics. for class XII  pptxMCQ in Electrostatics. for class XII  pptx
MCQ in Electrostatics. for class XII pptx
ArunachalamM22
 
20240710 ACMJ Diagrams Set 3.docx . Apache, Csharp, Mysql, Javascript stack a...
20240710 ACMJ Diagrams Set 3.docx . Apache, Csharp, Mysql, Javascript stack a...20240710 ACMJ Diagrams Set 3.docx . Apache, Csharp, Mysql, Javascript stack a...
20240710 ACMJ Diagrams Set 3.docx . Apache, Csharp, Mysql, Javascript stack a...
Sharon Liu
 
Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...
Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...
Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...
Sérgio Sacani
 
Gametogenesis: Male gametes Formation Process / Spermatogenesis .pdf
Gametogenesis: Male gametes Formation Process / Spermatogenesis .pdfGametogenesis: Male gametes Formation Process / Spermatogenesis .pdf
Gametogenesis: Male gametes Formation Process / Spermatogenesis .pdf
SELF-EXPLANATORY
 
How Does TaskTrain Integrate Workflow and Project Management Efficiently.pdf
How Does TaskTrain Integrate Workflow and Project Management Efficiently.pdfHow Does TaskTrain Integrate Workflow and Project Management Efficiently.pdf
How Does TaskTrain Integrate Workflow and Project Management Efficiently.pdf
Task Train
 
Science-Technology Quiz (School Quiz 2024)
Science-Technology Quiz (School Quiz 2024)Science-Technology Quiz (School Quiz 2024)
Science-Technology Quiz (School Quiz 2024)
Kashyap J
 
Introduction to Space (Our Solar System)
Introduction to Space (Our Solar System)Introduction to Space (Our Solar System)
Introduction to Space (Our Solar System)
vanshgarg8002
 
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
Hossein Fani
 
HAZARDOUS ENERGIES -LOTO Training Program.pptx
HAZARDOUS ENERGIES -LOTO Training Program.pptxHAZARDOUS ENERGIES -LOTO Training Program.pptx
HAZARDOUS ENERGIES -LOTO Training Program.pptx
ArfanSubhani
 
Deploying DAPHNE Computational Intelligence on EuroHPC Vega for Benchmarking ...
Deploying DAPHNE Computational Intelligence on EuroHPC Vega for Benchmarking ...Deploying DAPHNE Computational Intelligence on EuroHPC Vega for Benchmarking ...
Deploying DAPHNE Computational Intelligence on EuroHPC Vega for Benchmarking ...
University of Maribor
 
poikilocytosis 23765437865210857453257844.pptx
poikilocytosis 23765437865210857453257844.pptxpoikilocytosis 23765437865210857453257844.pptx
poikilocytosis 23765437865210857453257844.pptx
muralinath2
 
Lunar Mobility Drivers and Needs - Artemis
Lunar Mobility Drivers and Needs - ArtemisLunar Mobility Drivers and Needs - Artemis
Lunar Mobility Drivers and Needs - Artemis
Sérgio Sacani
 
seed drying lecture, different types of dryers
seed drying lecture, different types of dryersseed drying lecture, different types of dryers
seed drying lecture, different types of dryers
Rammehargahlot1
 
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
PANDURANGLAWATE1
 
Transmission Spectroscopy of the Habitable Zone Exoplanet LHS 1140 b with JWS...
Transmission Spectroscopy of the Habitable Zone Exoplanet LHS 1140 b with JWS...Transmission Spectroscopy of the Habitable Zone Exoplanet LHS 1140 b with JWS...
Transmission Spectroscopy of the Habitable Zone Exoplanet LHS 1140 b with JWS...
Sérgio Sacani
 
Review Article:- A REVIEW ON RADIOISOTOPES IN CANCER THERAPY
Review Article:- A REVIEW ON RADIOISOTOPES IN CANCER THERAPYReview Article:- A REVIEW ON RADIOISOTOPES IN CANCER THERAPY
Review Article:- A REVIEW ON RADIOISOTOPES IN CANCER THERAPY
niranjangiri009
 
antenna-fundamentals an introductions to basics
antenna-fundamentals an introductions to basicsantenna-fundamentals an introductions to basics
antenna-fundamentals an introductions to basics
drphrao1
 

Recently uploaded (20)

largeintestinepathologiesconditions-240627071428-3c936a47 (2).pptx
largeintestinepathologiesconditions-240627071428-3c936a47 (2).pptxlargeintestinepathologiesconditions-240627071428-3c936a47 (2).pptx
largeintestinepathologiesconditions-240627071428-3c936a47 (2).pptx
 
Possible Anthropogenic Contributions to the LAMP-observed Surficial Icy Regol...
Possible Anthropogenic Contributions to the LAMP-observed Surficial Icy Regol...Possible Anthropogenic Contributions to the LAMP-observed Surficial Icy Regol...
Possible Anthropogenic Contributions to the LAMP-observed Surficial Icy Regol...
 
Summer program introduction in Yunnan university
Summer program introduction in Yunnan universitySummer program introduction in Yunnan university
Summer program introduction in Yunnan university
 
MCQ in Electrostatics. for class XII pptx
MCQ in Electrostatics. for class XII  pptxMCQ in Electrostatics. for class XII  pptx
MCQ in Electrostatics. for class XII pptx
 
20240710 ACMJ Diagrams Set 3.docx . Apache, Csharp, Mysql, Javascript stack a...
20240710 ACMJ Diagrams Set 3.docx . Apache, Csharp, Mysql, Javascript stack a...20240710 ACMJ Diagrams Set 3.docx . Apache, Csharp, Mysql, Javascript stack a...
20240710 ACMJ Diagrams Set 3.docx . Apache, Csharp, Mysql, Javascript stack a...
 
Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...
Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...
Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...
 
Gametogenesis: Male gametes Formation Process / Spermatogenesis .pdf
Gametogenesis: Male gametes Formation Process / Spermatogenesis .pdfGametogenesis: Male gametes Formation Process / Spermatogenesis .pdf
Gametogenesis: Male gametes Formation Process / Spermatogenesis .pdf
 
How Does TaskTrain Integrate Workflow and Project Management Efficiently.pdf
How Does TaskTrain Integrate Workflow and Project Management Efficiently.pdfHow Does TaskTrain Integrate Workflow and Project Management Efficiently.pdf
How Does TaskTrain Integrate Workflow and Project Management Efficiently.pdf
 
Science-Technology Quiz (School Quiz 2024)
Science-Technology Quiz (School Quiz 2024)Science-Technology Quiz (School Quiz 2024)
Science-Technology Quiz (School Quiz 2024)
 
Introduction to Space (Our Solar System)
Introduction to Space (Our Solar System)Introduction to Space (Our Solar System)
Introduction to Space (Our Solar System)
 
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
 
HAZARDOUS ENERGIES -LOTO Training Program.pptx
HAZARDOUS ENERGIES -LOTO Training Program.pptxHAZARDOUS ENERGIES -LOTO Training Program.pptx
HAZARDOUS ENERGIES -LOTO Training Program.pptx
 
Deploying DAPHNE Computational Intelligence on EuroHPC Vega for Benchmarking ...
Deploying DAPHNE Computational Intelligence on EuroHPC Vega for Benchmarking ...Deploying DAPHNE Computational Intelligence on EuroHPC Vega for Benchmarking ...
Deploying DAPHNE Computational Intelligence on EuroHPC Vega for Benchmarking ...
 
poikilocytosis 23765437865210857453257844.pptx
poikilocytosis 23765437865210857453257844.pptxpoikilocytosis 23765437865210857453257844.pptx
poikilocytosis 23765437865210857453257844.pptx
 
Lunar Mobility Drivers and Needs - Artemis
Lunar Mobility Drivers and Needs - ArtemisLunar Mobility Drivers and Needs - Artemis
Lunar Mobility Drivers and Needs - Artemis
 
seed drying lecture, different types of dryers
seed drying lecture, different types of dryersseed drying lecture, different types of dryers
seed drying lecture, different types of dryers
 
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
 
Transmission Spectroscopy of the Habitable Zone Exoplanet LHS 1140 b with JWS...
Transmission Spectroscopy of the Habitable Zone Exoplanet LHS 1140 b with JWS...Transmission Spectroscopy of the Habitable Zone Exoplanet LHS 1140 b with JWS...
Transmission Spectroscopy of the Habitable Zone Exoplanet LHS 1140 b with JWS...
 
Review Article:- A REVIEW ON RADIOISOTOPES IN CANCER THERAPY
Review Article:- A REVIEW ON RADIOISOTOPES IN CANCER THERAPYReview Article:- A REVIEW ON RADIOISOTOPES IN CANCER THERAPY
Review Article:- A REVIEW ON RADIOISOTOPES IN CANCER THERAPY
 
antenna-fundamentals an introductions to basics
antenna-fundamentals an introductions to basicsantenna-fundamentals an introductions to basics
antenna-fundamentals an introductions to basics
 

Theory and practice of graphical population analysis

  • 1. Pan-genomics: theory & practice Michael Schatz Sept 20, 2014 GRC Assembly Workshop #gi2014 / @mike_schatz
  • 4. Advances in Assembly! First PacBio RS @ CSHL First Hybrid Assembly “Perfect” Microbes “Perfect” Fungi “Perfect” Model Orgs. “Perfect” Simple Ag. Genomes “Perfect” Human Assembly “Perfect” Higher Euk. Error correction and assembly complexity of single molecule sequencing reads. Lee, H*, Gurtowski, J*,Yoo, S, Marcus, S, McCombie,WR, Schatz, MC http://www.biorxiv.org/content/early/2014/06/18/006395
  • 5. Pan-Genome Alignment & Assembly SplitMEM: Graphical pan-genome analysis with suffix skips Marcus, S, Lee, H, Schatz, MC http://biorxiv.org/content/early/2014/04/06/003954 Pan-genome colored de Bruijn graph! •  Encodes all the sequence relationships between the genomes! •  How well conserved is a given sequence? ! •  What are the pan-genome network properties?! Time to start considering problems for which N complete genomes are the input to study the “pan-genome”! •  Available today for many microbial species, near future for higher eukaryotes! A" B" C" D"
  • 6. Graphical pan-genome analysis Colored de Bruijn graph •  Node for each distinct kmer •  Directed edge connects consecutive kmers •  Nodes overlap by k-1 bp AGA GAA AAG TCC GTC AGT TAA ATA GTT TTA AGAAGTCC! ! ATAAGTTA!
  • 7. Graphical pan-genome analysis Colored de Bruijn graph •  Node for each distinct kmer •  Directed edge connects consecutive kmers •  Nodes overlap by k-1 bp AGAA AAGT GTCC ATAA GTTA More specifically:! •  We aim to build the compressed de Bruijn graph as quickly as possible without considering every distinct kmer! AGAAGTCC! ! ATAAGTTA!
  • 8. Maximal Exact Matches (MEMs)" to de Bruijn Graphs! GCA! T G C AC … G G C A A
  • 9. Overlapping MEMs T G C C AT C G C C A AC C AT T G C C AT C G C C A AC C AT
  • 10. Suffix Trees & de Bruijn Graphs Key concepts: •  Shared sequences form repeats called “maximal exact matches” (MEM) •  Easy to identify MEMs in a suffix tree, but may be nested within other MEMs •  Use “suffix skips” to quickly decompose MEMs, add in the missing nodes and edges
  • 11. B. anthracis pan-genome (9 strains) k=25 k=1000
  • 12. Microbial Pan-Genomes E. coli (62) and B. anthracis (9) pan-genome analysis! •  Analyzed all available strains in Genbank! •  Space and time are effectively linear in the number of genomes! •  O(n log g) where g is the length of the longest genome! ! Many possible applications: ! •  Identifying “core” genes present in all strains! •  Characterizing highly variable regions (+ flanking shared) ! •  Cataloging sequences shared by pathogenic varieties! 62 strain E. coli Pan-Genome Node Sharing!
  • 14. Genetics of Autism Spectrum Disorders 1.  Constructed database of >1M transmitted and de novo indels in ~1000 families 2.  For practical reasons, analysis is computed relative to the (unpatched) reference genome •  We use population statistics to “clean” problematic regions •  We believe we are missing and/or misinterpreting some interesting variants Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Narzisi et al. (2014) Nature Methods. doi:10.1038/nmeth.3069
  • 15. Indica Total Span: 344.3 Mbp Contig N50: 22.2kbp Aus Total Span: 344.9Mbp Contig N50: 25.5kbp Whole genome de novo assemblies of three divergent strains of rice (O. sativa) documents novel gene space of aus and indica Schatz, Maron, Stein et al (2014) http://biorxiv.org/content/early/2014/04/02/003764 Nipponbare Total Span: 354.9Mbp Contig N50: 21.9kbp Population structure of Oryza sativa
  • 16. Pan-genomics of draft assemblies Strategy:! 1.  Align the genomes to each other (MUMmer)! 2.  Identify segments of genome A that do not align anywhere to genome B (BEDTools)! ! Megabases specific to each genome!!!!! 3.  Screen regions that fail to align with their k-mer frequencies (jellyfish)! •  “Genome specific regions” averaged over 10,000x kmer coverage while unique regions were ~50x! ! 100s of KB specific to each genome!!!! Indica Aus
  • 17. Genome-specific Regions Successfully able to identify many regions specific to each genome (30/30 PCR validation)! Enriched for genes for disease resistance & other interesting phenotypes!
  • 18. Pan-genomics Summary •  Now is the time to study pan-genomes –  Perfect assemblies of microbes and many smaller eukaryotic genomes are now routine –  Expect to rapidly scale up these results to larger genomes soon •  Algorithms must scale to large collections, be robust to errors, gaps, and ambiguity –  Large body of assembly and alignment theory can be repurposed –  Simple refinements, like k-mer screening, can be very effective even if the sequence is lacking •  The “right approach” will depend on the questions you ask –  We all agree we need to work from a graph, but there is not a clear consensus of what the graph should represent or how it should be encoded. –  Ultimately the needs will be driven by applications •  graph-BLAST, -BWA, -SAMTools, -TopHat/Cufflinks, -IGV, -UCSC, -MAKER, …
  • 19. Acknowledgements CSHL Hannon Lab Gingeras Lab Jackson Lab Hicks Lab Iossifov Lab Levy Lab Lippman Lab Lyon Lab Martienssen Lab McCombie Lab Tuveson Lab Ware Lab Wigler Lab Pacific Biosciences Oxford Nanopore Schatz Lab Rahul Amin Tyler Gavin James Gurtowski Han Fang Hayan Lee Maria Nattestad Aspyn Palatnick Srividya Ramakrishnan Eric Biggers Ke Jiang Shoshana Marcus Giuseppe Narzisi Rachel Sherman Greg Vurture Alejandro Wences
  • 20. Thank you http://schatzlab.cshl.edu @mike_schatz Biological Data Sciences Anne Carpenter, Michael Schatz, Matt Wood Nov 5 - 8, 2014