SlideShare a Scribd company logo
1 of 20
Download to read offline
Pan-genomics: theory & practice
Michael Schatz
Sept 20, 2014
GRC Assembly Workshop
#gi2014 / @mike_schatz
Part 1:Theory
Advances in Assembly!
First PacBio RS
@ CSHL
First Hybrid
Assembly
“Perfect”
Microbes
“Perfect”
Fungi
“Perfect”
Model Orgs.
“Perfect”
Simple Ag. Genomes
“Perfect”
Human Assembly
“Perfect”
Higher Euk.
Error correction and assembly complexity of single molecule sequencing reads.
Lee, H*, Gurtowski, J*,Yoo, S, Marcus, S, McCombie,WR, Schatz, MC
http://www.biorxiv.org/content/early/2014/06/18/006395
Pan-Genome Alignment & Assembly
SplitMEM: Graphical pan-genome analysis with suffix skips
Marcus, S, Lee, H, Schatz, MC
http://biorxiv.org/content/early/2014/04/06/003954
Pan-genome colored de Bruijn graph!
•  Encodes all the sequence
relationships between the genomes!
•  How well conserved is a given
sequence? !
•  What are the pan-genome
network properties?!
Time to start considering problems
for which N complete genomes are
the input to study the “pan-genome”!
•  Available today for many microbial
species, near future for higher
eukaryotes!
A"
B"
C"
D"
Graphical pan-genome analysis
Colored de Bruijn graph
•  Node for each distinct kmer
•  Directed edge connects consecutive kmers
•  Nodes overlap by k-1 bp
AGA
GAA
AAG
TCC
GTC
AGT
TAA
ATA
GTT
TTA
AGAAGTCC!
!
ATAAGTTA!
Graphical pan-genome analysis
Colored de Bruijn graph
•  Node for each distinct kmer
•  Directed edge connects consecutive kmers
•  Nodes overlap by k-1 bp
AGAA
AAGT
GTCC
ATAA GTTA
More specifically:!
•  We aim to build the compressed de Bruijn graph as quickly as possible without
considering every distinct kmer!
AGAAGTCC!
!
ATAAGTTA!
Maximal Exact Matches (MEMs)"
to de Bruijn Graphs!
GCA!
T G C AC … G G C A A
Overlapping MEMs
T G C C AT C G C C A AC C AT
T G C C AT C G C C A AC C AT
Suffix Trees & de Bruijn Graphs
Key concepts:
•  Shared sequences form repeats called “maximal exact matches” (MEM)
•  Easy to identify MEMs in a suffix tree, but may be nested within other MEMs
•  Use “suffix skips” to quickly decompose MEMs, add in the missing nodes and edges
B. anthracis pan-genome
(9 strains)
k=25 k=1000
Microbial Pan-Genomes
E. coli (62) and B. anthracis (9) pan-genome analysis!
•  Analyzed all available strains in Genbank!
•  Space and time are effectively linear in the number of genomes!
•  O(n log g) where g is the length of the longest genome!
!
Many possible applications: !
•  Identifying “core” genes present in all strains!
•  Characterizing highly variable regions (+ flanking shared) !
•  Cataloging sequences shared by pathogenic varieties!
62 strain E. coli Pan-Genome Node Sharing!
Part 2: Practice
Genetics of Autism Spectrum Disorders
1.  Constructed database of >1M transmitted and de novo indels in ~1000 families
2.  For practical reasons, analysis is computed relative to the (unpatched) reference genome
•  We use population statistics to “clean” problematic regions
•  We believe we are missing and/or misinterpreting some interesting variants
Accurate de novo and transmitted indel detection in exome-capture data using microassembly.
Narzisi et al. (2014) Nature Methods. doi:10.1038/nmeth.3069
Indica
Total Span: 344.3 Mbp
Contig N50: 22.2kbp
Aus
Total Span: 344.9Mbp
Contig N50: 25.5kbp
Whole genome de novo assemblies of three divergent strains of rice (O. sativa)
documents novel gene space of aus and indica
Schatz, Maron, Stein et al (2014) http://biorxiv.org/content/early/2014/04/02/003764
Nipponbare
Total Span: 354.9Mbp
Contig N50: 21.9kbp
Population structure of Oryza sativa
Pan-genomics of draft assemblies
Strategy:!
1.  Align the genomes to each other
(MUMmer)!
2.  Identify segments of genome A that
do not align anywhere to genome B
(BEDTools)!
! Megabases specific to each genome!!!!!
3.  Screen regions that fail to align with
their k-mer frequencies (jellyfish)!
•  “Genome specific regions”
averaged over 10,000x kmer
coverage while unique regions
were ~50x!
! 100s of KB specific to each genome!!!!
Indica
Aus
Genome-specific Regions
Successfully able to identify many regions specific to each genome (30/30 PCR validation)!
Enriched for genes for disease resistance & other interesting phenotypes!
Pan-genomics Summary
•  Now is the time to study pan-genomes
–  Perfect assemblies of microbes and many smaller eukaryotic genomes are
now routine
–  Expect to rapidly scale up these results to larger genomes soon
•  Algorithms must scale to large collections, be robust to
errors, gaps, and ambiguity
–  Large body of assembly and alignment theory can be repurposed
–  Simple refinements, like k-mer screening, can be very effective even if the
sequence is lacking
•  The “right approach” will depend on the questions you ask
–  We all agree we need to work from a graph, but there is not a clear
consensus of what the graph should represent or how it should be encoded.
–  Ultimately the needs will be driven by applications
•  graph-BLAST, -BWA, -SAMTools, -TopHat/Cufflinks, -IGV, -UCSC, -MAKER, …
Acknowledgements
CSHL
Hannon Lab
Gingeras Lab
Jackson Lab
Hicks Lab
Iossifov Lab
Levy Lab
Lippman Lab
Lyon Lab
Martienssen Lab
McCombie Lab
Tuveson Lab
Ware Lab
Wigler Lab
Pacific Biosciences
Oxford Nanopore
Schatz Lab
Rahul Amin
Tyler Gavin
James Gurtowski
Han Fang
Hayan Lee
Maria Nattestad
Aspyn Palatnick
Srividya
Ramakrishnan
Eric Biggers
Ke Jiang
Shoshana Marcus
Giuseppe Narzisi
Rachel Sherman
Greg Vurture
Alejandro Wences
Thank you
http://schatzlab.cshl.edu
@mike_schatz
Biological Data Sciences
Anne Carpenter, Michael Schatz, Matt Wood
Nov 5 - 8, 2014

More Related Content

What's hot

Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesGenome Reference Consortium
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amGenome Reference Consortium
 
hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)Shaojun Xie
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Genome Reference Consortium
 
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesThe NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesGenome Reference Consortium
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonGenome Reference Consortium
 
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008Saul Kravitz
 

What's hot (20)

Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 am
 
Mane v2 final
Mane v2 finalMane v2 final
Mane v2 final
 
hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)
 
101717.kh miga ashg_grc
101717.kh miga ashg_grc101717.kh miga ashg_grc
101717.kh miga ashg_grc
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
Schneider grc workshop_final
Schneider grc workshop_finalSchneider grc workshop_final
Schneider grc workshop_final
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
 
2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final
 
Ashg grc workshop2014_tg
Ashg grc workshop2014_tgAshg grc workshop2014_tg
Ashg grc workshop2014_tg
 
Ashg2017 workshop schneider
Ashg2017 workshop schneiderAshg2017 workshop schneider
Ashg2017 workshop schneider
 
Grc workshop agbt2015_tg
Grc workshop agbt2015_tgGrc workshop agbt2015_tg
Grc workshop agbt2015_tg
 
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesThe NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
 
GRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slidesGRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slides
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL Hackathon
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
Ashg sedlazeck grc_share
Ashg sedlazeck grc_shareAshg sedlazeck grc_share
Ashg sedlazeck grc_share
 
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
 

Similar to Theory and practice of graphical population analysis

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotesc.titus.brown
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-researchc.titus.brown
 
Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMfnothaft
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Handsfnothaft
 
Clase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfClase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfNoraCRuizGuevara
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talkc.titus.brown
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grcc.titus.brown
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinarc.titus.brown
 
Human genome project
Human genome projectHuman genome project
Human genome projectShital Pal
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Jan Aerts
 
Comparative genomics and proteomics
Comparative genomics and proteomicsComparative genomics and proteomics
Comparative genomics and proteomicsNikhil Aggarwal
 

Similar to Theory and practice of graphical population analysis (20)

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAM
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Hands
 
Clase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfClase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdf
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
 
Microbiology Assignment Help
Microbiology Assignment HelpMicrobiology Assignment Help
Microbiology Assignment Help
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
2014 naples
2014 naples2014 naples
2014 naples
 
Human genome project
Human genome projectHuman genome project
Human genome project
 
Introduction to 16S Microbiome Analysis
Introduction to 16S Microbiome AnalysisIntroduction to 16S Microbiome Analysis
Introduction to 16S Microbiome Analysis
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
Comparative genomics and proteomics
Comparative genomics and proteomicsComparative genomics and proteomics
Comparative genomics and proteomics
 

More from Genome Reference Consortium

What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?Genome Reference Consortium
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectGenome Reference Consortium
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsGenome Reference Consortium
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesGenome Reference Consortium
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsGenome Reference Consortium
 
Graph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGraph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGenome Reference Consortium
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesGenome Reference Consortium
 

More from Genome Reference Consortium (17)

What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?
 
Genome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkitGenome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkit
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
 
Lrg and mane 16 oct 2018
Lrg and mane   16 oct 2018Lrg and mane   16 oct 2018
Lrg and mane 16 oct 2018
 
20181016 grc presentation-pa
20181016 grc presentation-pa20181016 grc presentation-pa
20181016 grc presentation-pa
 
Ashg2017 workshop tg
Ashg2017 workshop tgAshg2017 workshop tg
Ashg2017 workshop tg
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
 
AGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: LindsayAGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: Lindsay
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
Everyday de novo diploid assembly
Everyday de novo diploid assemblyEveryday de novo diploid assembly
Everyday de novo diploid assembly
 
Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 
Genome in a Bottle
Genome in a BottleGenome in a Bottle
Genome in a Bottle
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materials
 
Graph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGraph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regions
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 

Recently uploaded

module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...Monika Rani
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....muralinath2
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxDiariAli
 
Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Cherry
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...Scintica Instrumentation
 
FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.takadzanijustinmaime
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Cherry
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxRenuJangid3
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxMohamedFarag457087
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.Cherry
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 
Site specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdfSite specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdfCherry
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body Areesha Ahmad
 

Recently uploaded (20)

module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Plasmid: types, structure and functions.
Plasmid: types, structure and functions.
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Site specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdfSite specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdf
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 

Theory and practice of graphical population analysis

  • 1. Pan-genomics: theory & practice Michael Schatz Sept 20, 2014 GRC Assembly Workshop #gi2014 / @mike_schatz
  • 3.
  • 4. Advances in Assembly! First PacBio RS @ CSHL First Hybrid Assembly “Perfect” Microbes “Perfect” Fungi “Perfect” Model Orgs. “Perfect” Simple Ag. Genomes “Perfect” Human Assembly “Perfect” Higher Euk. Error correction and assembly complexity of single molecule sequencing reads. Lee, H*, Gurtowski, J*,Yoo, S, Marcus, S, McCombie,WR, Schatz, MC http://www.biorxiv.org/content/early/2014/06/18/006395
  • 5. Pan-Genome Alignment & Assembly SplitMEM: Graphical pan-genome analysis with suffix skips Marcus, S, Lee, H, Schatz, MC http://biorxiv.org/content/early/2014/04/06/003954 Pan-genome colored de Bruijn graph! •  Encodes all the sequence relationships between the genomes! •  How well conserved is a given sequence? ! •  What are the pan-genome network properties?! Time to start considering problems for which N complete genomes are the input to study the “pan-genome”! •  Available today for many microbial species, near future for higher eukaryotes! A" B" C" D"
  • 6. Graphical pan-genome analysis Colored de Bruijn graph •  Node for each distinct kmer •  Directed edge connects consecutive kmers •  Nodes overlap by k-1 bp AGA GAA AAG TCC GTC AGT TAA ATA GTT TTA AGAAGTCC! ! ATAAGTTA!
  • 7. Graphical pan-genome analysis Colored de Bruijn graph •  Node for each distinct kmer •  Directed edge connects consecutive kmers •  Nodes overlap by k-1 bp AGAA AAGT GTCC ATAA GTTA More specifically:! •  We aim to build the compressed de Bruijn graph as quickly as possible without considering every distinct kmer! AGAAGTCC! ! ATAAGTTA!
  • 8. Maximal Exact Matches (MEMs)" to de Bruijn Graphs! GCA! T G C AC … G G C A A
  • 9. Overlapping MEMs T G C C AT C G C C A AC C AT T G C C AT C G C C A AC C AT
  • 10. Suffix Trees & de Bruijn Graphs Key concepts: •  Shared sequences form repeats called “maximal exact matches” (MEM) •  Easy to identify MEMs in a suffix tree, but may be nested within other MEMs •  Use “suffix skips” to quickly decompose MEMs, add in the missing nodes and edges
  • 11. B. anthracis pan-genome (9 strains) k=25 k=1000
  • 12. Microbial Pan-Genomes E. coli (62) and B. anthracis (9) pan-genome analysis! •  Analyzed all available strains in Genbank! •  Space and time are effectively linear in the number of genomes! •  O(n log g) where g is the length of the longest genome! ! Many possible applications: ! •  Identifying “core” genes present in all strains! •  Characterizing highly variable regions (+ flanking shared) ! •  Cataloging sequences shared by pathogenic varieties! 62 strain E. coli Pan-Genome Node Sharing!
  • 14. Genetics of Autism Spectrum Disorders 1.  Constructed database of >1M transmitted and de novo indels in ~1000 families 2.  For practical reasons, analysis is computed relative to the (unpatched) reference genome •  We use population statistics to “clean” problematic regions •  We believe we are missing and/or misinterpreting some interesting variants Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Narzisi et al. (2014) Nature Methods. doi:10.1038/nmeth.3069
  • 15. Indica Total Span: 344.3 Mbp Contig N50: 22.2kbp Aus Total Span: 344.9Mbp Contig N50: 25.5kbp Whole genome de novo assemblies of three divergent strains of rice (O. sativa) documents novel gene space of aus and indica Schatz, Maron, Stein et al (2014) http://biorxiv.org/content/early/2014/04/02/003764 Nipponbare Total Span: 354.9Mbp Contig N50: 21.9kbp Population structure of Oryza sativa
  • 16. Pan-genomics of draft assemblies Strategy:! 1.  Align the genomes to each other (MUMmer)! 2.  Identify segments of genome A that do not align anywhere to genome B (BEDTools)! ! Megabases specific to each genome!!!!! 3.  Screen regions that fail to align with their k-mer frequencies (jellyfish)! •  “Genome specific regions” averaged over 10,000x kmer coverage while unique regions were ~50x! ! 100s of KB specific to each genome!!!! Indica Aus
  • 17. Genome-specific Regions Successfully able to identify many regions specific to each genome (30/30 PCR validation)! Enriched for genes for disease resistance & other interesting phenotypes!
  • 18. Pan-genomics Summary •  Now is the time to study pan-genomes –  Perfect assemblies of microbes and many smaller eukaryotic genomes are now routine –  Expect to rapidly scale up these results to larger genomes soon •  Algorithms must scale to large collections, be robust to errors, gaps, and ambiguity –  Large body of assembly and alignment theory can be repurposed –  Simple refinements, like k-mer screening, can be very effective even if the sequence is lacking •  The “right approach” will depend on the questions you ask –  We all agree we need to work from a graph, but there is not a clear consensus of what the graph should represent or how it should be encoded. –  Ultimately the needs will be driven by applications •  graph-BLAST, -BWA, -SAMTools, -TopHat/Cufflinks, -IGV, -UCSC, -MAKER, …
  • 19. Acknowledgements CSHL Hannon Lab Gingeras Lab Jackson Lab Hicks Lab Iossifov Lab Levy Lab Lippman Lab Lyon Lab Martienssen Lab McCombie Lab Tuveson Lab Ware Lab Wigler Lab Pacific Biosciences Oxford Nanopore Schatz Lab Rahul Amin Tyler Gavin James Gurtowski Han Fang Hayan Lee Maria Nattestad Aspyn Palatnick Srividya Ramakrishnan Eric Biggers Ke Jiang Shoshana Marcus Giuseppe Narzisi Rachel Sherman Greg Vurture Alejandro Wences
  • 20. Thank you http://schatzlab.cshl.edu @mike_schatz Biological Data Sciences Anne Carpenter, Michael Schatz, Matt Wood Nov 5 - 8, 2014