Adam M. Phillippy
Head, Genome Informatics Section
40 Years of Genome Assembly:
Are We Done Yet?
@aphillippy
Timeline: 1980, 1995, 2001, 2010, 2012, 2014, 2020
• Genome assembly’s 40th anniversary
• Rodger Staden (1979)
• “With modern fast sequencing techniques [1,2] and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps.”
A strategy of DNA sequencing employing computer programs. Staden. Nucleic Acids Research (1979)
• Shotgun assembly
• 1995: Haemophilus influenzae
• 1995: Overlap graphs
• 1995: de Bruijn graphs
• The first human genome
• 2000: Celera Assembler
• 2001: The human genome
• Shotgun sequencing era
Input
Extraction
Sequencing
Assembly
Output
• Long-read shotgun sequencing
• First complete de novo assemblies
• 2012: Bacteria (10^6 bp)
[Figure: panels labeled Class I and Class II; genomes shown include Yersinia pestis CO92 and Bacillus anthracis Ames, plus a third label truncated as "Esche O26:H"]
• First complete de novo assemblies
• 2012: Bacteria (10^6 bp)
• 2014: Yeast (10^7 bp)
• First complete de novo assemblies
• 2012: Bacteria (10^6 bp)
• 2014: Yeast (10^7 bp)
• 2014: Drosophila (10^8 bp)
[Figure: Drosophila chromosome arms 2L, 2R, 3L, 3R, and X]
• First complete de novo assemblies
• 2012: Bacteria (10^6 bp)
• 2014: Yeast (10^7 bp)
• 2014: Drosophila (10^8 bp)
• ????: Human (10^9 bp)
Assembly is solved:
Sequence all the things!
VertebrateGenomesProject.org
• HQ Reference assemblies
• >1 Mb contig N50
• Scaffolds == chromosomes
• 99.99% average base quality
• Sequencing Technology
• Long reads: PacBio
• Linked reads: 10x Genomics
• Optical maps: BioNano
• Cross linking: Arima Hi-C
Vertebrate Genomes Project
Erich Jarvis, chairperson – worldwide consortium of universities, museums, zoos, etc.
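The ">1 Mb contig N50" target above depends on how N50 is computed, so here is a minimal Python sketch of the usual definition. It is not taken from the VGP pipeline; the function name `n50` and the toy contig lengths are illustrative assumptions. Passing a `genome_size` switches the target from half the assembly length to half the estimated genome length, which gives NG50.

```python
def n50(lengths, genome_size=None):
    """Return the N50 of `lengths`, or the NG50 if `genome_size` is given.

    N50: the length L such that contigs of length >= L contain at least
    half of the total assembled bases. NG50 uses half of the estimated
    genome size as the target instead of half of the assembly size.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the assembly covers less than half of genome_size


# Toy example (hypothetical contig lengths, arbitrary units).
contigs = [80, 70, 50, 40, 30, 20]
contig_n50 = n50(contigs)                     # half of 290 is reached at the 70 contig
contig_ng50 = n50(contigs, genome_size=400)   # half of 400 is reached at the 50 contig
```

NG50 is the fairer number when comparing assemblies of different total size, because the target no longer shrinks when an assembly is incomplete.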
• Orders: ~250
• Families: ~1,000
• Genera: ~10,000 (G10K)
• Species: ~60,000 (B10K, Bat1K)
VGP Assembly Working Group
VGP Assembly Pipeline
• PacBio: contigging + purging
• 10XG: scaffolding
• BioNano: scaffolding
• Hi-C: scaffolding
• Polishing, gap-filling & curation
• Final assembly
[Figure: read pileup illustrating polishing; primary and alternate haplotypes across exons 1–3]
• vgp.github.io
• 86 species currently posted
• 24 with all four data types
The GenomeArk
Jennifer Vashon of Maine Department of Inland Fisheries and Wildlife, left, and
UMass lynx team coordinator, Tanya Lama, with an adult male lynx from northern
Maine whose DNA was used to create first-ever whole genome for the species.
The lynx has since been released to the wild. (MassWildlife photo / Bill Byrne)
VGP Phase 1: What did we learn?
• Iterative assembly process is not ideal
• Errors carry over and are hard to correct
• Data integration is hard
• Most tools built for a single technology
• Little reward for building complex, integrated systems
• Need to decentralize
• Open data, standard formats, modular frameworks
• Nobody* likes building infrastructure
Assembly is hard
• P(Asm|Data) ∝ P(Data|Asm)
• Read coverage
• Hi-C heatmaps
• k-mer recovery
• Comparative annotation
Assembly validation is critical
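The relation P(Asm|Data) ∝ P(Data|Asm) says an assembly can be judged by how well it explains the raw data. As one illustration, k-mer recovery can be sketched as the fraction of distinct read k-mers that also occur in the assembly. This is a simplified sketch, not the VGP's actual validation tooling: it ignores reverse complements, k-mer multiplicity, and sequencing errors, and the function names and toy sequences are assumptions.

```python
def kmers(seq, k):
    """Return the set of all k-mers in `seq` (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_recovery(reads, assembly, k):
    """Fraction of distinct read k-mers that are present in the assembly."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    if not read_kmers:
        raise ValueError("reads contain no k-mers of the requested size")
    return len(read_kmers & kmers(assembly, k)) / len(read_kmers)


# Toy example: the third read carries sequence missing from the assembly.
assembly = "ACGTACGGA"
reads = ["ACGTAC", "TACGGA", "GGATTT"]
recovery = kmer_recovery(reads, assembly, k=4)  # 6 of 9 read k-mers are found
```

With real data, erroneous reads contribute k-mers that no correct assembly contains, so k-mers are normally filtered by count before a comparison like this.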
• Cannot map short reads to repeats
• Therefore, cannot effectively polish/assemble with short reads
• Long read assemblies more accurate in repeats (e.g. HLA, rRNA)
• PacBio can exceed 99.999% accuracy (QV50)
Long read polishing is essential
In some regions, short-read polishing can actually harm the assembly
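The pairing of 99.999% accuracy with QV50 follows from the Phred scale, QV = -10 · log10(error rate). A small Python sketch of the conversion follows; the function names are illustrative.

```python
import math


def accuracy_to_qv(accuracy):
    """Convert a per-base accuracy (e.g. 0.9999) to a Phred-scaled QV."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1 - accuracy)


def qv_to_accuracy(qv):
    """Convert a Phred-scaled QV back to a per-base accuracy."""
    return 1 - 10 ** (-qv / 10)


qv_vgp_target = accuracy_to_qv(0.9999)    # the 99.99% base quality target
qv_pacbio = accuracy_to_qv(0.99999)       # the 99.999% figure quoted above
```

Each additional "nine" of accuracy adds 10 to the QV, so QV40 is one error per 10,000 bases and QV50 is one per 100,000.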
Oddballs
• Marmoset chimeras
• Zebra finch GRCs
• Platypus sex chrs (10!)
• Lamprey genome deletions
• Fish with spikes and stripes
Not all vertebrates are created equal
Contig N50 (Mb)
Repeats (%)
Mixed haplotypes can introduce indels
CGTTAAAGC
CGTTAAAGC
CGTTAAAGC
CGTTTAAGC
CGTTTAAGC
CGTTTAAAGC
CGTT-AAAGC
CGTT-AAAGC
CGTTTAA-GC
CGTTTAA-GC
P(sub) = 0.01
P(ins) = 0.12
P(del) = 0.02
P(mat) = 0.85
P(mat)^34 × P(sub)^2 = 3.983304e-07
P(mat)^36 × P(ins)^4 = 5.967691e-07
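The two products on this slide can be reproduced directly. The sketch below multiplies the per-event probabilities given above for two competing explanations of the same reads: 34 matches with 2 substitutions, versus 36 matches with 4 insertions. Reading the two lines as competing consensus hypotheses is an interpretation of the slide, and the dictionary layout and names are illustrative.

```python
# Per-event probabilities from the slide (they sum to 1).
P = {"mat": 0.85, "sub": 0.01, "ins": 0.12, "del": 0.02}


def likelihood(event_counts):
    """Probability of an alignment, treating every event as independent."""
    result = 1.0
    for event, count in event_counts.items():
        result *= P[event] ** count
    return result


substitution_model = likelihood({"mat": 34, "sub": 2})  # about 3.98e-07
insertion_model = likelihood({"mat": 36, "ins": 4})     # about 5.97e-07
```

Because insertions are twelve times more probable than substitutions under this error model, the explanation with four insertions scores higher than the one with two substitutions. That is how reads from two haplotypes that truly differ by a substitution can push a consensus caller toward an indel.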
Heterozygosity can lead to false duplications
FALCON-Unzip (P: primary, A: alternate)
              Finch            Fish
              P       A        P       A
Size (Gbp)    1.09    0.94     1.95    0.73
NG50 (Mbp)    3.0     0.6      2.6     0.02
BUSCO (c)     93.9    82.1     94.2    40.6
BUSCO (d)     5.0     3.3      20.8    3.4
              1.2%             1.6%
Assemble the genomes
De novo assembly of haplotype-resolved genomes with trio binning.
Koren, Rhie, et al. Nature Biotechnology (2018)
[Figure: Dam × Sire → F1 cross. Parental k-mers sort the F1 reads into sire-haplotype, dam-haplotype, and unassigned bins; the binned reads are assembled into separate sire and dam assemblies.]
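The binning step of trio binning can be sketched in a few lines of Python. This is a simplified illustration rather than the TrioCanu implementation: it uses exact forward-strand k-mers, a plain majority vote with no normalization for the size of each parental k-mer set, and toy sequences. All function names are assumptions.

```python
def kmers(seq, k):
    """Return the set of all k-mers in `seq` (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def haplotype_specific_kmers(dam_reads, sire_reads, k):
    """Return (dam-only, sire-only) k-mer sets from the parental reads."""
    dam = set().union(*(kmers(r, k) for r in dam_reads))
    sire = set().union(*(kmers(r, k) for r in sire_reads))
    return dam - sire, sire - dam


def bin_read(read, dam_only, sire_only, k):
    """Assign an F1 read to the parent whose private k-mers it shares most."""
    read_kmers = kmers(read, k)
    dam_hits = len(read_kmers & dam_only)
    sire_hits = len(read_kmers & sire_only)
    if dam_hits > sire_hits:
        return "dam"
    if sire_hits > dam_hits:
        return "sire"
    return "unassigned"  # no informative k-mers, or a tie


# Toy parents that differ at a single base.
dam_only, sire_only = haplotype_specific_kmers(["ACGTACGT"], ["ACGTTCGT"], k=4)
bins = [bin_read(r, dam_only, sire_only, k=4) for r in ("CGTACG", "GTTCGT", "ACGT")]
```

Each bin is then assembled on its own, so the assembler never sees reads from both haplotypes at once. Unassigned reads come from regions where the parents are identical.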
Correctly resolved alleles with TrioBinning
              Finch                               Fish
              FALCON-Unzip     TrioCanu           FALCON-Unzip     TrioCanu
Size (Gbp)    1.09    0.94     1.05    1.06       1.95    0.73     1.37    1.36
NG50 (Mbp)    3.0     0.6      3.6     4.0        2.6     0.02     2.6     2.1
BUSCO (c)     93.9    82.1     94.4    93.3       94.2    40.6     91.6    92.7
BUSCO (d)     5.0     3.3      1.4     1.3        20.8    3.4      3.5     3.4
              1.2%                                1.6%
Esperanza: A nearly perfect diploid
125x PacBio coverage (~60x per haplotype), no Illumina polishing needed, TrioCanu haplotig NG50 70 Mbp, BUSCOs 94%
[Figure: chromosome-scale haplotype assemblies of Esperanza, chromosomes 1–29 and X, shown for the dam (yak) and sire (Highland) haplotypes]
Can we finally finish the human
genome?
• The human reference genome is incomplete
• 368 unresolved issues, 102 gaps
• Segmental duplications, satellites, rDNAs
• Centromeres, telomeres, heterochromatin
• These gaps contain important information
• Missing reference sequence leads to analysis artifacts
• Variation in these gaps is unexplored (e.g. rDNAs)
• We don’t know what we don’t know…
We need to finish the genome
Our target: CHM13hTERT
Cell line from Urvashi Surti, Pitt; SKY karyotype from Jennifer Gerton and Tamara Potapova, Stowers
N=46; XX
• Repeats are long, reads are short
• “If the overlap is of sufficient length to distinguish
it from being a repeat in the sequence the two
sequences must be contiguous.”
— Rodger Staden, 1979
What’s the problem?
• How long are the repeats?
• 7 kbp LINEs
• 1 Mbp+ rDNA arrays
• 1 Mbp+ centromere arrays
• 10 Mbp+ heterochromatin blocks
• Coverage and accuracy matter too
• 1,000X of 100 bp reads at 100% accuracy? NO
• 10X of 10,000,000 bp reads at 100% accuracy? YES
• 100X of 100,000 bp reads at 90% accuracy? MAYBE
How long do reads need to be, for human?
>50% of the genome
• Length at the expense of throughput
• Read lengths >1 Mbp possible
Ultra-long nanopore sequencing
Nanopore sequencing and assembly of a human genome with ultra-long reads.
Jain et al. Nature Biotechnology (2018)
• Prediction: 30x raw UL coverage == GRCh38
How much do we need?
Nanopore sequencing and assembly of a human genome with ultra-long reads.
Jain et al. Nature Biotechnology (2018)
• 30x Nanopore ultra-long
• Contig building
• 60x PacBio
• Polishing
• 50x 10x Genomics
• Polishing
• BioNano
• Structural validation
We need long reads. Lots of long reads
• Nanopore UL read length distribution is long tailed
It pays to go deep
repeat
• From May 1 – October 29, 2018
• 62 MinION/GridION flow cells
• 8.9M reads, 98 Gb, 1.6 Gb / cell
• N50 read length 76 kb
• 44 Gb in reads >100 kb
• Max read length 1.03 Mb
• Assembled with Canu
CHM13 sequencing
Now upwards of 90+ flow cells and counting…
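The run statistics above can be cross-checked with simple arithmetic. The sketch below derives per-cell yield, mean read length, and approximate genome coverage from the quoted totals. The 3.1 Gb genome size is an assumption that does not appear on the slide, and the variable names are illustrative.

```python
# Totals quoted on the slide.
TOTAL_BASES = 98e9        # 98 Gb
FLOW_CELLS = 62
TOTAL_READS = 8.9e6
BASES_OVER_100KB = 44e9   # 44 Gb in reads >100 kb

# Assumption: approximate human genome size, not stated on the slide.
GENOME_SIZE = 3.1e9

yield_per_cell = TOTAL_BASES / FLOW_CELLS            # about 1.6 Gb per flow cell
mean_read_length = TOTAL_BASES / TOTAL_READS         # about 11 kb
coverage = TOTAL_BASES / GENOME_SIZE                 # about 32x overall
ultra_long_coverage = BASES_OVER_100KB / GENOME_SIZE # about 14x in reads >100 kb
```

The mean read length (about 11 kb) is far below the 76 kb read N50, which is what a long-tailed length distribution looks like: most reads are short, but most bases sit in long reads.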
The human genome, 2001
ref28 NG50 contig 0.5 Mbp
The human genome, 2019
CHM13 NG50 contig 75 Mbp (70x PacBio + 35x UL ONT)
[Figure: CHM13 Canu assembly across chromosomes 1–22 and X]
The first complete assembly
of a human chromosome
A complete X chromosome
ddPCR
• Unique structural variants from PacBio
• Unique k-mers confirmed by Duplex-Seq
Stitching across the X centromere
An assembly is a hypothesis
• Per read error rates between 5–15%
• Latest Nanopore > PacBio
• Consensus accuracy >99.9%
• After Nanopore polishing QV30
• After PacBio polishing QV40
• BAC validation
• >85% of BACs at >99.8% identity
• vs. 54% for prior PacBio asm
What about the error rate?
BAC analysis courtesy of Eichler lab @ UW
88.0 / 90.6 / 92.4
• ChrX GAGE gene locus
• 19 tandemly arrayed ~9.4 kb repeats
• Corrupted by mapping/polishing pipeline
Repeat collapse analysis
Mitchell Vollger @ UW
• Mappers prefer the “best” alignment
• Consensus can be of variable quality (patches)
• Best mapping not always the correct mapping
• Marker-based anchoring
• Increase number of secondary alignments returned
• Redefine mapping quality to measure single-copy k-mer agreement between read and assembly
Unique k-mer mapping
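The idea of scoring a read placement by single-copy k-mer agreement can be sketched as follows. This is an illustration only, not the pipeline used for CHM13: it uses exact forward-strand k-mers, takes candidate start positions as given, and runs on a toy assembly containing two copies of the repeat GATTACA. All function names are assumptions.

```python
from collections import Counter


def kmer_list(seq, k):
    """Return every k-mer in `seq`, in order (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def single_copy_kmers(assembly, k):
    """K-mers that occur exactly once in the assembly."""
    counts = Counter(kmer_list(assembly, k))
    return {kmer for kmer, n in counts.items() if n == 1}


def unique_kmer_agreement(read, assembly, start, unique, k):
    """Count single-copy k-mers shared by the read and the window it is placed on."""
    window = assembly[start:start + len(read)]
    window_unique = set(kmer_list(window, k)) & unique
    return len(window_unique & set(kmer_list(read, k)))


def best_placement(read, assembly, candidate_starts, k):
    """Pick the candidate start with the highest single-copy k-mer agreement."""
    unique = single_copy_kmers(assembly, k)
    return max(candidate_starts,
               key=lambda s: unique_kmer_agreement(read, assembly, s, unique, k))


# Toy assembly: flank, repeat, flank, repeat, flank.
assembly = "AAAC" + "GATTACA" + "TTTG" + "GATTACA" + "CCCA"
read = assembly[13:24]  # spans the second repeat copy and its flanks
placement = best_placement(read, assembly, candidate_starts=[2, 13], k=4)
```

K-mers inside the repeat occur twice and are ignored, so only the flanking single-copy k-mers decide between the two candidate positions. That is the property wanted when the repeat consensus itself is of uneven quality.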
Before:
After:
Centromere array validation
Jennifer Gerton @ Stowers
Centromere array validation
Beth Sullivan @ Duke
1.8 Mb
0.7 Mb
0.3 Mb
It’s time to finish the human genome
• Almost!
• Have proven it’s possible for the X chromosome
• T2T assembly of all chrs within the next 2 years
• Challenges
• REPEATS, REPEATS, REPEATS
• Heterozygosity: diploids, polyploids, metagenomes
• Nanopore-only consensus quality
• Targeted long-read sequencing
Are we there yet?
• github.com/nanopore-wgs-consortium/chm13
• Draft whole-genome assemblies
• Nanopore ultra-long reads
• 10x Genomics reads
• BioNano DLS (WashU)
• PacBio (SRA)
• Coming soon:
• Arima Genomics Hi-C
• PacBio CCS
• Strand-Seq
All CHM13 data is openly released
NHGRI
• Sergey Koren
• Arang Rhie
• Jim Mullikin
• Alice Young
• Shelise Brooks
• Valerie Maduro
• Gerard Bouffard
• Sofia Barreira
• Andy Baxevanis
• Nancy Hansen
• Karen Miga, UCSC
• Jennifer Gerton, Stowers
• Tamara Potapova, Stowers
• Beth Sullivan, Duke
• Tina Graves Lindsay, WashU
• Ira Hall, WashU
• Valerie Schneider, NCBI
• Kerstin Howe, Sanger
• Jo Wood, Sanger
• Matt Loose, Nottingham
• Nick Loman, Birmingham
• Urvashi Surti, Pitt (ret.)
Acknowledgements
Evan Eichler, Mitchell Vollger, Glennis Logsdon, David Porubsky, Melanie Sorensen
It’s time to finish the human genome
Google “t2t consortium” – I’ll be hiring in the fall
The Telomere-to-Telomere (T2T) consortium is an
open, community-based effort to generate the
first complete assembly of a human genome.

More Related Content

What's hot

Structural Variation Detection
Structural Variation DetectionStructural Variation Detection
Structural Variation DetectionJennifer Shelton
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
 
Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...tuxette
 
Introduction to Machine Learning with TensorFlow
Introduction to Machine Learning with TensorFlowIntroduction to Machine Learning with TensorFlow
Introduction to Machine Learning with TensorFlowPaolo Tomeo
 
NGS data analysis Overview
NGS data analysis Overview NGS data analysis Overview
NGS data analysis Overview Ravi Gandham
 
Introduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) TechnologyIntroduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) TechnologyQIAGEN
 
Machine learning in biology
Machine learning in biologyMachine learning in biology
Machine learning in biologyPranavathiyani G
 
Blast bioinformatics
Blast bioinformaticsBlast bioinformatics
Blast bioinformaticsatmapandey
 
Comparison between RNASeq and Microarray for Gene Expression Analysis
Comparison between RNASeq and Microarray for Gene Expression AnalysisComparison between RNASeq and Microarray for Gene Expression Analysis
Comparison between RNASeq and Microarray for Gene Expression AnalysisYaoyu Wang
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...VHIR Vall d’Hebron Institut de Recerca
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity RecognitionTomer Lieber
 
Multi Omics Approach in Medicine
Multi Omics Approach in MedicineMulti Omics Approach in Medicine
Multi Omics Approach in MedicineShreya Gupta
 
An Introduction to Machine Learning and Genomics
An Introduction to Machine Learning and GenomicsAn Introduction to Machine Learning and Genomics
An Introduction to Machine Learning and GenomicsBrittany Lasseigne, Ph.D.
 
An Overview to Protein bioinformatics
An Overview to Protein bioinformaticsAn Overview to Protein bioinformatics
An Overview to Protein bioinformaticsJoel Ricci-López
 

What's hot (20)

What is Machine Learning
What is Machine LearningWhat is Machine Learning
What is Machine Learning
 
Structural Variation Detection
Structural Variation DetectionStructural Variation Detection
Structural Variation Detection
 
Genomic Data Analysis
Genomic Data AnalysisGenomic Data Analysis
Genomic Data Analysis
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 
Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...
 
Introduction to Machine Learning with TensorFlow
Introduction to Machine Learning with TensorFlowIntroduction to Machine Learning with TensorFlow
Introduction to Machine Learning with TensorFlow
 
ChIP-seq Theory
ChIP-seq TheoryChIP-seq Theory
ChIP-seq Theory
 
NGS data analysis Overview
NGS data analysis Overview NGS data analysis Overview
NGS data analysis Overview
 
Introduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) TechnologyIntroduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) Technology
 
Machine learning in biology
Machine learning in biologyMachine learning in biology
Machine learning in biology
 
Blast bioinformatics
Blast bioinformaticsBlast bioinformatics
Blast bioinformatics
 
Comparison between RNASeq and Microarray for Gene Expression Analysis
Comparison between RNASeq and Microarray for Gene Expression AnalysisComparison between RNASeq and Microarray for Gene Expression Analysis
Comparison between RNASeq and Microarray for Gene Expression Analysis
 
Data mining
Data miningData mining
Data mining
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
Data mining
Data miningData mining
Data mining
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
Multi Omics Approach in Medicine
Multi Omics Approach in MedicineMulti Omics Approach in Medicine
Multi Omics Approach in Medicine
 
An Introduction to Machine Learning and Genomics
An Introduction to Machine Learning and GenomicsAn Introduction to Machine Learning and Genomics
An Introduction to Machine Learning and Genomics
 
Ch06 alignment
Ch06 alignmentCh06 alignment
Ch06 alignment
 
An Overview to Protein bioinformatics
An Overview to Protein bioinformaticsAn Overview to Protein bioinformatics
An Overview to Protein bioinformatics
 

Similar to 40 Years of Genome Assembly: Are We Done Yet?

Telomere-to-telomere assembly of a complete human X chromosome
Telomere-to-telomere assembly of a complete human X chromosomeTelomere-to-telomere assembly of a complete human X chromosome
Telomere-to-telomere assembly of a complete human X chromosomeAdam Phillippy
 
How giab fits in the rest of the world telomere to telomere consortium
How giab fits in the rest of the world   telomere to telomere consortiumHow giab fits in the rest of the world   telomere to telomere consortium
How giab fits in the rest of the world telomere to telomere consortiumGenomeInABottle
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesGenome Reference Consortium
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotesc.titus.brown
 
Next generation sequencing methods
Next generation sequencing methods Next generation sequencing methods
Next generation sequencing methods Mrinal Vashisth
 
High Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowHigh Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowBrian Krueger
 
Lecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsLecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsfmaumus
 
Tetrahymena genome project update 2004 by Jonathan Eisen
Tetrahymena genome project update 2004 by Jonathan EisenTetrahymena genome project update 2004 by Jonathan Eisen
Tetrahymena genome project update 2004 by Jonathan EisenJonathan Eisen
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation SequencingFarid MUSA
 
Next Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology OverviewNext Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology OverviewDominic Suciu
 
NGS Pipeline Preparation - Tools Selection
NGS Pipeline Preparation - Tools SelectionNGS Pipeline Preparation - Tools Selection
NGS Pipeline Preparation - Tools SelectionMinesh A. Jethva
 
High Throughput Sequencing Technologies: On the path to the $0* genome
High Throughput Sequencing Technologies: On the path to the $0* genomeHigh Throughput Sequencing Technologies: On the path to the $0* genome
High Throughput Sequencing Technologies: On the path to the $0* genomeBrian Krueger
 
The Human Genome Project - Part I
The Human Genome Project - Part IThe Human Genome Project - Part I
The Human Genome Project - Part Ihhalhaddad
 
CALS_Stewards_of_Future_2015_Yow_IsoSeq
CALS_Stewards_of_Future_2015_Yow_IsoSeqCALS_Stewards_of_Future_2015_Yow_IsoSeq
CALS_Stewards_of_Future_2015_Yow_IsoSeqAshley Yow
 
Human genome project
Human genome projectHuman genome project
Human genome projectRakesh R
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
AdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.pptAdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.pptRuthMWinnie
 

Similar to 40 Years of Genome Assembly: Are We Done Yet? (20)

Telomere-to-telomere assembly of a complete human X chromosome
Telomere-to-telomere assembly of a complete human X chromosomeTelomere-to-telomere assembly of a complete human X chromosome
Telomere-to-telomere assembly of a complete human X chromosome
 
How giab fits in the rest of the world telomere to telomere consortium
How giab fits in the rest of the world   telomere to telomere consortiumHow giab fits in the rest of the world   telomere to telomere consortium
How giab fits in the rest of the world telomere to telomere consortium
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
Next generation sequencing methods
Next generation sequencing methods Next generation sequencing methods
Next generation sequencing methods
 
High Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowHigh Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can Know
 
HMD_Sequencing_KIBGE_KCHI.pptx
HMD_Sequencing_KIBGE_KCHI.pptxHMD_Sequencing_KIBGE_KCHI.pptx
HMD_Sequencing_KIBGE_KCHI.pptx
 
Lecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsLecture on the annotation of transposable elements
Lecture on the annotation of transposable elements
 
Microbial physiology in genomic era
Microbial physiology in genomic eraMicrobial physiology in genomic era
Microbial physiology in genomic era
 
Tetrahymena genome project update 2004 by Jonathan Eisen
Tetrahymena genome project update 2004 by Jonathan EisenTetrahymena genome project update 2004 by Jonathan Eisen
Tetrahymena genome project update 2004 by Jonathan Eisen
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation Sequencing
 
Next Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology OverviewNext Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology Overview
 
NGS Pipeline Preparation - Tools Selection
NGS Pipeline Preparation - Tools SelectionNGS Pipeline Preparation - Tools Selection
NGS Pipeline Preparation - Tools Selection
 
High Throughput Sequencing Technologies: On the path to the $0* genome
High Throughput Sequencing Technologies: On the path to the $0* genomeHigh Throughput Sequencing Technologies: On the path to the $0* genome
High Throughput Sequencing Technologies: On the path to the $0* genome
 
The Human Genome Project - Part I
The Human Genome Project - Part IThe Human Genome Project - Part I
The Human Genome Project - Part I
 
CALS_Stewards_of_Future_2015_Yow_IsoSeq
CALS_Stewards_of_Future_2015_Yow_IsoSeqCALS_Stewards_of_Future_2015_Yow_IsoSeq
CALS_Stewards_of_Future_2015_Yow_IsoSeq
 
Human genome project
Human genome projectHuman genome project
Human genome project
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
AdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.pptAdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
AdamAmeur_SciLife_Bioinfo_course_Nov2015.ppt
 

Recently uploaded

Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 

Recently uploaded (20)

Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
GBSN - Biochemistry (Unit 1)

40 Years of Genome Assembly: Are We Done Yet?

  • 1. Adam M. Phillippy Head, Genome Informatics Section 40 Years of Genome Assembly: Are We Done Yet? @aphillippy
  • 2. Timeline: 1980, 1995, 2001, 2010, 2012, 2014, 2020 • Genome assembly’s 40th anniversary • Rodger Staden (1979) • “With modern fast sequencing techniques¹,² and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps.” A strategy of DNA sequencing employing computer programs. Staden. Nucleic Acids Research (1979)
  • 3. • Shotgun assembly • 1995: Haemophilus influenzae • 1995: Overlap graphs • 1995: de Bruijn graphs
  • 4. • The first human genome • 2000: Celera Assembler • 2001: The human genome
  • 5. • Shotgun sequencing era • Input, Extraction, Sequencing, Assembly, Output
  • 7. • First complete de novo assemblies • 2012: Bacteria (10^6 bp) • [Figure: Class I and Class II genomes; labels include Yersinia pestis CO92, Bacillus anthracis Ames, and the truncated "Esche O26:H"]
  • 8. • First complete de novo assemblies • 2012: Bacteria (10^6 bp) • 2014: Yeast (10^7 bp)
  • 9. • First complete de novo assemblies • 2012: Bacteria (10^6 bp) • 2014: Yeast (10^7 bp) • 2014: Drosophila (10^8 bp) • [Figure: chromosome arms 2L, 2R, 3L, 3R, X]
  • 10. • First complete de novo assemblies • 2012: Bacteria (10^6 bp) • 2014: Yeast (10^7 bp) • 2014: Drosophila (10^8 bp) • ????: Human (10^9 bp)
  • 11. Assembly is solved: Sequence all the things!
  • 13. Vertebrate Genomes Project • Erich Jarvis, chairperson – worldwide consortium of universities, museums, zoos, etc. • HQ Reference assemblies • >1 Mb contig N50 • Scaffolds == chromosomes • 99.99% average base quality • Sequencing Technology • Long reads: PacBio • Linked reads: 10x Genomics • Optical maps: BioNano • Cross linking: Arima Hi-C • [Figure: ~250 orders, ~1,000 families, ~10,000 genera, ~60,000 species; labels G10K, B10K, Bat1K]
  • 15. VGP Assembly Pipeline • Data types: PacBio, 10XG, BioNano, Hi-C • Stages shown: Contigging + Purging, Scaffolding, Polishing, Gap-filling & Curation, Final assembly • [Figure labels: exon 1, exon 2, exon 3; Primary, Alternate]
  • 16. The GenomeArk • vgp.github.io • 86 species currently posted • 24 with all four data types • Photo: Jennifer Vashon of Maine Department of Inland Fisheries and Wildlife, left, and UMass lynx team coordinator, Tanya Lama, with an adult male lynx from northern Maine whose DNA was used to create the first-ever whole genome for the species. The lynx has since been released to the wild. (MassWildlife photo / Bill Byrne)
  • 17. VGP Phase 1: What did we learn?
  • 18. • Iterative assembly process is not ideal • Errors carry over and are hard to correct • Data integration is hard • Most tools built for a single technology • Little reward for building complex, integrated systems • Need to decentralize • Open data, standard formats, modular frameworks • Nobody* likes building infrastructure Assembly is hard
  • 19. • P(Asm|Data) ∝ P(Data|Asm) • Read coverage • Hi-C heatmaps • k-mer recovery • Comparative annotation Assembly validation is critical
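As a rough illustration of the k-mer recovery check listed on slide 19, the sketch below measures what fraction of k-mers observed in the reads also appear in the assembly. The function names, the default `k`, and the `min_count` filter for likely sequencing errors are assumptions made for this example; it also ignores reverse complements, which a real validation tool would handle.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def kmer_recovery(reads, assembly, k=5, min_count=2):
    """Fraction of 'solid' read k-mers (seen at least min_count times)
    that are also present in the assembly."""
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    # Low-count k-mers are treated as probable sequencing errors (assumption).
    solid = {km for km, count in read_counts.items() if count >= min_count}
    if not solid:
        return 0.0
    assembly_kmers = set(kmers(assembly, k))
    return len(solid & assembly_kmers) / len(solid)
```

A value near 1.0 suggests the assembly contains almost everything the reads support; a low value points at missing or collapsed sequence.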
  • 20. Long read polishing is essential • Cannot map short reads to repeats • Therefore, cannot effectively polish/assemble with short reads • Long read assemblies more accurate in repeats (e.g. HLA, rRNA) • PacBio can exceed 99.999% accuracy (QV50) • In some regions, short-read polishing can actually harm the assembly
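The accuracy and QV figures on slide 20 are related by the standard Phred scale, QV = -10 · log10(error rate). A minimal conversion sketch follows; the function names are made up for this example.

```python
import math

def accuracy_to_qv(accuracy):
    """Convert a fractional accuracy (e.g. 0.99999) to a Phred-scaled QV."""
    error = 1.0 - accuracy
    if error <= 0:
        raise ValueError("accuracy must be below 1.0")
    return -10.0 * math.log10(error)

def qv_to_accuracy(qv):
    """Convert a Phred-scaled QV back to a fractional accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)
```

Under this scale, 99.999% accuracy corresponds to about QV50, and the QV30 and QV40 values quoted later in the talk correspond to 99.9% and 99.99%.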
  • 21. Not all vertebrates are created equal • Oddballs • Marmoset chimeras • Zebra finch GRCs • Platypus sex chrs (10!) • Lamprey genome deletions • Fish with spikes and stripes • [Plot axes: Contig N50 (Mb), Repeats (%)]
  • 22. Mixed haplotypes can introduce indels • Alignment with substitutions: CGTTAAAGC / CGTTAAAGC / CGTTAAAGC / CGTTTAAGC / CGTTTAAGC • Alignment with indels: CGTTTAAAGC / CGTT-AAAGC / CGTT-AAAGC / CGTTTAA-GC / CGTTTAA-GC • P(sub) = 0.01, P(ins) = 0.12, P(del) = 0.02, P(mat) = 0.85 • P(mat)^34 * P(sub)^2 = 3.983304e-07 • P(mat)^36 * P(ins)^4 = 5.967691e-07 • 3.983304e-07 < 5.967691e-07
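The two likelihoods on slide 22 can be reproduced directly from the per-base probabilities given there. The sketch below assumes a simple independent per-column error model; the helper name and its arguments are illustrative, and the event counts (34 matches plus 2 substitutions, versus 36 matches plus 4 insertions) are taken from the slide's own expressions.

```python
# Per-base event probabilities as given on the slide.
P_SUB, P_INS, P_DEL, P_MAT = 0.01, 0.12, 0.02, 0.85

def alignment_likelihood(n_match, n_sub=0, n_ins=0, n_del=0):
    """Likelihood of an alignment, treating every column as independent."""
    return (P_MAT ** n_match) * (P_SUB ** n_sub) * (P_INS ** n_ins) * (P_DEL ** n_del)

# Hypothesis 1: the haplotypes differ by a substitution.
sub_model = alignment_likelihood(n_match=34, n_sub=2)
# Hypothesis 2: the consensus is one base longer and reads carry indels.
indel_model = alignment_likelihood(n_match=36, n_ins=4)

print(f"substitution model: {sub_model:.6e}")
print(f"indel model:        {indel_model:.6e}")
```

Because indel errors are far more likely than substitutions under these parameters, the indel explanation scores higher, which is how mixing haplotypes during polishing can introduce a spurious indel into the consensus.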
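  • 23. Heterozygosity can lead to false duplications • FALCON-Unzip, Finch (P / A): Size 1.09 / 0.94 Gbp, NG50 3.0 / 0.6 Mbp, BUSCO (c) 93.9 / 82.1, BUSCO (d) 5.0 / 3.3 • FALCON-Unzip, Fish (P / A): Size 1.95 / 0.73 Gbp, NG50 2.6 / 0.02 Mbp, BUSCO (c) 94.2 / 40.6, BUSCO (d) 20.8 / 3.4 • [Figure labels: 1.2%, 1.6%]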
  • 24. Assemble the genomes • [Figure: Sire × Dam, F1 cross; parental k-mers; sire haplotype, dam haplotype, unassigned; sire assembly, dam assembly] • De novo assembly of haplotype-resolved genomes with trio binning. Koren, Rhie, et al. Nature Biotechnology (2018)
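The trio binning idea on slide 24 can be sketched as follows: collect k-mers found in only one parent, then assign each F1 read to whichever parent's private k-mers it shares more of. This is a simplified illustration and not the TrioCanu implementation; the function names and the plain majority rule are assumptions, and real data would need error-aware k-mer counting and reverse-complement handling.

```python
def kmer_set(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def haplotype_specific_kmers(sire_seqs, dam_seqs, k):
    """Return (sire_only, dam_only): k-mers private to each parent."""
    sire = set().union(*(kmer_set(s, k) for s in sire_seqs))
    dam = set().union(*(kmer_set(s, k) for s in dam_seqs))
    return sire - dam, dam - sire

def bin_read(read, sire_only, dam_only, k):
    """Assign a read to 'sire', 'dam', or 'unassigned' by private k-mer hits."""
    read_kmers = kmer_set(read, k)
    sire_hits = len(read_kmers & sire_only)
    dam_hits = len(read_kmers & dam_only)
    if sire_hits > dam_hits:
        return "sire"
    if dam_hits > sire_hits:
        return "dam"
    return "unassigned"  # homozygous region or a tie
```

Each bin is then assembled separately, with unassigned reads available to both haplotypes.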
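  • 25. Correctly resolved alleles with TrioBinning • Finch, FALCON-Unzip (P / A): Size 1.09 / 0.94 Gbp, NG50 3.0 / 0.6 Mbp, BUSCO (c) 93.9 / 82.1, BUSCO (d) 5.0 / 3.3 • Finch, TrioCanu (two haplotype assemblies): Size 1.05 / 1.06 Gbp, NG50 3.6 / 4.0 Mbp, BUSCO (c) 94.4 / 93.3, BUSCO (d) 1.4 / 1.3 • Fish, FALCON-Unzip (P / A): Size 1.95 / 0.73 Gbp, NG50 2.6 / 0.02 Mbp, BUSCO (c) 94.2 / 40.6, BUSCO (d) 20.8 / 3.4 • Fish, TrioCanu (two haplotype assemblies): Size 1.37 / 1.36 Gbp, NG50 2.6 / 2.1 Mbp, BUSCO (c) 91.6 / 92.7, BUSCO (d) 3.5 / 3.4 • [Figure labels: 1.2%, 1.6%]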
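The NG50 values in the comparison above follow the usual definition: the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the genome size. A minimal sketch, with function names chosen for this example:

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which cumulative length reaches half the
    genome size; returns 0 if the assembly covers less than half of it."""
    target = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return 0

def n50(contig_lengths):
    """N50 is NG50 computed against the assembly's own total length."""
    return ng50(contig_lengths, sum(contig_lengths))
```

Using the estimated genome size rather than the assembly size keeps the metric comparable between assemblies of different total length, which matters when one assembly carries false duplications.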
  • 26. Esperanza: A nearly perfect diploid • 125x PacBio coverage (~60x per haplotype), no Illumina polishing needed, TrioCanu haplotig NG50 70 Mbp, BUSCOs 94% • [Figure: chromosomes 1–29 and X for the dam (yak) and sire (Highland) haplotypes of Esperanza]
  • 27. Can we finally finish the human genome?
  • 28. • The human reference genome is incomplete • 368 unresolved issues, 102 gaps • Segmental duplications, satellites, rDNAs • Centromeres, telomeres, heterochromatin • These gaps contain important information • Missing reference sequence leads to analysis artifacts • Variation in these gaps is unexplored (e.g. rDNAs) • We don’t know what we don’t know… We need to finish the genome
  • 29. Our target: CHM13hTERT • N=46; XX • Cell line from Urvashi Surti, Pitt; SKY karyotype from Jennifer Gerton and Tamara Potapova, Stowers
  • 30. • Repeats are long, reads are short • “If the overlap is of sufficient length to distinguish it from being a repeat in the sequence the two sequences must be contiguous.” — Rodger Staden, 1979 What’s the problem?
  • 31. • How long are the repeats? • 7 kbp LINEs • 1 Mbp+ rDNA arrays • 1 Mbp+ centromere arrays • 10 Mbp+ heterochromatin blocks • Coverage and accuracy matter too • 1,000X of 100 bp reads at 100% accuracy? NO • 10X of 10,000,000 bp reads at 100% accuracy, YES • 100X of 100,000 bp reads at 90% accuracy, MAYBE? How long do reads need to be, for human? >50% of the genome
  • 32. • Length at the expense of throughput • Read lengths >1 Mbp possible Ultra-long nanopore sequencing Nanopore sequencing and assembly of a human genome with ultra-long reads. Jain et al. Nature Biotechnology (2018)
  • 33. • Prediction: 30x raw UL coverage == GRCh38 How much do we need? Nanopore sequencing and assembly of a human genome with ultra-long reads. Jain et al. Nature Biotechnology (2018)
  • 34. • 30x Nanopore ultra-long • Contig building • 60x PacBio • Polishing • 50x 10x Genomics • Polishing • BioNano • Structural validation We need long reads. Lots of long reads
  • 35. It pays to go deep • Nanopore UL read length distribution is long tailed • [Figure label: repeat]
  • 36. CHM13 sequencing • From May 1 – October 29, 2018 • 62 MinION/GridION flow cells • 8.9M reads, 98 Gb, 1.6 Gb / cell • N50 read length 76 kb • 44 Gb in reads >100 kb • Max read length 1.03 Mb • Assembled with Canu • Now upwards of 90+ flow cells and counting…
  • 37. The human genome, 2001 ref28 NG50 contig 0.5 Mbp
  • 38. The human genome, 2019 • CHM13 NG50 contig 75 Mbp (70x PacBio + 35x UL ONT) • Canu • [Figure: chromosomes 1–22 and X]
  • 39. The first complete assembly of a human chromosome
  • 40. A complete X chromosome ddPCR
  • 41. • Unique structural variants from PacBio • Unique k-mers confirmed by Duplex-Seq Stitching across the X centromere
  • 42. An assembly is a hypothesis
  • 43. What about the error rate? • Per-read error rates between 5–15% • Latest Nanopore > PacBio • Consensus accuracy >99.9% • After Nanopore polishing QV30 • After PacBio polishing QV40 • BAC validation • >85% of BACs at >99.8% identity • vs. 54% for prior PacBio asm • BAC analysis courtesy of Eichler lab @ UW • [Figure labels: 88.0 / 90.6 / 92.4]
  • 44. • ChrX GAGE gene locus • 19 tandemly arrayed ~9.4 kb repeats • Corrupted by mapping/polishing pipeline Repeat collapse analysis Mitchell Vollger @ UW
  • 45. Unique k-mer mapping • Mappers prefer the “best” alignment • Consensus can be of variable quality (patches) • Best mapping not always the correct mapping • Marker-based anchoring • Increase number of secondary alignments returned • Redefine mapping quality to measure single-copy k-mer agreement between read and assembly • [Figure panels: Before, After]
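One way to picture the marker-based anchoring described on slide 45: treat k-mers that occur exactly once in the assembly as markers, and score each candidate placement of a read by how many markers in that window the read also contains. The sketch below is a toy version under several assumptions (exact k-mer matching with no error tolerance, candidate start positions supplied by an upstream mapper, hypothetical function names); it is not the pipeline used for CHM13.

```python
from collections import Counter

def single_copy_kmers(assembly, k):
    """k-mers that occur exactly once in the assembly (the markers)."""
    counts = Counter(assembly[i:i + k] for i in range(len(assembly) - k + 1))
    return {km for km, count in counts.items() if count == 1}

def marker_agreement(read, assembly, start, k, markers):
    """Fraction of markers in the assembly window at `start` found in the read."""
    window = assembly[start:start + len(read)]
    window_markers = {window[i:i + k] for i in range(len(window) - k + 1)} & markers
    if not window_markers:
        return 0.0  # no markers here, so this placement cannot be confirmed
    read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return len(window_markers & read_kmers) / len(window_markers)

def best_placement(read, assembly, candidate_starts, k):
    """Pick the candidate start with the highest marker agreement."""
    markers = single_copy_kmers(assembly, k)
    return max(candidate_starts,
               key=lambda s: marker_agreement(read, assembly, s, k, markers))
```

In a tandem array, most k-mers are shared between copies, so only the few single-copy markers can distinguish the correct copy from its neighbours.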
  • 47. Centromere array validation Beth Sullivan @ Duke 1.8 Mb 0.7 Mb 0.3 Mb
  • 48. It’s time to finish the human genome
  • 49. • Almost! • Have proven it’s possible for the X chromosome • T2T assembly of all chrs within the next 2 years • Challenges • REPEATS, REPEATS, REPEATS • Heterozygosity: diploids, polyploids, metagenomes • Nanopore-only consensus quality • Targeted long-read sequencing Are we there yet?
  • 50. • github.com/nanopore-wgs-consortium/chm13 • Draft whole-genome assemblies • Nanopore ultra-long reads • 10x Genomics reads • BioNano DLS (WashU) • PacBio (SRA) • Coming soon: • Arima Genomics Hi-C • PacBio CCS • Strand-Seq All CHM13 data is openly released
  • 51. Acknowledgements • NHGRI • Sergey Koren • Arang Rhie • Jim Mullikin • Alice Young • Shelise Brooks • Valerie Maduro • Gerard Bouffard • Sofia Barreira • Andy Baxevanis • Nancy Hansen • Karen Miga, UCSC • Jennifer Gerton, Stowers • Tamara Potapova, Stowers • Beth Sullivan, Duke • Tina Graves Lindsay, WashU • Ira Hall, WashU • Valerie Schneider, NCBI • Kerstin Howe, Sanger • Jo Wood, Sanger • Matt Loose, Nottingham • Nick Loman, Birmingham • Urvashi Surti, Pitt (ret.) • Evan Eichler, Mitchell Vollger, Glennis Logsdon, David Porubsky, Melanie Sorensen
  • 52. It’s time to finish the human genome Google “t2t consortium” – I’ll be hiring in the fall The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to generate the first complete assembly of a human genome.