agbt 2016 workshop lindsay

Genome Reference Consortium
MGI Reference Genomes
Workshop
Tina Graves-Lindsay
Feb 10, 2016
The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some
regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major
challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have
allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to
represent the full range of genetic diversity in humans.
Samples to be Sequenced
The MGI Reference Genomes Improvement Project
Funded by the NIH, the MGI Reference Genomes Improvement Project
aims to increase the quality and diversity of existing scientific resources.
We will sequence and assemble at least 5 diploid genomes from
individuals selected to maximize human genetic diversity (right). All
sources have BAC libraries available and, whenever possible, we will use
samples from a trio (two parents and child). We will sequence the parents
within the trio at a lower depth of coverage to enable haplotype phasing of
the proband sequence. Other independent efforts to sequence and
assemble new reference genomes include two Japanese, one Malaysian,
a Han Chinese and an Ashkenazim trio (as part of the Genome in a Bottle
Effort).
MGI Gold Reference Genomes
• NA12878 – European
• NA19240 – Yoruban
• NA19434 – Luhya
• HG00514 – Han Chinese
• HG00733 – Puerto Rican
• HG01352 – Colombian
MGI Platinum Reference Genomes
• CHM1 – European, inferred
• CHM13 – European, inferred
[Figures: Human Phylogenetic Tree (Li, et al 2008); World Map with Sample Origins]
Sequencing Plan
Definitions of Genome Level
• Platinum Genome
  • Haploid genome source
  • Contiguous, haplotype-resolved representation of entire genome
  • BAC library available
• Gold Genome
  • Diploid genome source
  • Part of a trio
  • Parents will be sequenced to help haplotype-resolve some regions
  • BAC libraries available
  • Targeted regions sequenced using these BAC libraries
  • Will contain some haplotype-resolved regions
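The role of the parents in a Gold genome can be illustrated with a toy example. The sketch below is only an illustration of the general idea of trio-based phasing at a single site, not the project's pipeline; the function name `phase_het_site` and the allele-pair representation are assumptions made for this example.

```python
def phase_het_site(child, father, mother):
    """Assign the child's two alleles to a paternal and a maternal haplotype.

    Each argument is an unordered pair of alleles, e.g. ("A", "G").
    Returns (paternal_allele, maternal_allele), or None when the parents'
    genotypes do not single out one assignment.
    Hypothetical helper for illustration only.
    """
    a, b = child
    if a == b:
        # Homozygous site: both haplotypes carry the same allele.
        return (a, b)
    options = []
    for paternal, maternal in ((a, b), (b, a)):
        # An assignment is possible only if each parent carries the allele
        # attributed to them.
        if paternal in father and maternal in mother:
            options.append((paternal, maternal))
    # Exactly one consistent assignment means the site can be phased.
    return options[0] if len(options) == 1 else None


# Father is homozygous A, so the child's G must be maternal.
print(phase_het_site(("A", "G"), ("A", "A"), ("A", "G")))
# All three are heterozygous: the site stays unphased.
print(phase_het_site(("A", "G"), ("A", "G"), ("A", "G")))
```

Sites where child and both parents are all heterozygous stay unresolved in this sketch, which is consistent with the expectation that Gold genomes will contain only some haplotype-resolved regions.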
CHM1: A Key Resource for Improving the Reference
• CHM1 cell line established from a haploid hydatidiform mole (complete, paternal; 46,XX) (U. Surti)
• CHORI-17 BAC library (P. deJong)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)
• CHORI-17 BACs
  • >750 have been sequenced
  • 664 of them in GenBank as phase 3 sequence
• CHM1 WGS assembly
  • Initial assembly produced from >100X coverage of Illumina data
  • Initial PacBio assembly produced using ~54X of P5 PacBio data
  • Latest PacBio assembly produced using ~60X of P6 PacBio data
CHM1 P5 vs P6 read length distributions
[Figure: axes Mapped Concordance (%) and Fraction of Mapped Bases; % of Bases in Reads > 30,000 bases: 17.8 % and 0.05 %]
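The "% of Bases in Reads > 30,000 bases" figure is a simple ratio over read lengths. A minimal sketch of how such a value could be computed is shown below; the function name and the toy read lengths are assumptions for illustration and are not the CHM1 data.

```python
def percent_bases_in_long_reads(read_lengths, min_length=30_000):
    """Percentage of all sequenced bases that sit in reads longer than min_length."""
    total_bases = sum(read_lengths)
    if total_bases == 0:
        return 0.0
    long_bases = sum(n for n in read_lengths if n > min_length)
    return 100.0 * long_bases / total_bases


# Toy read lengths (hypothetical values, not taken from the CHM1 runs).
toy_reads = [5_000, 12_000, 31_000, 40_000, 12_000]
print(f"{percent_bases_in_long_reads(toy_reads):.1f} % of bases in reads > 30,000 bases")
```

Note that the metric weights each read by its length, so a small number of very long reads can account for a large share of the bases.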
CHM1 Assembly Comparisons
Metric | CHM1_2014, P5 chemistry (54X) | CHM1_2015, P6 chemistry (61X), Jason Chin | CHM1_2015, P6 chemistry (61X), Adam Phillippy
# Contigs | 26,312 | 3,641 | 4,849
Max Contig Size | 44,873,077 bp | 109,312,888 bp | 99,566,047 bp
Total Assembly Size | 3,239,081,299 bp | 2,996,426,293 bp | 2,939,630,703 bp
N50 | 4,498,608 bp | 26,899,841 bp | 20,609,304 bp
N90 | 30,687 bp | 1,686,030 bp | 1,188,604 bp
N95 | 17,815 bp | 149,494 bp | 95,419 bp
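For readers less familiar with the N50, N90 and N95 rows: Nx is the contig length at which the contigs of that length or longer together contain at least x% of the assembly. The sketch below shows the standard calculation on toy contig lengths; the function name `nx` and the values are assumptions for illustration, not the assemblies in the table.

```python
def nx(contig_lengths, x=50):
    """Return Nx: the length of the contig at which the running sum of
    contig lengths, taken from longest to shortest, first reaches x% of
    the total assembly size."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


# Toy assembly of 7 contigs, 300 bases in total (hypothetical values).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(toy_contigs, 50))
print("N90:", nx(toy_contigs, 90))
print("N95:", nx(toy_contigs, 95))
```

Because N90 and N95 look further down the sorted list, they are more sensitive than N50 to a long tail of small contigs.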
Hybrid Scaffolds – PacBio and BioNano
Assembly | Seq Assem # of Contigs | Seq Assem Contig N50 (Mb) | Seq Assem Total Size (Gb) | BN Hybrid # of Scaffolds | BN Hybrid Scaff N50 (Mb) | BN Hybrid Total Size (Gb)
CHM1 (P6) GCA_001297185, MGI CHM1 map (Jason’s version) | 3641 | 26.9 | 2.99 | 161 | 47.6 | 2.84
CHM1 (P6) GCA_001307025, MGI CHM1 map (Adam’s version) | 4850 | 20.6 | 2.94 | 221 | 40.04 | 2.82
Hybrid Scaffold
[Figure labels: PacBio Contigs, BioNano Contigs]
Using BioNano to Compare CHM1 Assemblies
Metric | CHM1 GCA_001297185 (Jason’s version) | CHM1 GCA_001307025 (Adam’s version)
Hybrid WGS Conflicts | 45 | 52
Hybrid BN Conflicts | 51 | 63
SV - Deletions | 35 | 25
SV - Insertions | 32 | 31
SV - Inversions | 7 | 12
SV - End | 126 | 190
SV - Translocation_Interchr | 332 | 529
Assembly Assessment Methods
• Assemblies will be run through the NCBI QA pipeline
  • Assessed for contiguity, annotation, and concordance with the finished BACs
  • Assembly-to-assembly alignments will be generated between each PB assembly and GRCh38
• BioNano Genome Map
  • SV calls generated from comparing the BioNano data to each of the assemblies
  • Hybrid scaffolding conflicts will also point out potential assembly errors
• Alignment of the Illumina reads back to each of the assemblies
  • Heterozygous calls are likely indicative of a collapse in the assembly (for the haploid genomes)
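The last check can be sketched in a few lines. In a haploid source such as CHM1, a cluster of heterozygous calls from reads aligned back to the assembly suggests that two similar copies were collapsed into one. The code below is a minimal illustration under stated assumptions: the input is a plain list of (contig, position, genotype) tuples, and the window size and threshold are arbitrary illustrative values, not parameters used by the project.

```python
from collections import Counter


def is_heterozygous(genotype):
    """True for genotype strings such as "0/1" or "1|2"; False for "1/1" or "./."."""
    alleles = genotype.replace("|", "/").split("/")
    return "." not in alleles and len(set(alleles)) > 1


def flag_candidate_collapses(calls, window=10_000, min_het_calls=3):
    """Return sorted (contig, window_start) pairs holding at least
    min_het_calls heterozygous calls.

    calls: iterable of (contig, position, genotype) tuples (assumed format).
    Windows are fixed-size bins; window_start is the position rounded down
    to a multiple of the window size.
    """
    counts = Counter()
    for contig, position, genotype in calls:
        if is_heterozygous(genotype):
            counts[(contig, (position // window) * window)] += 1
    return sorted(key for key, n in counts.items() if n >= min_het_calls)


# Toy calls (hypothetical, not real CHM1 data).
toy_calls = [
    ("ctg1", 1_200, "0/1"),
    ("ctg1", 3_400, "0/1"),
    ("ctg1", 9_100, "1/2"),
    ("ctg1", 25_000, "1/1"),
    ("ctg2", 500, "0/1"),
]
print(flag_candidate_collapses(toy_calls))
```

Isolated heterozygous calls may simply be sequencing or alignment noise, which is why this sketch only reports windows that reach a minimum count.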
1q21 Region – GRCh38 vs GCA_001297185
[Figure, 1 Megabase scale; tracks: GRCh38, GCA_001297185, Seg Dup Track]
1q21 Region – GRCh38 vs GCA_001297185
[Figure; tracks: GRCh38, GCA_001297185, Seg Dup Track; alignment labels: 99.9+% identity, 99.1% identity]
First Gold Genome - NA19240
• NA19240 – Yoruban sample
• Generated >70X raw PacBio data
• Assembled on DNAnexus platform using Falcon pipeline
Initial Assembly Stats
# Seq Contigs | 3569
Max Contig Length | 20,393,869 bp
Total Assembly Size | 2,745,634,789 bp
N50 | 6,003,115 bp
N90 | 848,151 bp
N95 | 345,457 bp
NA19240 BioNano Hybrid and SV Stats
Assembly | Seq Assem # of Contigs | Seq Assem Contig N50 (Mb) | Seq Assem Total Size (Gb) | BN Hybrid # of Scaffolds | BN Hybrid Scaffold N50 (Mb) | BN Hybrid Total Size (Gb) | BN Hybrid Conflicts WGS | BN Hybrid Conflicts BN
NA19240 DNAnexus | 3569 | 6.01 | 2.75 | 421 | 14.78 | 2.74 | 49 | 60
Category | Potential mis-assemblies | Breaks made
Conflicts | 28 | 22
Ends | 13 | 5
Insertions | 5 | 2
Translocations | 74 | 14
Alignment of NA19240 to BioNano map
[Figure: conflict identified by BioNano data]
Alignment to GRCh38
[Figure tracks: GRCh38, NA19240]
CCL Region of NA19240 Assembly
[Figure, 1 Megabase scale; tracks: GRCh38, Genes, Seg Dup, PB Assembly]
CCL Region with BAC alignments
[Figure, 100 Kb scale; tracks: GRCh38, BAC Alignments, Seg Dup, PB Assembly]
BACs Will Resolve These Regions!
[Figure labels: NA19240 BAC, NA19240 WGS]
Which Assembly is Best?
[Figure: HG00733 Puerto Rican Assembly Stats; axes: Contig Length N50 (Mb), Total Assembly Size (Gb)]
• Use other sources to assess multiple assemblies
  • BioNano
  • Linked long reads
Genome Status
Data Source | Origin | Level of Coverage | Status
CHM1 | NA | Platinum | Assembly Assessment
CHM13 | NA | Platinum | Assembly Assessment
NA19240 | Yoruban | Gold | Analysis Underway
HG00733 | Puerto Rican | Gold | Assembly QC
HG00514 | Han Chinese | Gold | Assembly QC
NA12878 | European | Gold | Data Generation Underway
HG01352 | Colombian | Gold | Not Started Yet
Next Steps
• Platinum Genomes
  • Select the best CHM1 and CHM13 assembly and then improve those further using BioNano and other tools
  • Incorporate the BACs into the assemblies
  • Create Chromosomal AGPs
• Gold Genomes
  • Finish analysis of the first Gold Genome
  • Data production is now complete on two other Gold genomes and assemblies for those are underway
  • Data production is underway on the 4th Gold genome
  • BACs are being sequenced for many of these genomes
Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis: Rick Wilson, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Chad Tomlinson
University of Washington: Evan Eichler
NCBI: Valerie Schneider
University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
10X Genomics: Deanna Church
BioNano Genomics: Palak Sheth, Alex Hastie
Pacific Biosciences: Jason Chin, Nick Sisneros
UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
NHGRI: Adam Phillippy, Sergey Koren
Dovetail: Todd Dickinson

Editor's Notes

  1. As part of our work as a member of the Genome Reference Consortium, we have been working to improve the current reference, GRCh38. In doing this work, we have found that a few regions of the genome are still not fully resolved. A few genes are still not optimally represented for all individuals or ancestries, although we have fixed quite a few of them in GRCh38. The reference is composed of DNA from many individuals, so there are regions where allelic diversity and structural variation present challenges in the assembly. Many of the newer technologies have allowed for improvements to the reference genome, but we realize that there is still a need for additional high-quality human genome assemblies to fully cover the range of genetic diversity in humans.
  2. At MGI we are working on a project funded by NIH to sequence additional human reference genomes. These are the samples we plan to sequence. There are currently 6 gold genomes and 2 platinum genomes planned. We will sequence a Puerto Rican sample, a Han Chinese sample, a Colombian sample and two African samples. We also plan to improve on the European NA12878 sample. These genomes will help to add diversity to the reference. I will spend the first portion of my talk telling you about one of the platinum genomes that we have been working with, CHM1, and then finish with some details about the first of the Gold genomes we are sequencing.
  3. Vince mentioned this, but I wanted to briefly touch on it again: we are generating ~60X coverage of PacBio long read data. We will do a de novo assembly of that PacBio data. Then we are using a variety of additional tools to help inform the assembly. BioNano has been very useful in helping to scaffold the assembly as well as to identify potential mis-assemblies. We are also starting to work with 10X Genomics data and Dovetail data for the same reasons. We are also targeting difficult-to-assemble regions of the genome by sequencing BACs. Once the BACs are incorporated, we plan to align all of this data to the reference very stringently to produce chromosomal AGPs. The end product will be a very high quality whole genome assembly.
  4. Here are the definitions we are using for both Platinum and Gold level assemblies. The Platinum genomes are single haplotype sources. We plan to achieve a contiguous, haplotype-resolved representation of the entire genome for these samples. Both of the genomes we have worked on so far at this level have BAC libraries, which, as I mentioned before, will be used to help resolve regions of the genome that would be difficult to assemble at the whole genome level. The Gold genomes will be diploid sources, and all will be part of a trio. We are sequencing the child to deeper coverage and doing a lighter amount of sequencing on the parents, mainly to help sort out haplotypes in specific regions. We have BAC libraries for all of these genomes as well.
  5. To resolve some of the issues I mentioned in the reference, especially the structurally variant regions that were most difficult to put together, a hydatidiform mole cell line was established. A hydatidiform mole is formed when an enucleated egg is fertilized by sperm. The cells go through several rounds of cell division, and the resulting DNA is a diploid copy of the exact same genetic material. This first sample is known as CHM1. A BAC library was created from the CHM1 source and has been used extensively in the reference to fix some of these difficult-to-assemble regions of the genome. By using a haploid source, it is much easier to put together regions where there are segmental duplications. Once we realized the utility of this source, we decided to sequence the entire genome. This was first done years ago, and at the time the only cost-effective way to sequence an entire human genome was by generating Illumina data. A reference-guided assembly was produced with this Illumina data. It wasn’t long after that that PacBio agreed to collaborate, and the initial PacBio data was produced. That was ~54X coverage using the P5 chemistry. Then, early in 2015, PacBio believed that they could do better with the most current sequencing chemistry and library protocols, so they generated the data again. That second set of data did prove to be much better than the first.
  6. Here is a comparison of the read lengths of the P5 data and the more recent P6 data. You can see that in the most recent data, over 17% of the reads were 30 kb or longer. In the original set of data, less than 1% of the reads were that length.
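The read-length comparison above comes down to the fraction of reads at or above a length cutoff. A minimal sketch of that calculation follows; the function name and the toy read lengths are illustrative assumptions, not the actual P5 or P6 data.

```python
def fraction_at_least(read_lengths, threshold=30_000):
    """Return the fraction of reads whose length (in bases) is >= threshold."""
    if not read_lengths:
        raise ValueError("read_lengths is empty")
    long_reads = sum(1 for length in read_lengths if length >= threshold)
    return long_reads / len(read_lengths)


# Hypothetical read lengths, for illustration only.
toy_lengths = [5_000, 12_000, 31_000, 45_000, 8_000, 30_000]
print(f"{fraction_at_least(toy_lengths):.1%} of the toy reads are >= 30 kb")
```

In practice the lengths would be read from the sequencing output rather than typed in by hand.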
  7. Besides the update in chemistries, there have also been improvements in the algorithms used to assemble these genomes. Here is a comparison of the P5 and P6 data as assembled by Jason Chin at PacBio. The most recent data was also assembled by Adam Phillippy’s group. In both instances you will see how the N50 has improved greatly.
  8. Because we now have multiple assemblies of the same data, we plan to use the BioNano data as a way to compare the different versions of the assemblies. Here are the stats of both Jason’s and Adam’s assemblies when run through the hybrid scaffolding pipeline. This shows the great continuity that can be achieved through hybrid scaffolding. For Jason’s assembly, the contig N50 is 26 Mb, and when it is combined with the BioNano map to create hybrid scaffolds, we can achieve a scaffold N50 close to 50 Mb. Adam’s assembly is very similar: it starts with a contig N50 of 20 Mb, and then we get a scaffold N50 of 40 Mb. This is fairly typical of what we have seen at MGI. The PacBio contig N50 nearly doubles when scaffolded with the BioNano data.
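For reference, N50 is the length of the contig (or scaffold) at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The sketch below uses toy lengths (assumed values, not the CHM1 numbers) to show how joining contigs into scaffolds raises the N50.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths is empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the running total to >= 50%.
        if running * 2 >= total:
            return length


# Toy example: scaffolding joins the two largest contigs into one scaffold.
contigs = [40, 30, 20, 10]
scaffolds = [70, 20, 10]
print(n50(contigs), n50(scaffolds))
```

Gap sequence added during scaffolding is ignored here, so real scaffold totals can differ slightly from contig totals.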
  9. Here is an example of one of the BioNano hybrid scaffolds that was generated. The top line in green represents the hybrid scaffold, the first set of blue bars represents the PacBio assembly, and the S2 lines represent the BioNano map contigs. From this you can see how these two technologies are very complementary to one another. This is a snapshot of a 10 Mb portion of a larger hybrid scaffold.
  10. We are using the hybrid scaffolding output as well as the SV calls from our BioNano comparisons to help us evaluate each version of the latest CHM1 assemblies. Just by looking at the raw numbers, it looks like Jason’s version might be better, but we still need to look through the data more. We’ve looked through a portion of the translocation calls, and some of these indicate joins that could potentially be made between two PacBio contigs. We do need to look at more of the calls, though, to come to a conclusion on which of these two assemblies is best. We also have a few other metrics we will use to make the final decision on which assembly to move forward with.
  11. Here are some of those other methods we will use to compare these assemblies. With the help of NCBI, these assemblies will be run through the NCBI QA pipeline. They will be assessed for contiguity, annotation, and concordance with any finished BAC sequences. Assembly-to-assembly alignments will also be performed between each of the PacBio assemblies and GRCh38. As I mentioned, the BioNano genome maps are being used to assess the assemblies. We have also generated Illumina data for all of our assemblies. For the haploid samples in particular, any heterozygous calls resulting from the Illumina alignments are likely indicative of a collapse in the assembly. This data will also be used to assess the potential mis-assemblies once they are identified by these other methods.
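One way to act on the heterozygous-call signal in a haploid sample is to count calls per fixed-size window and flag the dense windows for review. The sketch below is only an illustration: the window size, the minimum call count, and the input format are assumptions, not the parameters or pipeline used in this project.

```python
from collections import Counter


def flag_candidate_collapses(het_calls, window=10_000, min_calls=5):
    """Flag windows with many heterozygous calls in a haploid assembly.

    het_calls: iterable of (contig_name, 0-based position) pairs.
    Returns sorted (contig, window_start, window_end, n_calls) tuples.
    The window size and min_calls threshold are hypothetical.
    """
    counts = Counter((contig, pos // window) for contig, pos in het_calls)
    return sorted(
        (contig, idx * window, (idx + 1) * window, n)
        for (contig, idx), n in counts.items()
        if n >= min_calls
    )


# Toy calls: five fall in the first 10 kb of ctg1, two are isolated.
toy_calls = [("ctg1", p) for p in (100, 250, 900, 4_000, 9_500)]
toy_calls += [("ctg1", 25_000), ("ctg2", 3_000)]
print(flag_candidate_collapses(toy_calls))
```

Flagged windows are candidates only; as the notes say, they would still be checked against the other evidence.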
  12. Here is a view of 1q21 in GRCh38. The SRGAP2 gene family is located in this region. This is a highly conserved gene family that is located in three regions on chromosome 1. This view covers over 6 Mb of that region. Because of the degree of similarity between the duplications in this region and the other two locations of SRGAP2, this region was badly mis-assembled in GRCh37. In order to fix this region in the reference, we re-sequenced the entire region using the CHORI-17 BACs, the single haplotype source. So this region of GRCh38 is made up of clones from CHM1. In this view, you can see how Jason’s version of the PacBio CHM1 assembly aligns to the reference. You will notice that in places where there are quite a few segmental duplications, the assembly is much more fragmented.
  13. This is a zoomed-in view from the previous slide. You will notice that the larger contig in the middle aligns nearly perfectly to the CH17 BAC path, whereas the contigs in the segmentally duplicated regions do not align as well. In this large contig, the identity is over 99.9%, which is what you would expect since this is the same source as the reference, whereas in this contig, where there are known segmental duplications, the identity is not as high. Any of the red marks in the grey bars represent mismatches.
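Percent identity can be defined in more than one way. The sketch below uses one common convention, matching columns divided by all aligned columns with gaps counted as non-matches, which is an assumption and not necessarily the definition behind the numbers on the slide. The sequences are toy examples.

```python
def percent_identity(ref_aln, qry_aln):
    """Percent identity of two gapped, equal-length alignment rows.

    Gap columns ('-') count toward the denominator but never as matches.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned rows must have the same length")
    if not ref_aln:
        raise ValueError("alignment is empty")
    matches = sum(1 for r, q in zip(ref_aln, qry_aln) if r == q and r != "-")
    return 100 * matches / len(ref_aln)


# Toy alignment with a single mismatch in ten columns.
print(percent_identity("ACGTACGTAC", "ACGTACGAAC"))
```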
  14. Now I will switch gears and talk about our first gold genome sample, NA19240, a Yoruban sample. The initial assembly was done on the DNAnexus platform for both speed and ease of assembly. We had previously assembled smaller PacBio genomes, but nothing of this size and with this amount of data. All other human assemblies we have worked with have been assembled by the experts. We were not sure we could get an assembly of this size to finish in a timely manner. On the DNAnexus platform, the assembly and Quiver steps finished in less than 2 weeks. We have since been able to assemble this data on our own cluster, but quite a few modifications were needed to make everything work correctly. Here are the initial assembly stats for this sample. The contig N50 is 6 Mb. It is not as contiguous as the haploid samples, but we think this is still pretty good considering this is a diploid sample.
  15. As I mentioned earlier, we have been using the BioNano data as a way to assess our assembly, both by doing the hybrid scaffolding and by calling SVs. Here are those results for this sample. In this case, the hybrid scaffolding increased the scaffold N50 to almost 15 Mb. As part of our QC process for this assembly, we evaluated all of the conflicts found during the hybrid scaffolding process, as well as some of the SV calls. After evaluating all of this data, we narrowed the list down to the calls that seemed most likely to indicate an assembly issue. We then used the alignment of the Illumina data to help pinpoint where the contigs needed to be broken. From all of this, we were able to successfully make sequence breaks in a little over 40 regions.
  16. Here is an example of one of the regions that was corrected as a result of this QC process. The sequence contig is on top. The region in the brackets was identified as a conflict during hybrid scaffolding. From this, we were able to identify the mis-assembly in the PacBio contig; it was broken and the region was flipped. When comparing the corrected assembly back to the BioNano map, the maps now align more consistently.
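Breaking a contig at two points and flipping the region between them amounts to reverse-complementing that segment in place. A minimal sketch, assuming the two breakpoints are already known as 0-based, end-exclusive coordinates (here they are made-up values on a toy sequence):

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_segment(contig, start, end):
    """Reverse-complement contig[start:end] and leave the flanks unchanged."""
    if not 0 <= start < end <= len(contig):
        raise ValueError("breakpoints must satisfy 0 <= start < end <= len(contig)")
    return contig[:start] + reverse_complement(contig[start:end]) + contig[end:]


# Toy contig with a hypothetical inverted segment at positions 4..7.
original = "AAAACCGTTTT"
corrected = flip_segment(original, 4, 7)
print(corrected)
```

Applying the same flip twice restores the original sequence, which is a quick sanity check on the coordinates.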
  17. At the time we were making the initial breaks in the assembly, we didn’t have the alignments of the NA19240 assembly to the reference, but we do have them now. This is that same region of the PacBio assembly, the original version, aligned to GRCh38. The top panel represents the reference and the bottom panel is the original PacBio contig. All three blue blocks represent portions of the same PacBio contig, and the arrows indicate the direction of the alignment. The reference alignment confirms what we had found with the BioNano data: the middle portion of this PacBio contig needed to be flipped.
  18. Here is a diagram of the alignment of the NA19240 assembly to the reference through the CCL region that Deanna mentioned in her talk. We know this region to be structurally variant. Here you can see that our PacBio assembly is very contiguous in the areas where there are few duplications, but in the segmentally duplicated regions it is fragmented.
  19. This slide is a zoomed-in view of one of those segmentally duplicated regions. As I mentioned earlier, we are sequencing targeted BAC clones in some of these known structurally variant regions. In this slide, you can see how those initial clone assemblies align. For the targeted BACs, we initially sequence all of the clones with Illumina, and from that initial data we select a clone path and improve those clones. In this view, we have 2 clones aligned. This line represents a contig from one of those clones and how it aligns through this region. This is the same contig, and it appears to align very similarly to all three of these regions. In this case, we will need to finish the clone to understand the correct alignment through here. We have just begun to pick the path of BACs that will be needed to resolve these regions.
  20. This is another region that we have targeted with BACs. The NBPF genes are located in many places along chromosome 1, and most of those regions contain segmental duplications as well. Here is an alignment of one of those genes. The gene alignment is seen here in green and the PacBio assembly alignment is in gray. The bottom portion in darker gray indicates segmental duplication. The pink areas of these gray bars indicate mismatches, so you can see that through this region there are quite a few mismatches in our assembly. This gray bar shows the alignment of one clone. This BAC is nearly contiguous with just the Illumina data. So in this case, it is easy to identify a clone to resolve this region of the assembly.
  21. For our next genome, HG00733, the Puerto Rican sample, we have completed a variety of assemblies, each assembled with different parameters. Here are just a few of the assemblies plotted by contig N50 and total assembly size. From these metrics as well as a few others, we need to decide which assembly is the best one. Is it better to have a more contiguous assembly? Is it better to have as much of the total genome assembled?
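The two questions above describe a trade-off between two metrics. One simple way to narrow the field, offered here only as an illustration and not as the selection method used in this project, is to drop any assembly that another assembly matches or beats on both contig N50 and total size. The assembly names and values below are hypothetical.

```python
def non_dominated(assemblies):
    """Return names of assemblies not beaten on both metrics by another one.

    assemblies: dict mapping name -> (contig_n50, total_size), both in bases.
    """
    keep = []
    for name, (n50, size) in assemblies.items():
        dominated = any(
            o_n50 >= n50 and o_size >= size and (o_n50 > n50 or o_size > size)
            for other, (o_n50, o_size) in assemblies.items()
            if other != name
        )
        if not dominated:
            keep.append(name)
    return sorted(keep)


# Hypothetical candidate assemblies, not the HG00733 results.
candidates = {
    "asm_a": (5.0e6, 2.80e9),
    "asm_b": (6.5e6, 2.75e9),
    "asm_c": (4.8e6, 2.78e9),
    "asm_d": (6.0e6, 2.74e9),
}
print(non_dominated(candidates))
```

This leaves the real question open: choosing between the remaining candidates still depends on the other metrics mentioned in the talk.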
  22. I want to acknowledge all of the collaborators on this project and all of the work that has gone into it thus far.