Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing

Slides from platform presentation at #ASHG15

Published in: Data & Analytics

  1. Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing. Karyn Meltz Steinberg, ASHG 2015, @KMS_Meltzy
  2. Last year… (Steinberg et al, 2014). [Figure 1, bar chart: Contig Number and Contig N50 for CHM1_1.1, HuRef, ALLPATHS and YH_2.0; axis from 0 to 400,000]
  3. This year… [Bar chart: Contig Number and Contig N50 for CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0; axis from 0 to 30,000,000]
  4. This year… [Same bar chart on a log scale: Contig Number and Contig N50 for CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0; axis from 1 to 100,000,000]
  5. We combine PacBio with other technologies to construct the assembly
  6. How do we define platinum and gold standards?

     | Metric | GRCh38 | Platinum (CHM1) | Gold (NA19240) |
     |---|---|---|---|
     | % Reference genome covered | 100 | 98.40 | 90.80 |
     | % Assigned chromosomes | 99.60 | 98.40 | 90.80 |
     | % Gene models covered (>95% id, >90% length) | 99.96 | 98.78 | 94.26 |
     | Contig N50 | 67.8 Mb | 26.9 Mb | 6.0 Mb |
     | Number of gaps | 875 | 3,640 | 3,568 |
     | Total assembled size | 3.067 Gb | 2.996 Gb | 2.745 Gb |
     | % Haplotype blocks (>1kb) resolved | NA | >95 | >80 |

     http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
  7. CHM13 Draft Assembly (GCA_000983455.1)
     • 60X PacBio (P5 and P6 chemistry)
     • Average read length ~11kb
     • Daligner/Falcon v 0.2

     | Metric | Value |
     |---|---|
     | Total sequence length | 2,851,367,788 |
     | Number of contigs | 2,873 |
     | Contig N50 | 12,981,785 |
     | Contig L50 | 68 |
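The Contig N50 and Contig L50 values on this slide are standard assembly summary statistics: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases, and L50 is the number of contigs in that set. A minimal Python sketch of that calculation follows. The function name `n50_l50` and the toy contig lengths are illustrative assumptions, not taken from the slides or from any particular assembly toolkit.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths in base pairs."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)   # longest contigs first
    half = sum(ordered) / 2                   # half of the assembled bases
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the N50 contig; 'count' contigs were needed (L50)
            return length, count


# Toy lengths (hypothetical, not the CHM13 contigs): total is 300, half is 150.
toy = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy))  # (70, 2)
```

Applied to the real contig lengths of an assembly (for example, read from a FASTA index), the same function would be expected to reproduce figures of the kind reported on the slide.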
  8. Gene Model (RefSeq) Analysis

     | | GRCh38 | CHM1_1.1 | CHM1_PB1 | CHM1_PB2 | CHM13 |
     |---|---|---|---|---|---|
     | Number of sequences not aligning | 21 | 88 | 67 | 67 | 125 |
     | Split transcripts | 8 | 35 | 1,245 | 1,131 | 285 |
     | CDS coverage <95% | 17 | 266 | 1,339 | 1,212 | 265 |

     Total sequences retrieved from Entrez: 49,680
  9. Short read sequence analysis
     • 100X Illumina sequence
     • Align with BWA-MEM to ordered and oriented assembly
     • Variant calling via SpeedSeq (Chiang et al, 2015)
     • SNVs, indels: FreeBayes
     • SVs: LUMPY, SVTyper
     • CNV: CNVnator
  10. CHM13 Illumina data aligned to CHM13 assembly
      202,016 SNVs/indels on unplaced scaffolds

      | SV_TYPES | >10kb | 5-10kb | 1-5kb | <1kb |
      |---|---|---|---|---|
      | DELETIONS | 174 | 131 | 430 | 2582 |
      | INVERSIONS | 5 | 0 | 2 | 7 |
      | DUPLICATIONS | 151 | 112 | 309 | 113 |
      | TOTAL | 330 | 243 | 741 | 2702 |
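The table on this slide groups structural variant calls by type and by size class. As a rough illustration of how such a table could be tallied, the Python sketch below bins (type, length) pairs into the four size classes named on the slide. The bin boundaries (whether exactly 1 kb, 5 kb or 10 kb falls in the lower or upper class), the helper names `size_bin` and `tabulate`, and the toy calls are all assumptions; the slides do not say how the original counts were produced.

```python
from collections import Counter

# Size classes in the column order used on the slide.
BINS = (">10kb", "5-10kb", "1-5kb", "<1kb")


def size_bin(length_bp):
    """Assign an SV length to a size class (boundary handling is an assumption)."""
    if length_bp > 10_000:
        return ">10kb"
    if length_bp >= 5_000:
        return "5-10kb"
    if length_bp >= 1_000:
        return "1-5kb"
    return "<1kb"


def tabulate(calls):
    """Count calls per (type, size class) and add a TOTAL row per column."""
    counts = Counter((sv_type, size_bin(abs(length))) for sv_type, length in calls)
    types = sorted({sv_type for sv_type, _ in calls})
    table = {t: [counts[(t, b)] for b in BINS] for t in types}
    table["TOTAL"] = [sum(table[t][i] for t in types) for i in range(len(BINS))]
    return table


# Hypothetical calls: (type, signed length in bp); deletions use negative lengths.
toy_calls = [("DEL", -12000), ("DEL", -800), ("DUP", 6000), ("INV", 1500), ("DEL", -5000)]
for row, values in tabulate(toy_calls).items():
    print(row, values)
```

In practice the (type, length) pairs would come from the caller's output, for instance the type and length annotations of each structural variant record.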
  11. BioNano SV calls can be used to identify misassembly
      [Figure labels: Collapse; Expansion in Assembly; Gap in Sequence; tracks: PacBio Assembly, BioNano Map]

      BioNano alignment to CHM13

      | SV_TYPES | Count |
      |---|---|
      | DELETIONS | 41 |
      | INVERSIONS | 10 |
      | INSERTIONS | 15 |
      | TOTAL | 66 |
  12. BioNano reveals collapse in PacBio assembly [Figure tracks: PacBio Assembly, BioNano Map]
  13. Illumina data aligned to PacBio assembly also shows collapse
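  14. BioNano reveals collapse in PacBio assembly due to highly homologous segmental duplications
      SD = 96%

      | Object | Start | End | Part | Type | Component | Length |
      |---|---|---|---|---|---|---|
      | CHR1 | 46746040 | 46857004 | 40 | W | LBHZ01000938.1 | 110965 |
      | CHR1 | 46857005 | 47034202 | 41 | N | gap | 177198 |
      | CHR1 | 47034203 | 52157695 | 42 | W | LBHZ01000245.1 | 5123493 |

      [Figure tracks: PacBio Assembly, BioNano Map]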
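The three rows on this slide place two PacBio contigs on CHR1 with a gap between them. A quick consistency check is to confirm that each row's coordinates span exactly the stated number of bases and that consecutive rows abut. The Python sketch below does this for the rows exactly as they appear in the transcript. It assumes that the final number on each row is a length in base pairs and that type `N` marks a gap row; the slide shows a trimmed layout, so this reading of the columns is an interpretation rather than something the slides state. The names `parse` and `check` are illustrative.

```python
# Rows copied from the slide transcript; the column meanings are an assumption.
ROWS = """\
CHR1 46746040 46857004 40 W LBHZ01000938.1 110965
CHR1 46857005 47034202 41 N 177198 gap
CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493
"""


def parse(text):
    """Split each row into a dict; 'N' rows are gaps, other rows are contigs."""
    records = []
    for line in text.splitlines():
        f = line.split()
        if not f:
            continue
        rec = {"object": f[0], "start": int(f[1]), "end": int(f[2]),
               "part": int(f[3]), "type": f[4]}
        if rec["type"] == "N":
            rec["name"], rec["length"] = "gap", int(f[5])   # length, then the word "gap"
        else:
            rec["name"], rec["length"] = f[5], int(f[6])    # contig ID, then length
        records.append(rec)
    return records


def check(records):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    for prev, rec in zip([None] + records, records):
        span = rec["end"] - rec["start"] + 1                # coordinates are inclusive
        if span != rec["length"]:
            problems.append(f"part {rec['part']}: span {span} != length {rec['length']}")
        if prev is not None and rec["start"] != prev["end"] + 1:
            problems.append(f"part {rec['part']}: does not abut part {prev['part']}")
    return problems


records = parse(ROWS)
print(check(records))  # []
print(sum(r["length"] for r in records if r["type"] == "N"))  # 177198
```

Under those assumptions the rows are internally consistent: the 177,198 bp gap sits directly between contigs LBHZ01000938.1 and LBHZ01000245.1.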
  15. This region is rich in medically relevant genes
      [Browser view of chr1 (p33); ideogram labels: p31.1, 1q12, q41, 43, 44; tracks: SegDups, Genes, CHM13 PacBio]
      Genes: CYP4Z2P, CYP4A11, CYP4X1, CYP4Z1, CYP4A22
      CHM13 PacBio: LBHZ010000938.1, LBHZ010000938.1, LBHZ010000245.1
  16. CHM13 Hybrid Scaffold [Figure tracks: Hybrid Scaffold, PacBio Contigs, BioNano Contigs]
  17. CHM13 Hybrid Scaffolds

      | | BioNano Map | PacBio Assembly | Hybrid Scaffold |
      |---|---|---|---|
      | # of Contigs | 3593 | 1590 * | 254 |
      | Min Contig Length | 0.08 Mb | 0 | 0.27 Mb |
      | Median Contig Length | 0.61 Mb | 0.06 Mb | 4.35 Mb |
      | Mean Contig Length | 0.78 Mb | 1.78 Mb | 9.68 Mb |
      | Contig N50 | 1.02 Mb | 13.46 Mb | 20.79 Mb |
      | Max Contig Length | 5.27 Mb | 63.15 Mb | 82.83 Mb |
      | Total Contig Length | 2.812 Gb | 2.824 Gb | 2.458 Gb |

      * Number of contigs used in hybrid scaffolding
  18. Combining CHM1 and CHM13 [Diagram labels: reference, mapping, CHM1, CHM13, Pipeline analysis, Variant Evaluation, 97, 13]
  19. Acknowledgements
      • The McDonnell Genome Institute at Washington University in St. Louis: Rick Wilson, Bob Fulton, Wes Warren, Tina Graves-Lindsay, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Aye Wollam, The Finishing and Bioinformatics Teams at The Genome Institute
      • University of Washington: Evan Eichler, John Huddleston, Archana Raja
      • NCBI: Valerie Schneider
      • University of Pittsburgh School of Medicine (CHM13 cell line): Urvashi Surti
      • Personalis: Deanna Church
      • BioNano Genomics: Palak Sheth
      • Pacific Biosciences: Jason Chin, Nick Sisneros
