Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing
Karyn Meltz Steinberg
ASHG 2015
@KMS_Meltzy
Last year…
[Figure 1 (Steinberg et al, 2014): Contig Number and Contig N50 for CHM1_1.1, HuRef, ALLPATHS and YH_2.0]
This year…
[Figure: Contig Number and Contig N50 for CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0]
This year… (log scale)
[Figure: Contig Number and Contig N50 on a log scale for CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0]
We combine PacBio with other technologies to construct the assembly
How do we define platinum and gold standards?
                                               GRCh38     Platinum (CHM1)   Gold (NA19240)
% Reference genome covered                     100        98.40             90.80
% Assigned chromosomes                         99.60      98.40             90.80
% gene models covered (>95% id, >90% length)   99.96      98.78             94.26
Contig N50                                     67.8 Mb    26.9 Mb           6.0 Mb
Number of gaps                                 875        3,640             3,568
Total Assembled size                           3.067 Gb   2.996 Gb          2.745 Gb
% haplotype blocks (>1kb) resolved             NA         >95               >80
http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
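
The gene-model row above applies two cutoffs (>95% identity, >90% length). A minimal sketch of how such a percentage could be computed follows. It assumes each gene model comes with an alignment identity and an aligned length fraction. The function names and the toy input are hypothetical, and the slide does not say how the values were actually measured.

```python
def gene_model_covered(identity, aligned_fraction):
    """Apply the slide's cutoffs: >95% identity and >90% of the length aligned."""
    return identity > 0.95 and aligned_fraction > 0.90


def percent_covered(alignments):
    """Percentage of (identity, aligned_fraction) pairs that pass both cutoffs."""
    if not alignments:
        raise ValueError("need at least one gene model")
    passed = sum(1 for ident, frac in alignments if gene_model_covered(ident, frac))
    return 100.0 * passed / len(alignments)


# Toy input (hypothetical values, not taken from any assembly)
toy = [(0.99, 0.98), (0.96, 0.85), (0.94, 1.0), (0.999, 0.91)]
print(percent_covered(toy))  # 50.0
```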
CHM13 Draft Assembly (GCA_000983455.1)
•  60X PacBio (P5 and P6 chemistry)
•  Average read length ~11kb
•  Daligner/Falcon v 0.2
Total sequence length   2,851,367,788 bp
Number of contigs       2,873
Contig N50              12,981,785 bp
Contig L50              68
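
For readers less familiar with the N50 and L50 statistics quoted above, here is a minimal sketch of the usual definition: sort contigs from longest to shortest and walk down until half of the total length is reached. The function name and the toy lengths are illustrative assumptions; this is not the tool used to produce the slide's numbers.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths in bp.

    N50 is the length of the contig at which the running total, taken
    from longest to shortest, first reaches half of the assembly size.
    L50 is the number of contigs needed to get there.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Toy lengths (hypothetical, not the CHM13 contigs)
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # (70, 2)
```

Read this way, the CHM13 draft values say that the 68 longest contigs, each at least 12,981,785 bp, together hold half of the assembled sequence.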
Gene Model (RefSeq) Analysis
                                   GRCh38   CHM1_1.1   CHM1_PB1   CHM1_PB2   CHM13
Number of sequences not aligning   21       88         67         67         125
Split Transcripts                  8        35         1,245      1,131      285
CDS coverage <95%                  17       266        1,339      1,212      265
Total Sequences Retrieved from Entrez: 49,680
Short read sequence analysis
•  100X Illumina sequence
•  Align with BWA-MEM to ordered and oriented assembly
•  Variant calling via SpeedSeq (Chiang et al, 2015)
   •  SNVs, indels: FreeBayes
   •  SVs: LUMPY, SVTyper
   •  CNV: CNVnator
CHM13 Illumina data aligned to CHM13 assembly
202,016 SNVs/indels on unplaced scaffolds
SV_TYPES       >10kb   5-10kb   1-5kb   <1kb
DELETIONS      174     131      430     2582
INVERSIONS     5       0        2       7
DUPLICATIONS   151     112      309     113
TOTAL          330     243      741     2702
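
A table like the one above can be produced by binning each SV call by type and length. The sketch below is one way to do that, under stated assumptions: 1 kb is taken as 1,000 bp, bin edges are half-open (a 5,000 bp call falls in 5-10kb), and calls arrive as hypothetical `(sv_type, length_bp)` pairs. The slide does not say how edge cases were handled.

```python
from collections import Counter

# (label, inclusive lower bound, exclusive upper bound); edges are an assumption
SIZE_BINS = [
    ("<1kb", 0, 1_000),
    ("1-5kb", 1_000, 5_000),
    ("5-10kb", 5_000, 10_000),
    (">10kb", 10_000, None),
]


def size_bin(length_bp):
    """Map a non-negative SV length in bp to one of the table's size classes."""
    for label, low, high in SIZE_BINS:
        if length_bp >= low and (high is None or length_bp < high):
            return label
    raise ValueError(f"negative length: {length_bp}")


def tabulate(calls):
    """Count calls per (sv_type, size class).

    abs() is applied because VCF-style output commonly reports
    deletion lengths as negative numbers.
    """
    counts = Counter()
    for sv_type, length_bp in calls:
        counts[(sv_type, size_bin(abs(length_bp)))] += 1
    return counts


# Toy calls (hypothetical, not from the CHM13 call set)
toy_calls = [("DEL", -350), ("DEL", 1200), ("DUP", 7500), ("INV", 10000), ("DEL", 999)]
for key, n in sorted(tabulate(toy_calls).items()):
    print(key, n)
```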
  
BioNano SV calls can be used to identify misassembly
[Figure: BioNano Map aligned to PacBio Assembly; panel labels: Collapse, Expansion in Assembly, Gap in Sequence]
BioNano alignment to CHM13
SV_TYPES
DELETIONS    41
INVERSIONS   10
INSERTIONS   15
TOTAL        66
BioNano reveals collapse in PacBio assembly
[Figure: PacBio Assembly aligned to BioNano Map]
Illumina data aligned to PacBio assembly also shows collapse
BioNano reveals collapse in PacBio assembly due to highly homologous segmental duplications
SD = 96%
CHR1   46746040   46857004   40   W   LBHZ01000938.1   110965
CHR1   46857005   47034202   41   N   177198           gap
CHR1   47034203   52157695   42   W   LBHZ01000245.1   5123493
[Figure: PacBio Assembly aligned to BioNano Map]
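
The three CHR1 rows above look like abbreviated AGP records: two sequence components (W) on either side of a gap (N). A minimal sketch of a consistency check on those rows follows. The column interpretation is an assumption based on how the rows appear on the slide (the last number of a W row and the number after N are treated as lengths), and `parse_row` and `check_rows` are hypothetical helper names.

```python
# Rows copied from the slide; the columns are abbreviated relative to a full AGP file
SLIDE_ROWS = """\
CHR1 46746040 46857004 40 W LBHZ01000938.1 110965
CHR1 46857005 47034202 41 N 177198 gap
CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493
"""


def parse_row(line):
    """Split one row into named fields (interpretation assumed, see lead-in)."""
    f = line.split()
    kind = f[4]
    stated = int(f[6]) if kind == "W" else int(f[5])
    return {
        "object": f[0],
        "start": int(f[1]),
        "end": int(f[2]),
        "part": int(f[3]),
        "type": kind,
        "stated_length": stated,
    }


def check_rows(rows):
    """Return a list of problems: wrong span lengths or non-adjacent parts."""
    problems = []
    previous_end = None
    for row in rows:
        span = row["end"] - row["start"] + 1  # coordinates are 1-based, inclusive
        if span != row["stated_length"]:
            problems.append(f"part {row['part']}: span {span} != {row['stated_length']}")
        if previous_end is not None and row["start"] != previous_end + 1:
            problems.append(f"part {row['part']}: not adjacent to previous part")
        previous_end = row["end"]
    return problems


rows = [parse_row(line) for line in SLIDE_ROWS.strip().splitlines()]
print(check_rows(rows))  # []
```

Under this reading, the rows are internally consistent: the 177,198 bp gap sits directly between the two PacBio contigs.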
This region is rich in medically relevant genes
[Figure: chr1 (p33) ideogram with SegDups and Genes tracks (CYP4Z2P, CYP4A11, CYP4X1, CYP4Z1, CYP4A22) and CHM13 PacBio contigs LBHZ010000938.1 and LBHZ010000245.1]
CHM13 Hybrid Scaffold
[Figure: Hybrid Scaffold built from PacBio Contigs and BioNano Contigs]
CHM13 Hybrid Scaffolds
                       BioNano Map   PacBio Assembly   Hybrid Scaffold
# of Contigs           3593          1590 *            254
Min Contig Length      0.08 Mb       0                 0.27 Mb
Median Contig Length   0.61 Mb       0.06 Mb           4.35 Mb
Mean Contig Length     0.78 Mb       1.78 Mb           9.68 Mb
Contig N50             1.02 Mb       13.46 Mb          20.79 Mb
Max Contig Length      5.27 Mb       63.15 Mb          82.83 Mb
Total Contig Length    2.812 Gb      2.824 Gb          2.458 Gb
*Number of contigs used in hybrid scaffolding
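
As a quick sanity check on the hybrid scaffold statistics, the mean contig length in each column should equal the total contig length divided by the number of contigs. The sketch below redoes that arithmetic with the values from the slide. It assumes 1 Gb = 1,000 Mb, and the helper name is hypothetical.

```python
# Values copied from the slide: (number of contigs, total length in Gb, stated mean in Mb)
TABLE = {
    "BioNano Map": (3593, 2.812, 0.78),
    "PacBio Assembly": (1590, 2.824, 1.78),
    "Hybrid Scaffold": (254, 2.458, 9.68),
}


def mean_contig_mb(n_contigs, total_gb):
    """Mean contig length in Mb, assuming 1 Gb = 1,000 Mb."""
    return total_gb * 1000 / n_contigs


for name, (n_contigs, total_gb, stated_mb) in TABLE.items():
    mean = mean_contig_mb(n_contigs, total_gb)
    print(f"{name}: computed {mean:.2f} Mb, slide says {stated_mb} Mb")
```

All three computed means round to the stated values, including the PacBio column, where the count is the number of contigs used in hybrid scaffolding.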
Combining CHM1 and CHM13
[Figure: reference mapping of CHM1 and CHM13, followed by Pipeline analysis and Variant Evaluation; values shown: 97, 13]
Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis: Rick Wilson, Bob Fulton, Wes Warren, Tina Graves-Lindsay, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Aye Wollam, The Finishing and Bioinformatics Teams at The Genome Institute
University of Washington: Evan Eichler, John Huddleston, Archana Raja
NCBI: Valerie Schneider
University of Pittsburgh School of Medicine (CHM13 cell line): Urvashi Surti
Personalis: Deanna Church
BioNano Genomics: Palak Sheth
Pacific Biosciences: Jason Chin, Nick Sisneros