2. The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some
regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major
challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have
allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to
represent the full range of genetic diversity in humans
6. Definitions of Genome Level
• Platinum Genome
• Haploid genome source
• Contiguous, haplotype-resolved representation of entire genome
• BAC library available
• Gold Genome
• Diploid genome source
• Part of a trio
• Parents will be sequenced to help haplotype-resolve some regions
• BAC libraries available
• Targeted regions sequenced using these BAC libraries
• Will contain some haplotype resolved regions
7. CHM1: A Key Resource for Improving the Reference
• CHM1 cell line established from a haploid hydatidiform
mole (complete, paternal; 46XX) (U. Surti)
• CHORI-17 BAC library (P. de Jong)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)
• CHORI-17 BACs
• >750 have been sequenced
• 664 of them in GenBank as phase 3 sequence
• CHM1 WGS assembly
• Initial assembly produced from >100X coverage of Illumina data
• Initial PacBio assembly produced using ~54X of P5 PacBio data
• Latest PacBio assembly produced using ~60X of P6 PacBio data
8. Assembly Assessment Methods
• Assemblies will be run through the NCBI QA pipeline
• Assessed for contiguity, annotation, and concordance with
finished BAC sequences
• Assembly-to-assembly alignments will be generated between each PB
assembly and GRCh38
• BioNano Genome Map
• SV calls generated from comparing the BioNano data to each of the
assemblies
• Hybrid scaffolding conflicts will also point out potential assembly
issues
• Alignment of the Illumina reads back to each of the
assemblies
• Heterozygous calls are likely indicative of a collapse in the
assembly (for the haploid genomes)
12. 1q21 Region – GRCh38 vs GCA_001297185
Seg Dup Track
13. 1q21 Region - GRCh38 vs GCA_001297185
Seg Dup Track
14. CHM1 – Next Steps
• Move forward with improving GCA_001297185
• Based on alignment of BioNano data as well as
comparisons to GRCh38, make additional breaks where
• Incorporate all finished BACs
• Final alignment to GRCh38 in order to produce
chromosome AGPs and submit
15. First Gold Genome - NA19240
Initial Assembly Stats
Number of sequence contigs   3,569
Max contig length            20,393,869 bp
Total assembly size          2,745,634,789 bp
N50                          6,003,115 bp
N90                          848,151 bp
N95                          345,457 bp
• NA19240 – Yoruban sample
• Generated >70X raw PacBio data
18. Which Assembly is Best?
[Figure: HG00733 Puerto Rican assembly stats; candidate assemblies plotted by contig N50 and total assembly size (Gb), with the size axis spanning 2.810 to 2.850]
• Use other sources to assess multiple assemblies
• Long linked reads
19. Genome Status
Data Source   Origin         Level      Status
CHM1          NA             Platinum   Assembly Improvement
CHM13         NA             Platinum   Assembly Assessment
NA19240       Yoruban        Gold       Paper in Review
HG00733       Puerto Rican   Gold       Assembly Assessment
HG00514       Han Chinese    Gold       Assembly Assessment
NA12878       European       Gold       Assembly Assessment
HG01352       Colombian      Gold       Assembly Assessment
HG02818       Gambian        Gold       Data Generation Completed
                             Gold       Data Generation Completed
NA19434       Luhya          Gold       Data Generation
The McDonnell Genome Institute at
Washington University in St. Louis
Karyn Meltz Steinberg
University of Washington
University of Pittsburgh
School of Medicine
(CHM1 and CHM13 cell line)
I want to thank the organizers for asking me to speak here today to tell you about creating reference-grade human genome assemblies.
As part of our work as a member of the Genome Reference Consortium, we have been working to improve the current reference, GRCh38. In doing this work, we have found that there are still a few regions of the genome that are not fully resolved. There are also still a few genes that are not optimally represented for all individuals or ancestries, although we have fixed quite a few of them in GRCh38. The reference is composed of DNA from many individuals, so there are regions where allelic diversity and structural variation present challenges in the assembly. Many of the newer technologies have allowed for improvements to the reference genome, but we realize that there is still a need for additional high-quality human genome assemblies to fully cover the range of genetic diversity in humans.
This is a great example of how allelic differences can cause assembly problems. The gene UGT2B17 is known to be copy number variant: some individuals have one copy of this gene and other individuals lack it altogether. In Build 36, the clones in this region were all from the RP11 sample, which happened to be heterozygous for this indel. The blue, red, and black boxes represent the clone path through the region, the yellow blocks indicate annotated segmental duplication, and there were two genes annotated in this region. In Build 36 we were representing both the insertion and deletion alleles in the assembly.
By removing the black clones from the path, we were able to close the gap. We then created an alternate allele from the black clones, which required sequencing one additional clone, so that we ended up with both the insertion and deletion alleles in GRCh37.
This changes our understanding of the biology of this region. We closed the gap that existed and removed the falsely annotated duplication, and there is really only one copy of this gene present in the assembly, with an allelic variant.
This example shows how multiple haplotypes in the assembly can cause problems.
As part of the GRC, we have been focused on fixing the current reference as well as adding additional alleles where we can. More recently, we have been working on a project funded by NIH to sequence additional human reference genomes. These are the samples we have been working on. Originally we planned to sequence 5 gold genomes and 2 platinum genomes. Currently we believe we should be able to complete at least 7 gold genomes. These genomes will help to add diversity to the reference. I will spend the first portion of my talk telling you about one of the platinum genomes that we have been working with, CHM1, and then finish with some details about the Gold genomes we are sequencing.
As part of this project, we are generating ~60X coverage of PacBio long-read data. We will do a de novo assembly of that PacBio data. Then we are using a variety of additional tools to help inform the assembly. BioNano has been very useful in helping to scaffold the assembly as well as to identify potential mis-assemblies. We are also starting to work with 10x Genomics data as well as Dovetail data for the same reasons. We are also targeting difficult-to-assemble regions of the genome by sequencing BACs. Once the BACs are incorporated, we plan to align all of this data to the reference very stringently to produce chromosomal AGPs. The end product will be a very high-quality whole genome assembly.
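As a rough illustration of what a figure like ~60X means, average coverage can be estimated as the total number of sequenced bases divided by the genome size. The sketch below is only that arithmetic; the 3.1 Gb genome size and the function names are assumptions made for illustration and are not taken from the project.

```python
# Assumed, approximate human genome size in bases (illustrative only).
ASSUMED_GENOME_SIZE = 3_100_000_000


def fold_coverage(read_lengths, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def bases_needed(target_coverage, genome_size):
    """Number of raw bases required to reach a target average depth."""
    return target_coverage * genome_size


if __name__ == "__main__":
    # Raw bases needed for 60X under the assumed genome size.
    print(bases_needed(60, ASSUMED_GENOME_SIZE))
```

Under that assumed genome size, 60X corresponds to roughly 186 Gb of raw sequence.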
Here are the definitions we are using for both Platinum and Gold level assemblies. The Platinum genomes are single-haplotype sources. We plan to achieve a contiguous, haplotype-resolved representation of the entire genome for these samples. Both of the genomes we have worked on so far at this level have BAC libraries, which, as I mentioned before, will be used to help resolve regions of the genome that would be difficult to assemble at the whole genome level. The Gold genomes will be diploid sources, and all will be part of a trio. We are sequencing the child to deeper coverage and doing a lighter amount of sequencing on the parents, mainly to help sort out haplotypes in specific regions. We have BAC libraries for all of these genomes as well.
To resolve some of the issues that I mentioned in the reference, especially the structurally variant regions that were most difficult to put together, a hydatidiform mole cell line was established. A hydatidiform mole is formed when an enucleated egg is fertilized by sperm. The cells go through several rounds of cell division and the resulting DNA is a diploid copy of the exact same genetic material, so the genome is effectively haploid. This first sample is known as CHM1. A BAC library was created from the CHM1 source and has been used extensively in the reference to fix some of these difficult-to-assemble regions of the genome. By using a haploid source it is much easier to put together regions where there are segmental duplications. Once we realized the utility of this source, we decided to sequence the entire genome. This was first done years ago, and at the time the only cost-effective way to sequence an entire human genome was by generating Illumina data. A reference-guided assembly was produced with this Illumina data. It wasn’t long after that that PacBio agreed to collaborate and the initial PacBio data was produced. That was ~54X coverage using the P5 chemistry. Then, early in 2015, PacBio believed that they could do better with the most current sequencing chemistry and library protocols, so they generated the data again. That second set of data did prove to be much better than the first.
We have been working with many different data sets, and sometimes many assemblies of the same data set, so we needed methods to assess the various assemblies. Here are the tools that we have been using to assess any given assembly. With the help of NCBI, the assemblies will be run through the NCBI QA pipeline. They will be assessed for contiguity, annotation, and concordance with any finished BAC sequences. Assembly-to-assembly alignments will also be performed between the PB assemblies and GRCh38.
As I mentioned, the BioNano Genome maps are being used to assess the assemblies.
We have also generated Illumina data for all of our assemblies. For the haploid samples in particular, any heterozygous calls resulting from the Illumina alignments are likely indicative of a collapse in the assembly. This data will also be used to assess potential mis-assemblies once they are identified by the other methods.
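One way to picture the heterozygous-call check is to bin the calls along each contig and flag bins where they cluster. This is a minimal sketch, not the pipeline used in this project: the input format (contig name, 1-based position), the 10 kb window, and the threshold of 5 calls are all assumptions chosen for illustration.

```python
from collections import Counter


def flag_collapse_windows(het_calls, window=10_000, min_calls=5):
    """Flag fixed-size windows that contain many heterozygous calls.

    het_calls: iterable of (contig, position) with 1-based positions.
    Returns a sorted list of (contig, window_start, window_end, n_calls)
    for every window holding at least min_calls calls.
    """
    # Count calls per (contig, window index).
    counts = Counter((contig, (pos - 1) // window) for contig, pos in het_calls)
    flagged = []
    for (contig, index), n_calls in sorted(counts.items()):
        if n_calls >= min_calls:
            start = index * window + 1
            flagged.append((contig, start, start + window - 1, n_calls))
    return flagged


if __name__ == "__main__":
    # Toy calls on hypothetical contigs.
    calls = [("ctg1", p) for p in (5, 100, 200, 300, 9_999)]
    calls += [("ctg1", 10_001), ("ctg2", 50)]
    print(flag_collapse_windows(calls))
```

In a haploid assembly, a flagged window is only a candidate: it would still need to be checked against the other evidence, such as the BioNano conflicts.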
For those not familiar with BioNano, this is a diagram that shows the workflow. It is a nanochannel technology where long DNA molecules are nicked and labeled at specific recognition sites. You end up with nick sites along the DNA molecules, similar to a restriction digest, only you have the added benefit of the nicks being in context with one another. Once the data is assembled, the resulting Genome maps can be used for SV detection, gap sizing, assembly QC, and scaffolding.
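To make the idea of labels "in context with one another" concrete, here is a small sketch that finds recognition-site positions in a sequence and reports the spacing between them, which is the kind of pattern a genome map encodes. The motif GCTCTTC (the recognition sequence of the nicking enzyme Nt.BspQI) is an assumption; the talk does not say which enzyme was used, and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T, N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def label_sites(seq, motif="GCTCTTC"):
    """Sorted 0-based start positions of the motif on either strand.

    The default motif is an assumed example, not the project's enzyme.
    """
    seq = seq.upper()
    sites = set()
    for pattern in {motif, reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            sites.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(sites)


def label_spacing(sites):
    """Distances between consecutive label sites."""
    return [b - a for a, b in zip(sites, sites[1:])]


if __name__ == "__main__":
    toy = "AAGCTCTTCAAAAGAAGAGCTT"  # toy sequence, one site per strand
    sites = label_sites(toy)
    print(sites, label_spacing(sites))
```

Comparing the spacing predicted from an assembly with the spacing observed on the molecules is what lets the maps highlight gaps, structural variants, and possible mis-assemblies.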
As I mentioned, once the CHM1 data was public, Adam Phillippy’s group assembled it as well. We ran both of these assemblies through the various assessment tools. This slide shows how the BioNano data scaffolded each of the assemblies. This shows the great contiguity that can be achieved through hybrid scaffolding. For Jason’s assembly, the contig N50 is 26 Mb, and together with the BN map we can achieve a scaffold N50 close to 50 Mb. Adam’s assembly is very similar: it starts with a contig N50 of 20 Mb and then we get a scaffold N50 of 40 Mb. This is fairly typical of what we have seen at MGI. The PacBio contig N50 nearly doubles when scaffolded with the BN data.
Here is an example of one of the BioNano hybrid scaffolds that was generated. The top line in green represents the hybrid scaffold, the first set of blue bars represents the PacBio assembly, and the S2 lines represent the BioNano map contigs. From this you can see how these two technologies are very complementary to one another. This is a snapshot of a 10 Mb portion of a larger hybrid scaffold.
Here is a region of the reference where sequencing with the CHM1 BAC library was necessary to resolve the sequence completely. This region of 1q21 contains the SRGAP2 gene family. SRGAP2 is a highly conserved gene family that is located in three regions along chromosome 1. The view shown here contains over 6 Mb of one of those regions. Because of the degree of similarity between the duplications in this region and the other two locations of SRGAP2, GRCh37 was badly mis-assembled here. In order to fix this region in the reference, we re-sequenced the entire region using the CHORI-17 BACs. In this view, along the very top, you see the BACs that make up the reference. In the next two tracks, you can see how Jason’s version of the PacBio CHM1 assembly aligns to the reference. The last track here is the segmental duplication track. You will notice that in places where there are quite a few segmental duplications, the assembly is much more fragmented. In the next slide, I will zoom into the boxed region.
On this view, the gray bars indicate regions that align very well, and the red marks are mismatches. You will notice that the larger contig in the middle aligns nearly perfectly to the CH17 BAC path, whereas the contigs in the segmentally duplicated regions do not align as well. In this large contig, the identity is over 99.9%, which is what you would expect since this is the same source as the reference, whereas in this contig, where there are known segmental duplications, the identity is not as high.
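For reference, percent identity can be computed from a pairwise alignment as matching columns divided by aligned columns. The sketch below uses one common convention, in which a gap opposite a base counts as a mismatch; other definitions exist, and the talk does not say which one the alignment viewer uses, so treat this as an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two equal-length aligned strings.

    '-' marks a gap. Columns where both sequences have a gap are ignored;
    a gap opposite a base counts as a mismatch (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("alignment has no informative columns")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Toy alignment: one mismatch in ten columns.
    print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))
```

At 99.9% identity there is about one difference per 1,000 aligned bases, so even a high-identity contig can carry many mismatches over several megabases.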
For CHM1 – next steps, we plan to move forward to improve Jason’s version of the assembly.
I want to switch gears and talk about our first gold genome sample, NA19240, a Yoruban sample. Here are the initial assembly stats for this sample. This is the first human whole genome sample we sequenced. The contig N50 is 6 Mb. It is not as contiguous as the haploid sample, but we think this is still pretty good considering this is a diploid sample.
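For anyone less familiar with the N50, N90, and N95 statistics, they can be computed by sorting contigs from longest to shortest and walking down the list until the running total reaches the given percentage of the assembly size. The sketch below uses toy contig lengths, not the NA19240 contigs.

```python
def nx(lengths, x=50):
    """Return the Nx contig length.

    Nx is the length of the contig at which the running total of
    contig lengths, taken from longest to shortest, first reaches
    x percent of the total assembly size.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * x:
            return length


if __name__ == "__main__":
    toy_lengths = [80, 70, 50, 40, 30, 20, 10]  # toy values, total 300
    print(nx(toy_lengths, 50), nx(toy_lengths, 90), nx(toy_lengths, 95))
```

Because the threshold rises with x, N90 and N95 are never larger than N50, which matches the ordering of the values on the slide.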
We used the BioNano data as a way to assess the NA19240 assembly, by doing both the hybrid scaffolding and calling SVs. Here are those results for this assembly. In this case, the hybrid scaffolding increased the scaffold N50 to almost 15 Mb. As part of our QC process for this assembly, we evaluated all of the conflicts found during the hybrid scaffolding process, as well as some of the SV calls. After evaluating all of this data, we narrowed the list down to the calls that seemed most likely to indicate an assembly issue. We then used the alignment of the Illumina data to help pinpoint where the contigs needed to be broken. From all of this, we were able to successfully make sequence breaks in a little over 40 regions. After the breaks were made, the contigs were aligned very stringently to GRCh38. From this, along with some manual curation, we were able to create chromosome AGPs for this assembly. This has been submitted to GenBank and can be found using the accession listed here.
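The breaking step itself is mechanically simple once the positions have been chosen. As a minimal sketch, assuming breakpoints are given as 0-based indices at which a new piece starts (the naming scheme and the function are hypothetical, not the tooling used here):

```python
def break_contig(name, sequence, breakpoints):
    """Split a contig at the given 0-based breakpoints.

    Each breakpoint is the index at which a new piece starts.
    Breakpoints outside the contig or at its ends are ignored.
    Returns a list of (piece_name, piece_sequence) in order.
    """
    cuts = sorted({bp for bp in breakpoints if 0 < bp < len(sequence)})
    edges = [0] + cuts + [len(sequence)]
    return [
        (f"{name}_{i + 1}", sequence[start:end])
        for i, (start, end) in enumerate(zip(edges, edges[1:]))
    ]


if __name__ == "__main__":
    # Toy contig broken at two positions.
    for piece in break_contig("ctg7", "AAAACCCCGGGG", [8, 4]):
        print(piece)
```

The hard part is the evidence gathering described above, that is, deciding where the breaks belong; the pieces are then realigned to the reference independently.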
As I mentioned earlier, there are regions of the assembly that we targeted with BAC sequencing because we knew that they would not assemble well in the context of the whole genome assembly. The NBPF gene family is one of those regions. This green bar shows the NBPF11 gene, and the next track here shows how the whole genome PacBio assembly aligns. This bottom track shows the segmental duplications. You will see that the PacBio assembly does pretty well until it gets to the region with more segmental duplication, and this is where all of the red bars are. Because of the similarity of the NBPF genes, it is likely that these contigs are collapsed and actually contain reads from multiple copies. We have BACs sequenced through this region. In this track, you can see how well they align, and they will resolve the region in the whole genome assembly.
For our next genome, HG00733, the Puerto Rican sample, we have completed a variety of assemblies, all assembled with different parameters. Here are just a few of the assemblies plotted by contig N50 and total assembly size. From these metrics, as well as a few others, we need to decide which assembly is the best one. Is it better to have a more contiguous assembly? Is it better to have as much of the total genome assembled?
Here is where all of the genomes currently stand. We have a lot of data and a lot of assemblies to work with, so right now our biggest task is getting these in shape to submit them.
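One way to frame the trade-off between contiguity and total size is to discard any assembly that another assembly beats on both metrics, and then choose among the rest using other evidence. This is a sketch under the assumption that a larger contig N50 and a larger total size are both desirable; that will not always hold, since extra size can come from redundant haplotype sequence. The assembly names and values are made up for illustration and are not the HG00733 results.

```python
def pareto_front(assemblies):
    """Return names of assemblies not dominated on (contig N50, total size).

    assemblies: dict mapping name -> (contig_n50, total_size).
    An assembly is dominated if another one is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, (n50, size) in assemblies.items():
        dominated = any(
            o_n50 >= n50 and o_size >= size and (o_n50 > n50 or o_size > size)
            for other, (o_n50, o_size) in assemblies.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


if __name__ == "__main__":
    # Hypothetical values: (contig N50 in Mb, total size in Gb).
    candidates = {
        "asm_a": (6.0, 2.82),
        "asm_b": (5.0, 2.84),
        "asm_c": (4.5, 2.81),
        "asm_d": (5.0, 2.83),
    }
    print(pareto_front(candidates))
```

The assemblies left on the front still have to be ranked by something else, which is where the other data sources come in.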
I want to acknowledge all of the collaborators on this project and all of the work that has gone into it thus far.