2. The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some
regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major
challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have
allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to
represent the full range of genetic diversity in humans.
Human Phylogenetic Tree
(Li et al., 2008)
World Map with Sample Origins
The MGI Reference Genomes
Funded by the NIH, the MGI Reference Genomes Improvement Project
aims to increase the quality and diversity of existing scientific resources.
We will sequence and assemble at least 5 diploid genomes from
individuals selected to maximize human genetic diversity (right). All
sources have BAC libraries available and whenever possible, we will use
samples from a trio (two parents and child). We will sequence the parents
within the trio at a lower depth of coverage to enable haplotype phasing of
the proband sequence. Other independent efforts to sequence and
assemble new reference genomes include two Japanese, one Malaysian,
a Han Chinese and an Ashkenazim trio (as part of the Genome in a Bottle
Samples to be Sequenced
5. Definitions of Genome Level
• Platinum Genome
• Haploid genome source
• Contiguous, haplotype-resolved representation of entire genome
• BAC library available
• Gold Genome
• Diploid genome source
• Part of a trio
• Parents will be sequenced to help haplotype resolve some
• BAC libraries available
• Targeted regions sequenced using these BAC libraries
• Will contain some haplotype resolved regions
6. CHM1: A Key Resource for Improving the Reference
• CHM1 cell line established from a haploid hydatidiform
mole (complete, paternal; 46XX) (U.Surti)
• CHORI-17 BAC library (P. deJong)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)
• CHORI-17 BACs
• >750 have been sequenced
• 664 of them in Genbank as phase 3 sequence
• CHM1 WGS assembly
• Initial assembly produced from >100X coverage of Illumina data
• Initial PacBio assembly produced using ~54X of P5 PacBio data
• Latest PacBio assembly produced using ~60X of P6 PacBio data
7. CHM1 P5 vs P6 read length distributions
Mapped Concordance (%)
% of Bases in Reads > 30,000 bases
8. CHM1 Assembly Comparisons
                      P5 chemistry       P6 chemistry       P6 chemistry
                      (~54X)             (61X)              (61X)
                      Jason Chin         Jason Chin         Adam Phillippy's group
# Contigs             26,312             3,641              4,849
Max Contig Length     44,873,077 bp      109,312,888 bp     99,566,047 bp
Total Assembly Size   3,239,081,299 bp   2,996,426,293 bp   2,939,630,703 bp
N50                   4,498,608 bp       26,899,841 bp      20,609,304 bp
N90                   30,687 bp          1,686,030 bp       1,188,604 bp
N95                   17,815 bp          149,494 bp         95,419 bp
11. Using BioNano to Compare CHM1 Assemblies
Hybrid WGS Conflicts 45 52
Hybrid BN Conflicts 51 63
SV - Deletions 35 25
SV - Insertions 32 31
SV - Inversions 7 12
SV - End 126 190
SV - Translocation_Interchr 332 529
12. Assembly Assessment Methods
• Assemblies will run through NCBI QA pipeline
• Assessed for contiguity, annotation, and concordance with the
• Assembly-assembly alignments will be generated between each PB
assembly and GRCh38
• BioNano Genome Map
• SV calls generated from comparing the BioNano data to each of the
• Hybrid scaffolding conflicts will also point out potential assembly
• Alignment of the Illumina reads back to each of the
• Heterozygous calls are likely indicative of a collapse in the
assembly (for the haploid genomes)
13. 1q21 Region – GRCh38 vs GCA_001297185
Seg Dup Track
14. 1q21 Region - GRCh38 vs GCA_001297185
Seg Dup Track
15. First Gold Genome - NA19240
Initial Assembly Stats
# Seq Contigs 3569
Max Contig Length 20,393,869bp
Total Assembly Size 2,745,634,789 bp
N50 6,003,115 bp
N90 848,151 bp
N95 345,457 bp
• NA19240 – Yoruban sample
• Generated >70X raw PacBio data
• Assembled on DNAnexus platform using Falcon pipeline
22. Which Assembly is Best?
[Figure: HG00733 Puerto Rican Assembly Stats. Assemblies plotted by contig length N50 against Total Assembly Size (Gb).]
• Use other sources to assess multiple assemblies
• Linked long reads
23. Genome Status
Data Source   Origin         Level      Status
CHM1          NA             Platinum   Assembly Assessment
CHM13         NA             Platinum   Assembly Assessment
NA19240       Yoruban        Gold       Analysis Underway
HG00733       Puerto Rican   Gold       Assembly QC
HG00514       Han Chinese    Gold       Assembly QC
NA12878       European       Gold       Data Generation Underway
HG01352       Colombian      Gold       Not Started Yet
24. Next Steps
• Platinum Genomes
• Select the best CHM1 and CHM13 assembly and then improve those
further using BioNano and other tools
• Incorporate the BACs into the assemblies
• Create Chromosomal AGPs
• Gold Genomes
• Finish analysis of the first Gold Genome
• Data production is now complete on two other Gold genomes and
assemblies for those are underway
• Data production is underway on the 4th Gold genome
• BACs are being sequenced for many of these genomes
The McDonnell Genome Institute at
Washington University in St. Louis
Karyn Meltz Steinberg
University of Washington
University of Pittsburgh
School of Medicine
(CHM1 and CHM13 cell line)
As a member of the Genome Reference Consortium, we have been working to improve the current reference, GRCh38. In doing this work, we have found that a few regions of the genome are still not fully resolved, and a few genes are still not optimally represented for all individuals or ancestries, although we fixed quite a few of them in GRCh38. The reference is composed of DNA from many individuals, so there are regions where allelic diversity and structural variation present challenges for the assembly. Many of the newer technologies have allowed for improvements to the reference genome, but there is still a need for additional high-quality human genome assemblies to fully cover the range of genetic diversity in humans.
At MGI we are working on an NIH-funded project to sequence additional human reference genomes. These are the samples we plan to sequence. There are currently 6 Gold genomes and 2 Platinum genomes planned. We will sequence a Puerto Rican sample, a Han Chinese sample, a Colombian sample, and two African samples. We also plan to improve on the European NA12878 sample. These genomes will help add diversity to the reference. I will spend the first portion of my talk telling you about one of the Platinum genomes we have been working with, CHM1, and then finish with some details about the first of the Gold genomes we are sequencing.
Vince mentioned this, but I want to touch on it again briefly. We are generating ~60X coverage of PacBio long-read data and will do a de novo assembly of that data. We are then using a variety of additional tools to help inform the assembly. BioNano has been very useful in helping to scaffold the assembly as well as to identify potential mis-assemblies. We are also starting to work with 10X Genomics data and Dovetail data for the same reasons. We are also targeting difficult-to-assemble regions of the genome by sequencing BACs. Once the BACs are incorporated, we plan to align all of this data to the reference very stringently to produce chromosomal AGPs. The end product will be a very high-quality whole-genome assembly.
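To make the ~60X figure concrete: fold coverage is total sequenced bases divided by genome size. The sketch below is illustrative only. The function names are hypothetical, and the 3.1 Gb genome size is a round-number assumption, not a figure from this project.

```python
def fold_coverage(read_lengths, genome_size):
    """Mean fold coverage: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def bases_needed(target_coverage, genome_size):
    """Total bases required to reach a target fold coverage."""
    return target_coverage * genome_size


# Assumed genome size of ~3.1 Gb (illustrative round number).
ASSUMED_GENOME_SIZE = 3_100_000_000

# Bases of long-read data needed for ~60X under that assumption.
print(bases_needed(60, ASSUMED_GENOME_SIZE))  # 186000000000
```

Under that assumption, ~60X corresponds to roughly 186 Gb of raw read data.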
Here are the definitions we are using for Platinum- and Gold-level assemblies. The Platinum genomes are single-haplotype sources. We plan to achieve a contiguous, haplotype-resolved representation of the entire genome for these samples. Both of the genomes we have worked on so far at this level have BAC libraries, which, as I mentioned before, will be used to help resolve regions of the genome that would be difficult to assemble at the whole-genome level. The Gold genomes will be diploid sources, and all will be part of a trio. We are sequencing the child to deeper coverage and doing a lighter amount of sequencing on the parents, mainly to help sort out haplotypes in specific regions. We have BAC libraries for all of these genomes as well.
To resolve some of the issues I mentioned in the reference, especially the structurally variant regions that were most difficult to put together, a hydatidiform mole cell line was established. A hydatidiform mole is formed when an enucleated egg is fertilized by a sperm. The cells go through several rounds of cell division, and the resulting DNA is a diploid copy of the exact same genetic material. This first sample is known as CHM1. A BAC library was created from the CHM1 source and has been used extensively in the reference to fix some of these difficult-to-assemble regions of the genome. With a haploid source it is much easier to put together regions that contain segmental duplications. Once we realized the utility of this source, we decided to sequence the entire genome. This was first done years ago, when the only cost-effective way to sequence an entire human genome was to generate Illumina data, and a reference-guided assembly was produced from that Illumina data. It wasn’t long after that that PacBio agreed to collaborate and the initial PacBio data was produced, at ~54X coverage using the P5 chemistry. Then, early in 2015, PacBio believed they could do better with the most current sequencing chemistry and library protocols, so they generated the data again. That second data set did prove to be much better than the first.
Here is a comparison of the read lengths of the P5 data and the more recent P6 data. You can see that in the most recent data, over 17% of the reads were 30 kb or longer. In the original data set, less than 1% of the reads were that length.
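The slide axis is labeled "% of Bases in Reads > 30,000 bases", which is a different statistic from the percentage of reads above that length. As a rough illustration of the difference, here is a minimal sketch; the function names and the toy read lengths are invented for this example and are not the CHM1 data.

```python
def fraction_of_reads_at_least(read_lengths, min_len):
    """Fraction of reads whose length is at least min_len."""
    long_reads = sum(1 for n in read_lengths if n >= min_len)
    return long_reads / len(read_lengths)


def fraction_of_bases_in_reads_at_least(read_lengths, min_len):
    """Fraction of all sequenced bases that sit in reads of at least min_len."""
    long_bases = sum(n for n in read_lengths if n >= min_len)
    return long_bases / sum(read_lengths)


# Toy read lengths (made up): one long read among four.
toy_reads = [5_000, 10_000, 15_000, 40_000]

print(fraction_of_reads_at_least(toy_reads, 30_000))           # 0.25
print(fraction_of_bases_in_reads_at_least(toy_reads, 30_000))  # 40000/70000
```

Because long reads carry more bases each, the base-weighted fraction is always at least as large as the read-count fraction.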
Besides the update in chemistries, there have also been improvements in the algorithms used to assemble these genomes. Here is a comparison of the P5 and P6 data, both assembled by Jason Chin at PacBio; the most recent data was also assembled by Adam Phillippy’s group. In both instances you can see that the N50 has improved greatly.
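For anyone unfamiliar with the N50, N90, and N95 statistics on the slide: Nx is the contig length at which contigs of that length or longer contain at least x% of the assembled bases. The sketch below shows one common way to compute it; the function name and the toy contig lengths are made up for illustration.

```python
def nx(contig_lengths, x):
    """Return Nx: the length L of the contig at which the running total,
    taken from longest to shortest, first reaches x% of all assembled bases."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


# Toy contig lengths (made up), 300 bases in total.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]

print(nx(toy_contigs, 50))  # 70
print(nx(toy_contigs, 90))  # 30
print(nx(toy_contigs, 95))  # 20
```

Note that Nx is computed against the assembled total here, so two assemblies with different total sizes are not compared against the same denominator.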
Because we now have multiple assemblies of the same data, we plan to use the BioNano data as a way to compare the different versions of the assemblies. Here are the stats of both Jason’s and Adam’s assemblies when run through the hybrid scaffolding pipeline. This shows the great contiguity that can be achieved through hybrid scaffolding. For Jason’s assembly, the contig N50 is 26 Mb, and combining it with the BN map to create hybrid scaffolds gives a scaffold N50 close to 50 Mb. Adam’s assembly is very similar: it starts with a contig N50 of 20 Mb and reaches a scaffold N50 of 40 Mb. This is fairly typical of what we have seen at MGI: the PacBio contig N50 nearly doubles when scaffolded with the BN data.
Here is an example of one of the BioNano hybrid scaffolds that was generated. The top line in green represents the hybrid scaffold, the first set of blue bars represents the PacBio assembly, and the S2 lines represent the BioNano map contigs. From this you can see how complementary these two technologies are. This is a snapshot of a 10 Mb portion of a larger hybrid scaffold.
We are using the hybrid scaffolding output as well as the SV calls from our BioNano comparisons to help us evaluate each version of the latest CHM1 assemblies. Just by looking at the raw numbers, it looks like Jason’s version might be better, but we still need to look through the data more. We’ve looked through a portion of the translocation calls, and some of these indicate joins that could potentially be made between two PacBio contigs. We need to look at more of the calls, though, before concluding which of these two assemblies is best. We also have a few other metrics we will use to make the final decision on which assembly to move forward with.
Here are some of the other methods we will use to compare these assemblies. With the help of NCBI, these assemblies will be run through the NCBI QA pipeline. They will be assessed for contiguity, annotation, and concordance with any finished BAC sequences. Assembly-assembly alignments will also be performed between each of the PB assemblies and GRCh38.
As I mentioned, the BioNano Genome maps are being used to assess the assemblies.
We have also generated Illumina data for all of our assemblies. For the haploid samples in particular, any heterozygous calls resulting from the Illumina alignments are likely indicative of a collapse in the assembly. This data will also be used to assess the potential mis-assemblies once identified by these other methods.
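The idea of using heterozygous calls as a collapse signal in a haploid assembly can be sketched as follows. This is a hedged illustration, not our actual QC pipeline: the function names, the 10 kb window, and the threshold of 5 calls are all assumptions chosen for the example, and the input is a plain list of (contig, 1-based position, VCF-style genotype string) tuples.

```python
from collections import Counter


def is_heterozygous(genotype):
    """True if a VCF-style genotype such as '0/1' or '0|1' has two different alleles."""
    alleles = [a for a in genotype.replace("|", "/").split("/") if a != "."]
    return len(set(alleles)) > 1


def het_dense_windows(calls, window=10_000, min_hets=5):
    """Return (contig, start, end, n_hets) for fixed windows holding at least
    min_hets heterozygous calls. Window size and threshold are illustrative."""
    counts = Counter()
    for contig, pos, genotype in calls:
        if is_heterozygous(genotype):
            counts[(contig, (pos - 1) // window)] += 1
    return sorted(
        (contig, idx * window + 1, (idx + 1) * window, n)
        for (contig, idx), n in counts.items()
        if n >= min_hets
    )


# Toy calls (made up): a cluster of het calls on ctg1, an isolated one on ctg2.
toy_calls = [("ctg1", p, "0/1") for p in (100, 200, 300, 400, 500)]
toy_calls += [("ctg1", 600, "1/1"), ("ctg2", 50, "0/1")]

print(het_dense_windows(toy_calls))  # [('ctg1', 1, 10000, 5)]
```

Isolated heterozygous calls are ignored here on the assumption that they are more likely sequencing or alignment noise than collapsed duplications; a real analysis would need to tune both parameters.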
Here is a view of 1q21 in GRCh38; the SRGAP2 gene family is located in this region. This is a highly conserved gene family that is located at three regions on chromosome 1. This view covers over 6 Mb of that region. Because of the degree of similarity between the duplications in this region and the other two locations of SRGAP2, GRCh37 was badly mis-assembled here. To fix this region in the reference, we re-sequenced the entire region using the CHORI-17 BACs, the single-haplotype source. So this region of GRCh38 is made up of clones from CHM1. In this view, you can see how Jason’s version of the PacBio CHM1 assembly aligns to the reference. You will notice that in places where there are quite a few segmental duplications, the assembly is much more fragmented.
This is a zoomed-in view from the previous slide. You will notice that the larger contig in the middle aligns nearly perfectly to the CH17 BAC path, whereas the contigs in the segmentally duplicated regions do not align as well. In this large contig, the identity is over 99.9%, which is what you would expect since this is the same source as the reference, whereas in this contig, where there are known segmental duplications, the identity is not as high. Any of the red marks in the grey bars represent mismatches.
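For reference, percent identity over an alignment can be computed as below. This is a minimal sketch under one common definition (matching columns divided by all alignment columns, gaps included); the viewer on the slide may use a different definition, and the function name and toy sequences are made up.

```python
def percent_identity(ref_aln, query_aln):
    """Percent identity of two gapped, equal-length alignment strings.
    Identity = matching columns / all columns, with gap columns counted
    as non-matching. Columns where both sequences are gaps are skipped."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    columns = 0
    matches = 0
    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if r == "-" and q == "-":
            continue
        columns += 1
        if r == q:
            matches += 1
    if columns == 0:
        raise ValueError("alignment has no usable columns")
    return 100.0 * matches / columns


# Toy alignment (made up): one mismatch in ten columns.
print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))  # 90.0
```

At 99.9% identity, a 1 Mb aligned contig would still carry on the order of a thousand mismatching or gapped columns, which is why the red marks remain visible even in the well-aligned contig.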
Now I will switch gears and talk about our first Gold genome sample, NA19240, a Yoruban sample. The initial assembly was done on the DNAnexus platform for both speed and ease of assembly. We had previously assembled smaller PacBio genomes, but nothing of this size and with this amount of data. All other human assemblies we have worked with have been assembled by the experts. We were not sure we could get an assembly of this size to finish in a timely manner. On the DNAnexus platform, the assembly and Quiver steps finished in less than 2 weeks. We have since been able to assemble this data on our own cluster, but quite a few modifications were needed to make everything work correctly. Here are the initial assembly stats for this sample. The contig N50 is 6 Mb. It is not as contiguous as the haploid samples, but we think this is still pretty good considering this is a diploid sample.
As I mentioned earlier, we have been using the BioNano data as a way to assess our assembly, both by doing the hybrid scaffolding and by calling SVs. Here are those results for this sample. In this case, the hybrid scaffolding increased the scaffold N50 to almost 15 Mb. As part of our QC process for this assembly, we evaluated all of the conflicts found during the hybrid scaffolding process, as well as some of the SV calls. After evaluating all of this data, we narrowed the list down to the calls that seemed most likely to indicate an assembly issue. We then used the alignment of the Illumina data to help pinpoint where the contigs needed to be broken. From all of this, we were able to successfully make sequence breaks in a little over 40 regions.
Here is an example of one of the regions that was corrected as a result of this QC process. The sequence contig is on top. The region in the brackets was identified as a conflict during hybrid scaffolding. From this, we were able to identify the mis-assembly in the PacBio contig; the contig was broken and the region was flipped. When the corrected assembly is compared back to the BN map, the maps now align more consistently.
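Flipping a region of a contig means replacing it with its reverse complement while leaving the flanking sequence untouched. Here is a minimal sketch of that operation; the function names and the toy sequence are invented for illustration, coordinates are assumed to be 0-based and half-open, and this is not the tooling we actually used to make the breaks.

```python
# Complement table for unambiguous bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing A, C, G, T, or N."""
    return seq.translate(_COMPLEMENT)[::-1]


def invert_region(contig, start, end):
    """Return the contig with [start, end) replaced by its reverse complement.
    Coordinates are 0-based, half-open (an assumption of this sketch)."""
    if not 0 <= start <= end <= len(contig):
        raise ValueError("region must lie within the contig")
    return contig[:start] + reverse_complement(contig[start:end]) + contig[end:]


# Toy contig (made up): flip the middle three bases.
print(invert_region("AAACCGTTT", 3, 6))  # AAACGGTTT
```

Applying the same inversion twice restores the original sequence, which is a convenient sanity check after editing a contig.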
At the time we were making the initial breaks in the assembly, we didn’t have the alignments of the NA19240 assembly to the reference, but we do have them now. This is that same region of the PacBio assembly, the original version, aligned to GRCh38. The top panel represents the reference and the bottom panel is the original PacBio contig. All three blue blocks represent portions of the same PacBio contig, and the arrows indicate the direction of the alignment. The reference alignment confirms what we had found with the BN data: the middle portion of this PacBio contig needed to be flipped.
Here is a diagram of the alignment of the NA19240 assembly to the reference through the CCL region that Deanna mentioned in her talk. We know this region to be structurally variant. Here you can see that our PacBio assembly is very contiguous in the areas where there are very few duplications, but in the segmentally duplicated regions it is fragmented.
This slide is a zoomed-in view of one of those segmentally duplicated regions. As I mentioned earlier, we are sequencing targeted BAC clones in some of these known structurally variant regions. In this slide, you can see how those initial clone assemblies align. For the targeted BACs, we are initially sequencing all of the clones with Illumina data, then using that initial data to select a clone path, and we will improve those clones. In this view, we have 2 clones aligned. This line represents a contig from one of those clones and how it aligns through this region. This is the same contig, and it looks to align very similarly to all three of these regions. In this case, we will need to finish the clone to understand the correct alignment through here. We have just begun to pick the path of BACs that will be needed to resolve these regions.
This is another region that we have targeted with BACs. The NBPF genes are located in many places along chromosome 1, and in most of those regions there are segmental duplications as well. Here is an alignment of one of those genes. The gene alignment is seen here in green and the PacBio assembly alignment is in gray. The bottom portion in darker gray indicates segmental duplication. The pink areas of these gray bars indicate mismatches, so you can see that through this region there are quite a few mismatches in our assembly. This gray bar shows the alignment of one clone. This BAC is nearly contiguous with just the Illumina data, so in this case it is easy to identify a clone to resolve this region of the assembly.
For our next genome, HG00733, the Puerto Rican sample, we have completed a variety of assemblies, all assembled with different parameters. Here are just a few of the assemblies plotted by Contig length N50 and total assembly size. From these parameters as well as a few others, we need to decide on which assembly is the best one. Is it better to have a more contiguous assembly? Is it better to have as much of the total genome assembled?
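One simple way to frame the contiguity versus completeness trade-off is to discard any assembly that another assembly beats on both axes and then choose among what is left using the other metrics. The sketch below does that filtering. It is only an illustration: the function name is hypothetical, the numbers are invented rather than taken from the HG00733 assemblies, and treating a larger total size as strictly better is itself an assumption, since extra bases can also come from redundant haplotype sequence.

```python
def pareto_front(assemblies):
    """Return names of assemblies not dominated on (contig N50, total size).
    Assembly A dominates B if A is at least as good on both metrics and
    strictly better on at least one. Larger is assumed better for both."""
    front = []
    for name, (n50, size) in assemblies.items():
        dominated = any(
            o_n50 >= n50 and o_size >= size and (o_n50 > n50 or o_size > size)
            for other, (o_n50, o_size) in assemblies.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Invented values: (contig N50 in Mb, total assembly size in Gb).
toy_assemblies = {
    "asm_a": (6.0, 2.82),
    "asm_b": (5.5, 2.84),
    "asm_c": (5.0, 2.81),
    "asm_d": (6.0, 2.80),
}

print(pareto_front(toy_assemblies))  # ['asm_a', 'asm_b']
```

The filter does not answer the question of which remaining assembly is best; it only narrows the candidates that the other assessments (BioNano, Illumina alignments, linked long reads) then have to separate.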
I want to acknowledge all of the collaborators on this project and all of the work that has gone into it thus far.