Creating Reference-Grade Human Genome Assemblies

Genome Reference Consortium
Sep. 30, 2016

Editor's Notes

  1. I want to thank the organizers for asking me to speak here today about creating reference-grade human genome assemblies.
  2. As part of our work as a member of the Genome Reference Consortium, we have been working to improve the current reference, GRCh38. In doing this work, we have found that there are still a few regions of the genome that are not fully resolved. There are still a few genes that are not optimally represented for all individuals or ancestries, although we have fixed quite a few of them in GRCh38. The reference is composed of sequence from many individuals, so there are regions where allelic diversity and structural variation present challenges in the assembly. Many of the newer technologies have allowed for improvements to the reference genome, but we realize that there is still a need for additional high-quality human genome assemblies to fully cover the range of genetic diversity in humans.
  3. This is a great example of how allelic differences can cause assembly problems. The gene UGT2B17 is known to be copy number variant: some individuals have one copy of this gene and other individuals lack it altogether. In Build 36, the clones in this region were all from the RP11 sample, which happened to be heterozygous for this indel. The blue, red, and black boxes represent the clone path through the region, the yellow blocks indicate annotated segmental duplication, and there were two genes annotated in this region. In Build 36 we were representing both the insertion and deletion alleles in the assembly. By removing the black clones from the path, we were able to close the gap. We then created an alternate allele from the black clones, which required sequencing one additional clone, so that GRCh37 ends up with both the insertion and deletion alleles. This changes our understanding of the biology of this region: we closed the gap that existed, we removed falsely annotated duplication, and there is really only one copy of this gene present in the assembly, with an allelic variant. This example shows how multiple haplotypes in the assembly can cause problems.
  4. As part of the GRC we have been focused on fixing the current reference as well as adding additional alleles where we can. More recently, we have been working on a project funded by NIH to sequence additional human reference genomes. These are the samples we have been working on. Originally we planned to sequence 5 gold genomes and 2 platinum genomes. Currently we believe we should be able to complete at least 7 gold genomes. These genomes will help to add diversity to the reference. I will spend the first portion of my talk telling you about one of the platinum genomes that we have been working with, CHM1, and then finish with some details about the gold genomes we are sequencing.
  5. As part of this project, we are generating ~60X coverage of PacBio long-read data. We will do a de novo assembly of that PacBio data. Then we are using a variety of additional tools to help inform the assembly. BioNano has been very useful in helping to scaffold the assembly as well as to identify potential mis-assemblies. We are also starting to work with 10x Genomics data as well as Dovetail data for the same reasons. We are also targeting difficult-to-assemble regions of the genome by sequencing BACs. Once the BACs are incorporated, we plan to align all of this data to the reference very stringently to produce chromosomal AGPs. The end product will be a very high-quality whole genome assembly.
  6. Here are the definitions we are using for both Platinum and Gold level assemblies. The Platinum genomes are single-haplotype sources. We plan to achieve a contiguous, haplotype-resolved representation of the entire genome for these samples. Both of the genomes we have worked on so far at this level have BAC libraries, which, as I mentioned before, will be used to help resolve regions of the genome that would be difficult to assemble at the whole genome level. The Gold genomes will be diploid sources, and all will be part of a trio. We are sequencing the child to deeper coverage and doing a lighter amount of sequencing on the parents, mainly to help sort out haplotypes in specific regions. We have BAC libraries for all of these genomes as well.
  7. To resolve some of the issues I mentioned that existed in the reference, especially the structurally variant regions that were most difficult to put together, a hydatidiform mole cell line was established. A hydatidiform mole is formed when an enucleated egg is fertilized by sperm. The cells go through several rounds of cell division, and the resulting DNA is a diploid copy of the exact same genetic material, so the sample is effectively haploid. This first sample is known as CHM1. A BAC library was created from the CHM1 source and has been used extensively in the reference to fix some of these difficult-to-assemble regions of the genome. By using a haploid source it is much easier to put together regions where there are segmental duplications. Once we realized the utility of this source, we decided to sequence the entire genome. This was first done years ago, and at the time the only cost-effective way to sequence an entire human genome was by generating Illumina data. A reference-guided assembly was produced with this Illumina data. It wasn’t long after that PacBio agreed to collaborate and the initial PacBio data was produced. That was ~54X coverage using the P5 chemistry. Then, early in 2015, PacBio believed that they could do better with the most current sequencing chemistry and library protocols, so they generated the data again. That second set of data did prove to be much better than the first.
  8. We have been working with many different data sets, and sometimes many assemblies of the same data set, so we needed methods to assess the various assemblies. Here are the tools that we have been using to assess any given assembly. With the help of NCBI, the assemblies will be run through the NCBI QA pipeline. They will be assessed for contiguity, annotation, and concordance with any finished BAC sequences. Assembly-to-assembly alignments will also be performed between the PacBio assemblies and GRCh38. As I mentioned, the BioNano genome maps are being used to assess the assemblies. We have also generated Illumina data for all of our assemblies. For the haploid samples in particular, any heterozygous calls resulting from the Illumina alignments are likely indicative of a collapse in the assembly. This data will also be used to assess the potential mis-assemblies once they are identified by these other methods.
  9. For those not familiar with BioNano, this is a diagram that shows the workflow. It is a nanochannel technology in which long DNA molecules are nicked and labeled at specific recognition sites. You end up with nick sites along the DNA molecules, similar to a restriction digest, only you have the added benefit of the nicks being in context to one another. Once the data is assembled, the resulting genome maps can be used for SV detection, gap sizing, assembly QC, and scaffolding.
  10. As I mentioned, once the CHM1 data was public, Adam Phillippy’s group assembled it as well. We ran both of these assemblies through the various assessment tools. This slide shows how the BioNano data scaffolded each of the assemblies, and the great contiguity that can be achieved through hybrid scaffolding. For Jason’s assembly, the contig N50 is 26 Mb, and together with the BioNano map we can achieve a scaffold N50 close to 50 Mb. Adam’s assembly is very similar: it starts with a contig N50 of 20 Mb and then we get a scaffold N50 of 40 Mb. This is fairly typical of what we have seen at MGI. The PacBio contig N50 nearly doubles when scaffolded with the BioNano data.
  11. Here is an example of one of the BioNano hybrid scaffolds that was generated. The top line in green represents the hybrid scaffold, the first set of blue bars represents the PacBio assembly, and the S2 lines represent the BioNano map contigs. From this you can see how these two technologies are very complementary to one another. This is a snapshot of a 10 Mb portion of a larger hybrid scaffold.
  12. Here is a region of the reference where sequencing with the CHM1 BAC library was necessary to resolve the sequence completely. This region of 1q21 contains the SRGAP2 gene family. SRGAP2 is a highly conserved gene family that is located in three regions along chromosome 1. The view shown here contains over 6 Mb of one of those regions. Because of the degree of similarity between the duplications in this region and the other two locations of SRGAP2, GRCh37 was very mis-assembled. In order to fix this region in the reference, we re-sequenced the entire region using the CHORI-17 BACs. In this view, along the very top, you see the BACs that make up the reference. In the next two tracks, you can see how Jason’s version of the PacBio CHM1 assembly aligns to the reference. The last track here is the segmental duplication track. You will notice that in places where there are quite a few segmental duplications, the assembly is much more fragmented. In the next slide, I will zoom into the boxed region.
  13. In this view, the gray bars indicate regions that align very well, and the red marks are mismatches. You will notice that the larger contig in the middle aligns nearly perfectly to the CH17 BAC path, whereas the contigs in the segmentally duplicated regions do not align as well. In this large contig, the percent identity is over 99.9%, which is what you would expect since this is the same source as the reference, whereas in this contig, where there are known segmental duplications, the identity is not as high.
  14. As next steps for CHM1, we plan to move forward with improving Jason’s version of the assembly.
  15. I want to switch gears and talk about our first gold genome sample, NA19240, a Yoruban sample. Here are the initial assembly stats for this sample. This is the first human whole genome sample we sequenced. The contig N50 length is 6 Mb. It is not as contiguous as the haploid sample, but we think this is still pretty good considering this is a diploid sample.
  16. We used the BioNano data as a way to assess the NA19240 assembly, by doing both the hybrid scaffolding and SV calling. Here are those results for this assembly. In this case, the hybrid scaffolding increased the scaffold N50 length to almost 15 Mb. As part of our QC process for this assembly, we evaluated all of the conflicts found during the hybrid scaffolding process, as well as some of the SV calls. After evaluating all of this data, we narrowed the list down to the calls that seemed most likely to indicate an assembly issue. We then used the alignment of the Illumina data to help pinpoint where the contigs needed to be broken. From all of this, we were able to successfully make sequence breaks in a little over 40 regions. After the breaks were made, the contigs were aligned very stringently to GRCh38. From this, along with some manual curation, we were able to create chromosome AGPs for this assembly. This has been submitted to GenBank and can be found using the accession listed here.
  17. As I mentioned earlier, there are regions of the assembly that we targeted with BAC sequencing because we knew that they would not assemble well in the context of the whole genome assembly. The NBPF gene family is one of those regions. This green bar shows the NBPF11 gene, and the next track here shows how the whole genome PacBio assembly aligns. This bottom track shows the segmental duplications. You will see that the PacBio assembly does pretty well until it gets to the region of more segmental duplication, and this is where all of the red bars are. Because of the similarity of the NBPF genes, it is likely that these contigs are collapsed and actually contain reads from multiple gene copies. We have BACs sequenced through this region. In this track, you can see how well they align and how they will resolve the region in the whole genome assembly.
  18. For our next genome, HG00733, the Puerto Rican sample, we have completed a variety of assemblies, all assembled with different parameters. Here are just a few of the assemblies plotted by contig N50 length and total assembly size. From these metrics, as well as a few others, we need to decide which assembly is the best one. Is it better to have a more contiguous assembly? Is it better to have as much of the total genome assembled as possible?
  19. Here is where all of the genomes currently stand. We have a lot of data and a lot of assemblies to work with, so right now our biggest task is getting these in shape to submit them.
  20. I want to acknowledge all of the collaborators on this project and all of the work that has gone into it thus far.
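
Notes 10, 15, and 16 quote contig and scaffold N50 values without defining the metric. As a minimal sketch of the usual definition (the length L such that sequences of length L or longer hold at least half of the total assembled bases), the Python below computes N50 for a toy set of lengths. The function name `n50` and all numbers are illustrative assumptions, not values or tooling from the talk, and the scaffold example ignores gap lengths for simplicity.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    N50 is taken here as the length of the sequence at which the running
    total, walking from longest to shortest, first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in Mb (toy values, not from any real assembly).
contigs = [10, 8, 6, 4, 2]

# Hypothetical scaffolds built by joining contigs 10+6 and 8+4; the 2 Mb
# contig stays unplaced. Gap sizes are ignored in this sketch.
scaffolds = [16, 12, 2]

print("contig N50:", n50(contigs))      # 10 + 8 = 18 >= 15, so N50 is 8
print("scaffold N50:", n50(scaffolds))  # 16 >= 15, so N50 is 16
```

In this toy case, joining contigs into scaffolds doubles the N50 without adding any sequence, which is the same kind of effect described for the hybrid scaffolds in note 10.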