Ashg2017 workshop tg


  1. 1. Reference-Grade Human Genome Assemblies. Tina Graves Lindsay. GRC - GIAB Workshop at ASHG, Oct 17, 2017
  2. 2. The Human Reference is a Work in Progress!
     • The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.
     • GRCh38 is composed of DNA from several individual humans.
     • Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.
     • New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.
     • Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.
  3. 3. UGT2B17: Conflicting Alleles (Xue Y et al., 2008)
     • NCBI36 NC_000004.10 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7. Annotated genes: TMPRSS11E, TMPRSS11E2.
     • GRCh37 NC_000004.11 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7. Annotated gene: TMPRSS11E.
     • GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7. Annotated gene: TMPRSS11E2.
     • Figure label: GAP
  4. 4. Samples to be Sequenced
  5. 5. Sequencing Plan
  6. 6. Genome Status
     Data Source  Origin           Assembly Accession  Status
     CHM1         NA               GCA_001297185.1     Assembly Improvement
     CHM13        NA               GCA_000983455.2     Assembly Assessment
     NA19240      Yoruban          GCA_001524155.4     Chr-level Assembly Submitted
     HG00733      Puerto Rican     GCA_002208065.1     Contig Assembly Submitted
     HG00514      Han Chinese      GCA_002180035.1     Contig Assembly Submitted
     NA12878      European         GCA_002077035.2     Chr-level Assembly Submitted
     HG01352      Colombian        GCA_002209525.1     Contig Assembly Submitted
     HG02818      Gambian          -                   Assembly Underway
     HG02059      Kinh-Vietnamese  -                   Assembly Assessment
     NA19434      Luhya            -                   Assembly Assessment
     HG04217      Telugu           -                   Data Production Underway
     HG03486      Mende            -                   Assembly Underway**
     ** First Sequel-only data set
  7. 7. Assembly Stats
     Genome   Total Size  # Contigs  Contig N50
     NA19240  2.84 Gb     2965       25.7 Mb
     HG00733  2.88 Gb     3580       22.2 Mb
     NA12878  2.86 Gb     3663       14.5 Mb
     HG01352  2.88 Gb     3120       22.8 Mb
     HG00514  2.87 Gb     3160       25.3 Mb
     NA19434  2.86 Gb     3083       21.6 Mb
     HG02059  2.89 Gb     3148       26.0 Mb
  8. 8. Assembly QC and Submission Steps
     • Multiple Falcon assemblies
     • Using stats and alignment to Bionano, pick the best assembly
     • Quiver and Pilon on best assembly
     • Use Bionano to identify mis-assemblies
     • Submit contig-level AGPs to GenBank
     • Run through NCBI assembly QA pipeline
     • Evaluate and curate output of QA pipeline
     • Generate final chromosome-level AGPs and submit
     • Annotation of chromosome-level assembly
  9. 9. Hybrid Scaffold (figure tracks: Hybrid Scaffold, PacBio Contigs, BioNano Contigs)
  10. 10. Hybrid Stats
      Genome   Seq Assem     Seq Assem        Seq Assem        BN Hybrid       BN Hybrid          BN Hybrid
               # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaffold N50 (Mb)  Total Size (Gb)
      NA19240  2889          26.3             2.87             218             39.9               2.82
      NA12878  3551          15.1             2.86             270             28.7               2.83
      HG00514  3190          24.2             2.88             208             37.0               2.83
      HG00733  3553          22.8             2.88             167             48.8               2.87
      HG01352  3077          22.8             2.88             220             40.0               2.84
      NA19434  3083          21.9             2.86             253             34.7               2.83
      HG02059  3148          26.1             2.90             242             37.2               2.83
  11. 11. NA19240 Assembly Assessment
                         Initial Calls  Breaks made
      Conflicts          51             35
      Translocation SV   321            16
      Complex            123            9
      Nucmer Alignments  -              9
      Total breaks made                 69

                     Contig #  Contig N50  Total Assembly Size
      Before Breaks  2889      26.4 Mb     2.87 Gb
      After Breaks   2951      25.7 Mb     2.87 Gb
  12. 12. NA19240 contig break
  13. 13. Chimeric PacBio Contig (figure labels: GRCh38 Chr 1, GRCh38 Chr 4, NA19240 Contig, Segmental Duplications)
  14. 14. NA19240 Inversion Compared to GRCh38 (figure tracks: GRCh38, NA19240 Bionano Contigs)
  15. 15. Bionano Identified SVs Compared to GRCh38
      Genome                  Deletions  Insertions  Inversions
      Yoruban (NA19240)       756        1795        8
      European (NA12878)      750        1791        17
      Han Chinese (HG00514)   743        1724        8
      Puerto Rican (HG00733)  743        1862        27
      Colombian (HG01352)     711        1661        6
      Vietnamese (HG02059)    626        1536        4
      Luhya (NA19434)         694        1643        10
      Mende (HG03486)         871        1888        3
  16. 16. NA19240 MHC Region (figure tracks: GRCh38, Bionano Contigs)
  17. 17. NA19240 MHC Region (figure tracks: NA19240, Reference, Alts; ~65 kb insertion)
  18. 18. CYP2D6 Alternate Alleles (courtesy of Karyn Meltz Steinberg)
  19. 19. NA12878 CYP2D6 Region in Bionano Map (figure tracks: GRCh38, NA12878 allele 1, NA12878 allele 2)
  20. 20. NA12878 CYP2D6 Region in Bionano Map (figure tracks: GRCh38, NA12878 allele 1, NA12878 allele 2)
  21. 21. Falcon Assembly of NA12878 in CYP2D6 Region
      • Genes shown: CYP2D8, CYP2D7, CYP2D6
      • Alignment of NA12878 to GRCh38
      • Region of NA12878 that doesn’t exist in GRCh38
      • Shows duplication of CYP2D7 gene in NA12878 genome
  22. 22. Falcon Unzip
  23. 23. Falcon Unzip Assemblies
      Gambian (HG02818) Assembly
                       Contig #  Assembly Length  Contig N50  Avg Contig Length  Largest Contig
      Primary Contigs  1220      2.83 Gb          21.63 Mb    2.31 Mb            83.00 Mb
      Haplotigs        11,686    2.45 Gb          443.3 Kb    210 Kb             3.41 Mb

      Yoruban (NA19240) Assembly (not polished yet)
                       Contig #  Assembly Length  Contig N50  Avg Contig Length  Largest Contig
      Primary Contigs  1,801     2.83 Gb          21.16 Mb    1.57 Mb            81.12 Mb
      Haplotigs        13,130    2.49 Gb          458.2 Kb    190 Kb             3.23 Mb
  24. 24. 10X Genomics Overview (DNA) (Church 10X Genomics)
  25. 25. 10X Data: Separating a Heterozygous Allele (figure tracks: GRCh38, NA12878 Falcon, 10X Allele 1, 10X Allele 2; heterozygous SV identified by Bionano; 10X Supernova assembly used: GCA_002022845.1)
  26. 26. Short Term Future Plans
      • Lots of assemblies to analyze!
      • Generate the latest Falcon Unzip assemblies for all samples
      • Improve those assemblies
      • Identifying misassemblies
      • Making the breaks where needed
      • Scaffolding the assemblies
      • Incorporating BACs as they are finished
      • Create chromosomal AGPs
      • Submit to GenBank
  27. 27. Longer Term Future Work
      • Better utilization of the reference
      • Mapping strategies
      • Graph-based alignments
      • Other alt-aware read mapping strategies
      • Alternative reference data display challenges: how should we present data?
      • Do we continue the current scheme of alt alleles?
      • Full reference sequences?
      • 2 haplo-resolved sequences for each allele
      • Using Falcon Unzip
      • Using 10X
      • Other technologies?
  28. 28. Acknowledgements
      • The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg
      • University of Washington: Evan Eichler
      • NCBI: Valerie Schneider
      • University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
      • BioNano Genomics: Alex Hastie
      • Pacific Biosciences: Nick Sisneros, Sarah Kingan, Luke Hickey, Greg Concepcion
      • UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
      • 10X Genomics: Deanna Church
      • Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath
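
The contig and scaffold N50 values reported in the assembly tables on slides 7, 10, 11, and 23 can be reproduced from a list of sequence lengths. The following is a minimal Python sketch, assuming the usual N50 definition (the length L such that sequences of length L or longer cover at least half of the total assembly size). The function name and the toy lengths are illustrative assumptions, not data or code from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths in Mb (illustrative only, not data from the talk).
before_breaks = [40, 30, 20, 5, 3, 2]
# Same toy assembly after breaking the 40 Mb contig into 25 Mb + 15 Mb,
# mimicking the effect of splitting a chimeric contig.
after_breaks = [30, 25, 20, 15, 5, 3, 2]

print(n50(before_breaks))  # 30
print(n50(after_breaks))   # 25
```

In the toy data, one break leaves the total size unchanged and lowers the N50, which is the same direction of change as the before/after rows on slide 11.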

Editor's Notes

  • As part of our work as a member of the Genome Reference Consortium, we have been working to improve the current reference, GRCh38. In doing this work, we have found that a few regions of the genome are still not fully resolved, and a few genes are still not optimally represented for all individuals or ancestries, although we have fixed quite a few of them in GRCh38. The reference is composed of DNA from many individuals, so there are regions where allelic diversity and structural variation present challenges in the assembly. Many of the newer technologies have allowed for improvements to the reference genome, but there is still a need for additional high-quality human genome assemblies to fully cover the range of genetic diversity in humans.
  • This is a great example of how allelic differences can cause assembly problems. The gene UGT2B17 is known to be copy number variant: some individuals have one copy of this gene and other individuals lack it altogether. In Build 36, the clones in this region were all from the RP11 sample, which happened to be heterozygous for this indel. The blue, red, and black boxes represent the clone path through the region, the yellow blocks indicate annotated segmental duplication, and there were two genes annotated in this region. In Build 36 we were representing both the insertion and deletion alleles in the assembly.

    By removing the black clones from the path, we were able to close the gap. We then created an alternate allele from the black clones, which required sequencing one additional clone, so that both the insertion and deletion alleles are represented in GRCh37.
    This changes our understanding of the biology of this region: we closed the gap that existed, we removed a falsely annotated duplication, and there is really only one copy of this gene present in the assembly, with an allelic variant.
    This example shows how multiple haplotypes in the assembly can cause problems.
  • In the past few years we have been working on a project funded by NIH to sequence additional human reference genomes. These are the samples we have been working on. Originally we planned to sequence 5 diploid genomes and 2 haploid genomes. Currently we are working on our 10th diploid genome. These genomes will help to add diversity to the reference.
  • As part of this project, we are generating ~60X coverage of PacBio long-read data. We will do a de novo assembly of that PacBio data. Then we are using a variety of additional tools to help inform the assembly. BioNano has been very useful in helping to scaffold the assembly as well as to identify potential mis-assemblies. We are also starting to work with 10X Genomics data. For the initial few genomes, we were targeting difficult-to-assemble regions of the genome by sequencing BACs. Once the BACs are incorporated, we plan to align all of this data to the reference very stringently to produce chromosomal AGPs. The end product will be a very high-quality whole-genome assembly.
  • To date, data has been generated for 2 haploid genomes and 10 diploid genomes, all at ~60X coverage or higher. We have a lot of data and a lot of assemblies to work with. For 2 of the diploid genomes, we have chromosome-level assemblies; the rest are at the contig level.
    **2 additional genomes: data will be generated soon
  • Here are the assembly stats for all of the genomes we have assembled to date. All of these genomes are being assembled using Falcon. With the newer version of Falcon, we are seeing a huge increase in contiguity. In most cases, the N50 has increased threefold. Assemblies were generated with FALCON-integrate 1.7.5, varying the minimum seed read length and min_cov.
  • We generate multiple assemblies, varying the minimum seed read length and min_cov. From those 20 or so assemblies, we pick the best one using assembly stats and alignment to the Bionano maps. The raw data is generally submitted a month or so after data production is completed.
  • This diagram shows the workflow for the Bionano Irys system. It is a nanochannel technology in which long DNA molecules are nicked and labeled at specific recognition sites. You end up with nick sites along the DNA molecules, similar to a restriction digest, with the added benefit that the nicks are kept in context relative to one another. Once the data is assembled, the resulting genome maps can be used for SV detection, gap sizing, assembly QC, and scaffolding.
  • Here is an example of one of the BioNano hybrid scaffolds that was generated. The top line in green represents the hybrid scaffold, the first set of blue bars represents the PacBio assembly, and the S2 lines represent the BioNano map contigs. From this you can see that these two technologies are very complementary. This is a snapshot of a 10 Mb portion of a larger hybrid scaffold.
  • BioNano has also identified a second enzyme that nicks well for human genomes. You can create a second map with the other enzyme, and software improvements coming in the next month will make it possible to align your sequence to both maps. This will increase the N50 twofold.

    used 14k_120_120_1
  • Once we identified which assembly version we wanted to improve, we aligned it to the BioNano maps, generated SV calls, and performed hybrid scaffolding. During the hybrid scaffolding process, conflicts are identified. For this genome, 51 conflicts were identified. We looked at the sequence alignments for all of these conflicts and found 35 to be PacBio assembly errors. We also looked through the translocation and complex SV calls, as well as a rough alignment of the assembly to GRCh38, to identify contigs that crossed chromosomes. After looking through all of this data, we made 69 breaks. Breaking the obvious chimeric contigs only brought the N50 down a little, to 25.7 Mb.

    Sequence alignments were reviewed for all conflicts. To narrow down the complex and translocation calls, we first looked at the BioNano alignments in IrysView.
  • This is the same PacBio contig as in the last slide, only this time it is compared to GRCh38. In the top panel you can see
  • We have also been using the Bionano maps to identify variation between our genomes and the reference. In this example, there are 2 haplotypes in the BN map compared to GRCh38; this appears to be a heterozygous inversion in NA19240.
  • Here is the initial set of SV calls for our genomes compared to GRCh38. These include both homozygous and heterozygous calls.
  • I have a few examples of what we have been seeing in these assemblies. We decided to take a look at the MHC region of NA19240. This is a comparison of the BioNano map of NA19240 to the reference; the reference is in green and the NA19240 BN map is in blue. From the BN map, it looks like there is a ~65 kb insertion.
  • We then aligned the contig from Jason’s most recent assembly to the current reference as well as the alts. This is the region that corresponds to the insertion in the BN map, so from this initial look, it appears there is an insertion here in this assembly. We need to look at it further to evaluate whether this would be a useful addition to the alts that are already present.
  • CYP2D6 is a very diverse genomic region that has implications for drug metabolism. In collaboration with the Pharmacogenomics Research Network (PGRN), we have sequenced multiple alleles in this region using fosmid libraries created from ethnically diverse individuals. Within the region, there is also another CYP gene, CYP2D7, and a pseudogene called CYP2D8, with common repeats interspersed between the gene and pseudogene copies, facilitating genomic rearrangements. The gene CYP2D6 and the associated pseudogenes are shown here, along with some of the different alleles we have sequenced.
  • This is the alignment of NA12878 to GRCh38, as well as the genes aligned to NA12878.
  • It was important, especially in highly variable regions of the genome, to capture both alleles from the diploid samples. Our collaborators at PacBio have generated an Unzip assembly for us. Here is a diagram showing how a standard Falcon assembly misses allelic variation, whereas Falcon Unzip should capture the variation that is present. You end up with a set of very contiguous primary contigs and a set of smaller haplotigs that contain the variation.
  • The Gambian assembly was done at PacBio for us, and this version is polished.
  • I want to acknowledge all of the collaborators on this project and all of the work that has gone into it thus far.
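
The notes for slide 11 describe using a rough alignment of the assembly to GRCh38 to find contigs that cross chromosomes. The following is a minimal Python sketch of that kind of screen, not the pipeline used in the talk. It assumes the alignments have already been parsed into (contig, chromosome, aligned_bases) tuples. The function name, the 10% minimum-fraction threshold, and the toy records are all hypothetical.

```python
from collections import defaultdict


def flag_cross_chromosome_contigs(alignments, min_fraction=0.10):
    """Return {contig: [chromosomes]} for contigs whose aligned bases
    are split across more than one reference chromosome.

    A chromosome counts only if it holds at least `min_fraction` of the
    contig's aligned bases (hypothetical threshold, chosen to ignore
    small spurious hits).
    """
    per_contig = defaultdict(lambda: defaultdict(int))
    for contig, chrom, aligned_bases in alignments:
        per_contig[contig][chrom] += aligned_bases

    flagged = {}
    for contig, by_chrom in per_contig.items():
        total = sum(by_chrom.values())
        if total == 0:
            continue  # nothing aligned, nothing to judge
        major = sorted(c for c, n in by_chrom.items() if n / total >= min_fraction)
        if len(major) > 1:
            flagged[contig] = major
    return flagged


# Toy records (illustrative only, not data from the talk).
toy_alignments = [
    ("contig_A", "chr1", 6_000_000),
    ("contig_A", "chr4", 4_000_000),
    ("contig_B", "chr2", 9_900_000),
    ("contig_B", "chr7", 50_000),
]
print(flag_cross_chromosome_contigs(toy_alignments))
# {'contig_A': ['chr1', 'chr4']}
```

A flagged contig is only a candidate for a break. As the notes describe, each call was reviewed against the sequence and BioNano alignments before any break was made.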