AGBT2017 Reference Workshop: Lindsay

  1. 1. Creating Reference-Grade Human Genome Assemblies
     Tina Graves Lindsay
     Reference Genome Workshop at AGBT, Feb 13, 2017
  2. 2. The Human Reference is a Work in Progress!
     • The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.
     • GRCh38 is composed of DNA from several individual humans.
     • Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.
     • New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.
     • Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.
  3. 3. UGT2B17 – Conflicting Alleles
     [Figure: chr4 tiling paths across the UGT2B17 region (Xue Y et al, 2008); a GAP is marked in the figure]
     • NCBI36 NC_000004.10 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7 (genes labeled: TMPRSS11E, TMPRSS11E2)
     • GRCh37 NC_000004.11 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7 (gene labeled: TMPRSS11E)
     • GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7 (gene labeled: TMPRSS11E2)
  4. 4. Samples to be Sequenced
  5. 5. Sequencing Plan
  6. 6. Definitions of Genome Level
     • Platinum Genome
       • Haploid genome source
       • Contiguous, haplotype-resolved representation of entire genome
       • BAC library available
     • Gold Genome
       • Diploid genome source
       • Part of a trio
       • Parents will be sequenced to help haplotype resolve some regions
       • BAC libraries available
       • Targeted regions sequenced using these BAC libraries
       • Will contain some haplotype resolved regions
  7. 7. CHM1: A Key Resource for Improving the Reference
     • CHM1 cell line established from a haploid hydatidiform mole (complete, paternal; 46XX) (U. Surti)
     • CHORI-17 BAC library (P. de Jong)
     • CHORI-17 BAC end sequences (n=325,659)
     • CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)
     • CHORI-17 BACs
       • >750 have been sequenced
       • 664 of them are in GenBank as phase 3 sequence
     • CHM1 WGS assembly
       • Initial assembly produced from >100X coverage of Illumina data
       • Initial PacBio assembly produced using ~54X of P5/C3 PacBio data
       • Latest PacBio assembly produced using ~60X of P6/C4 PacBio data
  8. 8. Assembly Assessment Methods
     • Assemblies run through NCBI QA pipeline
       • Assessed for contiguity, annotation, and concordance with the finished BACs
     • Assembly-to-assembly alignments can be generated between each PB assembly and GRCh38
     • BioNano Genome Map
       • SV calls generated from comparing the BioNano data to each of the assemblies
       • Hybrid scaffolding conflicts will also point out potential assembly errors
     • Alignment of the Illumina reads back to each of the assemblies
       • Heterozygous calls are likely indicative of a collapse in the assembly (for the haploid genomes)
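
The last check on this slide (heterozygous calls in a haploid assembly pointing to a collapse) can be illustrated with a small Python sketch. This is only an illustration and not the workshop's pipeline: the function name, the fixed window size, and the call-count threshold are all assumptions, and the input is simply a list of positions on one contig where a variant caller reported a heterozygous call.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def flag_dense_het_windows(
    het_positions: Iterable[int],
    window_size: int = 10_000,   # assumed window size, not from the slides
    min_calls: int = 5,          # assumed threshold, not from the slides
) -> List[Tuple[int, int, int]]:
    """Return (window_start, window_end, n_calls) for fixed windows on one
    contig that contain at least `min_calls` heterozygous calls.

    In a haploid sample, true heterozygous sites should not exist, so a
    cluster of such calls is a candidate collapsed region worth inspecting.
    """
    counts = Counter(pos // window_size for pos in het_positions)
    return [
        (idx * window_size, (idx + 1) * window_size, n)
        for idx, n in sorted(counts.items())
        if n >= min_calls
    ]


# Made-up positions on a hypothetical contig.
toy_calls = [120, 5_000, 10_500, 10_900, 11_200, 11_900, 12_500, 30_000]
print(flag_dense_het_windows(toy_calls, window_size=10_000, min_calls=3))
# -> [(10000, 20000, 5)]
```

Isolated calls are ignored here because they are more likely to be sequencing or mapping noise; a real analysis would also look at read depth, which this sketch does not model.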
  9. 9. Hybrid Scaffolds – PacBio and BioNano
     (Seq Assem = sequence assembly; BN Hybrid = BioNano hybrid scaffolds)
     Assembly                  BioNano map                     Seq Assem   Seq Assem        Seq Assem        BN Hybrid     BN Hybrid       BN Hybrid
                                                               # Contigs   Contig N50 (Mb)  Total Size (Gb)  # Scaffolds   Scaff N50 (Mb)  Total Size (Gb)
     CHM1 (P6) GCA_001297185   MGI CHM1 map (Jason’s version)  3641        26.9             2.99             161           47.6            2.84
     CHM1 (P6) GCA_001307025   MGI CHM1 map (Adam’s version)   4850        20.6             2.94             221           40.04           2.82
  10. 10. Hybrid Scaffold (figure labels: Hybrid Scaffold, PacBio Contigs, BioNano Contigs)
  11. 11. 1q21 Region – GRCh38 vs GCA_001297185 (figure labels: GRCh38, GCA_001297185, Seg Dup Track; scale: 1 Megabase)
  12. 12. 1q21 Region – GRCh38 vs GCA_001297185 (figure labels: GRCh38, GCA_001297185, Seg Dup Track, 99.9+% identity, 99.1% identity; scale: 1 Megabase)
  13. 13. CHM1 – Next Steps
     • Currently running Pilon on GCA_001297185 for improved base-pair accuracy
     • Based on alignment of BioNano data as well as comparisons to GRCh38, we will make additional breaks where needed
     • Incorporate all finished BACs
     • Final alignment to GRCh38 in order to produce chromosome AGPs and submit
  14. 14. Samples to be Sequenced
  15. 15. Genome Status
     Data Source  Origin           Level of Coverage  Status
     CHM1         NA               Platinum           Assembly Improvement
     CHM13        NA               Platinum           In Assembly Queue
     NA19240      Yoruban          Gold               Assembly Submission
     HG00733      Puerto Rican     Gold               Assessing New Assembly
     HG00514      Han Chinese      Gold               Assessing New Assembly**
     NA12878      European         Gold               Assessing New Assembly
     HG01352      Colombian        Gold               Assessing New Assembly
     HG02818      Gambian          Gold               Assembly Underway
     HG02059      Kinh-Vietnamese  Gold               In Assembly Queue
     NA19434      Luhya            Gold               In Assembly Queue
     HG04217      Telugu           Gold               Data Production Underway
     **100x coverage was generated for the Han Chinese sample
  16. 16. Assembly Stats
     Genome   Total Size              # Contigs               Contig N50              Contig N50
              (older version Falcon)  (older version Falcon)  (older version Falcon)  (newer version Falcon)
     NA19240  2.75 Gb                 3569                    6.0 Mb                  26.4 Mb
     HG00733  2.84 Gb                 3715                    7.6 Mb                  22-23 Mb
     NA12878  2.80 Gb                 4412                    4.49 Mb                 14-15 Mb
     HG01352  2.85 Gb                 4080                    8.22 Mb                 20-24 Mb
     HG00514  2.85 Gb                 2808                    10.0 Mb                 22-24 Mb
     HG02818  2.82 Gb                 3300                    7.24 Mb                 Assembly underway
  17. 17. First Gold Genome - NA19240
     • NA19240 – Yoruban sample
     • Generated >70X raw P6/C4 RSII PacBio data
                          Initial Assembly Stats  Latest Assembly Stats
     # Seq Contigs        3569                    2889
     Max Contig Length    20,393,869 bp           75,769,079 bp
     Total Assembly Size  2,745,634,789 bp        2,874,720,146 bp
     N50                  6,003,115 bp            26,385,265 bp
     N90                  848,151 bp              2,559,914 bp
     N95                  345,457 bp              710,070 bp
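
For readers unfamiliar with the N50, N90 and N95 values reported above, the following Python sketch shows the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches the given percentage of the assembly size. The function name `nx` and the toy lengths are illustrative assumptions; the slides do not say which tool produced the reported numbers.

```python
from typing import Iterable


def nx(lengths: Iterable[int], x: float) -> int:
    """Return the Nx contig length: walking contigs from longest to shortest,
    the length of the contig at which the cumulative size first reaches
    x percent of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    threshold = sum(ordered) * x / 100.0
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # only reached if x > 100


# Made-up contig lengths, not data from the talk.
toy_contigs = [10, 8, 6, 4, 2]
print(nx(toy_contigs, 50))  # -> 8
print(nx(toy_contigs, 90))  # -> 4
print(nx(toy_contigs, 95))  # -> 2
```

Because N90 and N95 require covering more of the assembly, they are always less than or equal to N50, which matches the ordering of the values in the table.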
  18. 18. Assembly QC and Submission Steps
     • Multiple Falcon assemblies
     • Using stats and alignment to Bionano, pick the best assembly
     • Quiver and Pilon on best assembly
     • Use Bionano to identify mis-assemblies and scaffold assembly
     • Submit scaffold-level AGPs to GenBank
     • Run through NCBI assembly QA pipeline
     • Evaluate and curate output of QA pipeline
     • Generate final chromosome-level AGPs and submit
     • Annotation of chromosome-level assembly
  19. 19. Hybrid Stats
     (Seq Assem = sequence assembly; BN Hybrid = BioNano hybrid scaffolds)
     Sample   Seq Assem  Seq Assem        Seq Assem        BN Hybrid    BN Hybrid          BN Hybrid
              # Contigs  Contig N50 (Mb)  Total Size (Gb)  # Scaffolds  Scaffold N50 (Mb)  Total Size (Gb)
     NA19240  2889       26.3             2.87             218          39.9               2.82
     NA12878  3551       15.1             2.86             270          28.7               2.83
     HG00514  3190       24.2             2.88             208          37.0               2.83
  20. 20. NA19240 Assembly Assessment
                          Initial Calls  Breaks made
     Conflicts            51             35
     Translocation SV     321            16
     Complex              123            9
     Nucmer Alignments                   9
     Total breaks made                   69

                    Contig #  Contig N50  Total Assembly Size
     Before Breaks  2889      26.4 Mb     2.87 Gb
     After Breaks   2951      25.7 Mb     2.87 Gb
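
The before/after table shows the expected effect of breaking contigs at suspected misassemblies: more contigs, a somewhat lower N50, and an unchanged total size. The Python sketch below reproduces that pattern on toy data. The data layout (contig name mapped to length, and contig name mapped to a list of break coordinates) and the helper names are assumptions made for illustration, not the format used in the actual curation.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length of the contig at which the running total (longest first)
    reaches half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


def apply_breaks(contigs: Dict[str, int], breaks: Dict[str, List[int]]) -> List[int]:
    """Split each contig at its break coordinates and return piece lengths.
    Coordinates outside the open interval (0, length) are ignored."""
    pieces: List[int] = []
    for name, length in contigs.items():
        cuts = sorted(p for p in set(breaks.get(name, [])) if 0 < p < length)
        start = 0
        for cut in cuts:
            pieces.append(cut - start)
            start = cut
        pieces.append(length - start)
    return pieces


# Hypothetical contigs and one hypothetical break in ctg1.
toy_contigs = {"ctg1": 100, "ctg2": 60, "ctg3": 40}
toy_breaks = {"ctg1": [70]}

before = list(toy_contigs.values())
after = apply_breaks(toy_contigs, toy_breaks)
print(len(before), n50(before), sum(before))  # -> 3 100 200
print(len(after), n50(after), sum(after))     # -> 4 60 200
```

Breaking never removes sequence in this model, so the total size is preserved by construction; only contiguity statistics change.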
  21. 21. NA19240 contig break
  22. 22. Chimeric PacBio Contig (figure labels: GRCh38 – Chr 1, GRCh38 – Chr 4, NA19240 Contig, Segmental Duplications)
  23. 23. NA19240 Bionano Map Compared to GRCh38
     SV Type         Number of Calls
     Insertion       1795
     Deletion        756
     End             71
     Inversions      8
     Complex         62
     Translocations  6
  24. 24. NA19240 Inversion Compared to GRCh38 (figure labels: GRCh38, NA19240 Bionano Contigs)
  25. 25. NA19240 MHC Region (figure labels: GRCh38, Bionano Contigs)
  26. 26. NA19240 MHC Region (figure labels: NA19240, Reference Alts, ~65 kb insertion)
  27. 27. Finished BACs Resolve This Region (figure labels: GRCh38, PB Assembly, BAC Alignments, Seg Dup)
  28. 28. Spanning Reference Gaps
     • HG00514 80X assembly
     • Initial assessment found 75 potential gap-spanning contigs
     • On closer inspection, only 32 are real gap-spanning contigs; together they span 40 gaps
  29. 29. True Gap Spanner (figure labels: GRCh38, HG00514 Contig)
  30. 30. False Gap Spanner (figure labels: False Alignment, Seg Dup, True Alignment, 7 kb, 3 kb, 10 kb)
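
The distinction between true and false gap spanners on the slides above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure used in the talk: a contig is treated as spanning a reference gap only if it has enough aligned sequence outside segmental duplications on both sides of the gap. The function names, the flank size, and the `min_unique` threshold are hypothetical, and the sketch ignores whether the two flanking alignments are colinear on the contig.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open reference coordinates [start, end)


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def unique_aligned_bp(region: Interval, alignments: List[Interval],
                      seg_dups: List[Interval]) -> int:
    """Aligned bases inside `region` that are not covered by a segmental
    duplication. Assumes alignments do not overlap each other and
    seg_dups do not overlap each other."""
    total = 0
    for aln in alignments:
        lo, hi = max(aln[0], region[0]), min(aln[1], region[1])
        if lo >= hi:
            continue
        dup_bp = sum(overlap((lo, hi), dup) for dup in seg_dups)
        total += (hi - lo) - dup_bp
    return total


def spans_gap(gap: Interval, alignments: List[Interval], seg_dups: List[Interval],
              flank: int = 10_000, min_unique: int = 5_000) -> bool:
    """True if the contig's alignments anchor in non-duplicated sequence on
    both sides of the gap (flank and min_unique are assumed thresholds)."""
    left = (max(0, gap[0] - flank), gap[0])
    right = (gap[1], gap[1] + flank)
    return (unique_aligned_bp(left, alignments, seg_dups) >= min_unique
            and unique_aligned_bp(right, alignments, seg_dups) >= min_unique)


# Toy coordinates: one contig aligned on both sides of a hypothetical gap.
gap = (100_000, 150_000)
alignments = [(80_000, 100_000), (150_000, 170_000)]
print(spans_gap(gap, alignments, seg_dups=[]))                    # -> True
print(spans_gap(gap, alignments, seg_dups=[(150_000, 158_000)]))  # -> False
```

In the second call, most of the right-hand anchor lies in a duplicated region, so the alignment could equally belong to another copy and the contig is not counted as a gap spanner.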
  31. 31. Short Term Future Plans
     • Lots of assemblies to analyze!
     • Generate the latest Falcon assemblies for all samples
     • Improve those assemblies
       • Identifying misassemblies
       • Making the breaks where needed
       • Scaffolding the assemblies
       • Incorporating BACs as they are finished
     • Create chromosomal AGPs
     • Submit to GenBank
  32. 32. Longer Term Future Work
     • Better Utilization of the Reference
       • Mapping Strategies
         • Graph based alignments
         • Other alt-aware read mapping strategies
       • Alternative reference data display challenges: when and how to present data
         • Alt alleles?
         • Full reference sequences
         • Haplo-resolved (10X)?
     • Wet Lab Improvements
       • Haplo-resolved strategies (10X)
       • Clone-based work replacements? Hyb 10X or PacBio?
       • New long read technologies
         • PacBio Sequel
         • Oxford Nanopore
  33. 33. Acknowledgements
     • The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg
     • University of Washington: Evan Eichler
     • NCBI: Valerie Schneider
     • University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
     • BioNano Genomics: Alex Hastie
     • Pacific Biosciences: Jason Chin, Nick Sisneros
     • UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
     • NHGRI: Adam Phillippy, Sergey Koren
     • 10X Genomics: Deanna Church
     • Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath