What's new and what's next for the human reference assembly?


Presentation at the 2019 ASHG GRC/GIAB workshop describing the history of the human reference genome, current curation efforts, and future plans, and how all three relate to efforts to produce a human pan-genome.

Published in: Science

  1. GRC/GIAB Workshop: Getting the Most from the Reference Assembly and Reference Materials
     Oct 15, 2019: 9 am-12 pm
  2. What's new and what's next for the human reference assembly?
     Valerie Schneider, Ph.D., NCBI
     15 October 2019
  3. Credits
     • GRC: Valerie Schneider, Kerstin Howe, Tina Graves, Paul Flicek, Tayebeh Rezaie, Nathan Bouk, Hsiu-Chuan Chen, Jo Wood, Joanna Collins, Sarah Pelan, Will Chow, James Torrance, Derek Albracht, Milinn Kremitzki, Laura Clarke
     • Thanks to many GRC collaborators
     • Twitter: @GenomeRef
     • Announcements:
     • Funding:
       – This work was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.
       – The European Molecular Biology Laboratory.
       – The Wellcome Trust, UK.
       – The MGI was supported by National Institutes of Health grants 5U54HG003079, 5U41HG007635 and 5U24HG009081.
  4. Outline
     • What’s the reference?
     • What’s new: GRCh38.p13 through today
     • What’s next?
  5. What’s the reference?
     [Figure labels: Anonymous samples; Individual 1A; Individual 2A; Individual 1B; Haploid mosaic assembly]
     • Highly contiguous
       – Contig N50: 57.9 Mb
     • Highly accurate
       – Per-bp error: <10^-5
     Today’s reference assembly does not represent:
       1. The most common allele/haplotype
       2. The longest allele/haplotype
       3. The ancestral allele/haplotype
     The reference represents the available Human Genome Project sequence
     1 library ~
  6. What’s the reference? Assembly Model Evolution
     • Linear model: impacts on assembly building and analysis
     • GRCh37/GRCh38 reference assembly model: represent both haplotypes
     [Figure labels: Sample; Reference Assembly; Gene1; Gene2; chromosome; alt scaffold; false gap; Sequences from haplotype 1; Sequences from haplotype 2; many; alt loci scaffold 1; alt loci scaffold 2; alt loci scaffold 3]
  7. Reference Assembly 101: Assembly Model Evolution
     • Patch release: No change to chromosome coordinates
     • Assembly nomenclature: GRCh38.p$
     [Figure labels: chromosome; novel patch scaffold (ALLELIC); fix patch scaffold (PREFERRED)]
  8. Outline
     • What’s the reference?
     • What’s new: GRCh38.p13 through today
     • What’s next?
  9. What’s new?: GRCh38.p13
     GRCh38.p13 (cumulative stats)
     • 113 Fix patches: Add >3.88 Mb novel sequence
       – 43 added in p13
     • 72 Novel patches: Add >1.1 Mb novel sequence
       – 2 added in p13
     • >25 genes affected
     Tayebeh Rezaie: Weds, 9 am, Grand Ballroom B, Level 3, Convention Center
  10. What’s new?: NOR Distal Junction Regions
      • Brian McStay Lab
      • DJ sequences are >99% identical between acrocentrics
  11. What’s new?: NOR Distal Junction Regions
      • Updated chr 21 p-arm
      [Figure labels: <<<<CENTROMERE; TELOMERE>>>>; Reduced clone path (unordered/unoriented); GRCh38 chr 21 alignment; rDNA; NOR DJ]
  12. What’s new?: Gap Closures
      Data Source  Origin           Assembly Accession  Status                        # Contigs  Contig N50
      CHM1         NA (haploid)     GCA_001297185.2     Contig Assembly Submitted     3,709      26.5 Mb
      CHM13        NA (haploid)     GCA_002884485.1     Contig Assembly Submitted     1,916      29.2 Mb
      NA19240      Yoruban          GCA_001524155.4     Chr-level Assembly Submitted  1,826      29.1 Mb
      HG00514      Han Chinese      GCA_002180035.3     Chr-level Assembly Submitted  2,877      29.4 Mb
      NA12878      European         GCA_002077035.3     Chr-level Assembly Submitted  3,220      16.8 Mb
      HG00733      Puerto Rican     GCA_002208065.1     Contig Assembly Submitted     3,580      22.2 Mb
      HG01352      Colombian        GCA_002209525.1     Contig Assembly Submitted     3,120      22.8 Mb
      NA19434      Luhya            GCA_002872155.1     Contig Assembly Submitted     3,123      21.5 Mb
      HG02059      Kinh-Vietnamese  GCA_003070785.1     Contig Assembly Submitted     3,180      25.3 Mb
      HG03486      Mende            GCA_003086635.1     Contig Assembly Submitted     3,465      5.3 Mb (Sequel)
      HG02818      Gambian          GCA_003574075.1     Contig Assembly Submitted     3,267      22.5 Mb
      HG03807      Bengali          GCA_003601015.1     Contig Assembly Submitted     3,103      8.4 Mb (Sequel)
      HG04217      Telugu           GCA_007821485.1     Contig Assembly Submitted     4,249      3.4 Mb (Sequel)
      HG02106      Peruvian         GCA_008583285.1     Contig Assembly Submitted     2,636      3.2 Mb (Sequel)
      HG00268      Finnish          GCA_008065235.1     Contig Assembly Submitted     1,995      20.0 Mb (Sequel)
      Compressed diploid assemblies unless otherwise noted
  13. What’s new?: Gap Closures
      • GRCh38 gaps to be evaluated (n=196)
        – Excludes biological gaps and WGS intra-scaffold gaps
      • Evaluation: Alignment of 8 collapsed diploid assemblies
        – 26 gaps spanned by all 8 WGS assemblies, with constant insert length (spanning sequence included in GRCh38.p13)
        – 3 gaps spanned by all 8 WGS assemblies, with variable insert length
        – 24 gaps spanned by only a subset of the 8 assemblies
        – Remainder of gap evaluations still in progress
      [Figure labels: GRCh38; Clone; WGS; PacBio Assembly; Assessed as one gap]
  14. Outline
      • What’s the reference?
      • What’s new: GRCh38.p13 through today
      • What’s next?
  15. What’s next?
      • Unresolved genome issues
      [Figure labels: Current curation status; Resolution likelihoods as determined by the GRC review; n=234]
      Slide: Tayebeh Rezaie
  16. What’s next?
      Data Source  Origin            Status
      NA19836      African American  Assembly Submission Underway
      NA20502      Tuscan            Assembly Submission Underway
      NA20862      Gujarati Indian   Assembly Submission Underway
      HG03125      Esan              Assembly Assessment Underway
      HG02970      Esan              Assembly Assessment Underway
      NA21309      Maasai            Assembly Assessment Underway
      NA20300      African American  Assembly Assessment Underway
      NA20129      African American  Assembly Assessment Underway
      HG01567      Peruvian          Assembly Assessment Underway
      HG03719      Telugu            Assembly Assessment Underway
      HG00766      Chinese Dai       Assembly Assessment Underway
      NA12395      CEPH              Assembly Underway
      NA19030      Luhya             Assembly Underway
      NA19734      Mexican Ancestry  Assembly Underway
      HG03736      Sri Lankan        Assembly Underway
  17. What’s next?
      [Figure: Growth of accessioned complete human genome assemblies in NCBI Assembly database, 2003 to 2019, y-axis 0 to 120. Annotations: GRCh38 released; n=98. For updated assemblies, only date of initial submission is counted.]
      GRCh38.p14 (2020)
      • Engagement with T2T consortium
        – Chr X
        – Other missing sequences
      • Continued engagement with Gold Genomes project
        – Gap closures
        – New Novel patches
      • New clone paths for immune regions (improve existing paths and add diversity)
        – MHC
        – IgH
      • Chr 21 p-arm sequence review and update
        – Not possible as patches?
      • Community outreach
        – Workshops
        – Website: Help Desk/FAQs
      • Your Data?
  18. What’s next?
      • Consortium Goals
        – Produce 350 human whole genome assemblies
        – Fully phased diploid assemblies
        – Identify SVs between samples and current Reference GRCh38
        – Incorporate those SVs into the reference, likely as a graph representation
  19. Outline
      • What’s the reference?
      • What’s new: GRCh38.p13 through today
      • What’s next?
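
Three quantities from the slides can be made concrete with a short Python sketch: the contig N50 metric quoted on slide 5, the patch nomenclature on slide 7 (for example GRCh38.p13), and the gap-evaluation tally on slide 13. This is illustrative only and is not GRC tooling. All function names are hypothetical, the name pattern is an assumption inferred from names such as GRCh38.p13, and the tally assumes the three categories on slide 13 are disjoint and all count against n=196.

```python
import re
from typing import Iterable, Mapping, NamedTuple


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-negative lengths


class AssemblyName(NamedTuple):
    major: str  # e.g. "GRCh38"
    patch: int  # e.g. 13; 0 when no ".pN" suffix is present


# Assumed pattern: "GRC" + species letter(s) + major number + optional ".pN".
_NAME = re.compile(r"(GRC[a-z]+\d+)(?:\.p(\d+))?")


def parse_assembly_name(name: str) -> AssemblyName:
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized assembly name: {name!r}")
    return AssemblyName(match.group(1), int(match.group(2) or 0))


def same_coordinates(a: str, b: str) -> bool:
    """Patch releases do not change chromosome coordinates (slide 7),
    so two names share coordinates when their major release matches."""
    return parse_assembly_name(a).major == parse_assembly_name(b).major


def gaps_remaining(total: int, classified: Mapping[str, int]) -> int:
    """Gaps not yet classified, assuming the categories are disjoint."""
    done = sum(classified.values())
    if done > total:
        raise ValueError("classified gaps exceed the total")
    return total - done


if __name__ == "__main__":
    print(n50([8, 8, 4, 3, 3, 2, 2, 2]))
    print(parse_assembly_name("GRCh38.p13"))
    print(same_coordinates("GRCh38.p12", "GRCh38.p13"))
    # Counts taken from slide 13.
    slide_13 = {"all 8, constant insert": 26,
                "all 8, variable insert": 3,
                "subset of the 8": 24}
    print(gaps_remaining(196, slide_13))
```

Under those assumptions, the slide 13 counts leave 196 - (26 + 3 + 24) = 143 gaps still to be evaluated.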