
Schneider grc workshop_final


Overview of the GRCh38 human reference assembly, presented by Valerie Schneider at the GRC/GIAB workshop at the ASHG 2018 meeting.

Published in: Science

  1. 1. GRC/GIAB Workshop: Getting the Most from the Reference Assembly and Reference Materials Oct 16, 2018: 1-4 pm
  2. 2. The human reference assembly: past, present and future Valerie Schneider, Ph.D. NCBI 16 October 2018
  3. 3. Credits
      GRCh38 Collaborators: NCBI RefSeq and gpipe annotation team, Havana annotators, Karen Miga, Karyn Meltz Steinberg, David Schwartz, Steve Goldstein, Mario Caceres, Giulio Genovese, Jeff Kidd, Peter Lansdorp, Mark Hills, David Page, Jim Knight, Stephan Schuster, 1000 Genomes
      GRC SAB: Rick Myers, Granger Sutton, Evan Eichler, Jim Kent, Roderic Guigo, Jan Korbel, Liz Worthey, Matthew Hurles, Richard Gibbs, Carol Bult, Derek Stemple
      GRC: Tina Graves-Lindsay, Tayebeh Rezaie, Kerstin Howe, Paul Flicek, Monte Westerfield, Curators, Developers, Deanna Church, Richard Durbin, Laura Clarke
      Twitter: @GenomeRef
      Announcements:
  4. 4. Outline
      • Past: Reference assembly 101
      • Present: Curating GRCh38
      • Future: What’s next for the reference?
  5. 5. Reference Assembly 101
      The reference is a Sanger-seq’d, clone-based assembly.
      • Clone sequencing: shotgun sequence clone; assemble clone; finish (via PCR)
      • Minimal Clone Tiling Path: define consensus from switch points of adjacent clones
      • Ordering the Path: fingerprint maps, genetic linkage maps, radiation hybrid maps
      [Figure labels: BAC insert, BAC vector, GAPS]
  6. 6. Reference Assembly 101
      Today’s reference assembly does not represent:
      1. The most common allele/haplotype
      2. The longest allele/haplotype
      3. The ancestral allele/haplotype
      It represents the clone-based sequence available from the HGP:
      • Highly contiguous
      • High sequence accuracy (finished: <10^-5)
      • Haploid mosaic
  7. 7. Reference Assembly 101
      The reference is composed of sequences from multiple individuals.
  8. 8. Reference Assembly 101
      [Figure labels: Sample, Ref Assembly, Gene1, Gene2]
      Slide credit: Deanna Church
  9. 9. Reference Assembly 101
      Original assembly model: compress into a consensus
      [Figure labels: sequences from haplotype 1, sequences from haplotype 2, chromosome, false gap, Ref Assembly, Gene1]
      Current assembly model: represent both haplotypes
      [Figure labels: chromosome, alt loci scaffold, many, alt scaffold, Sample, Reference, Gene1, Gene2]
      GRCh38 (Dec. 2013):
      • 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
      • 261 alt loci: 3.6 Mb novel sequence relative to chromosomes
      • Average alt length = 400 kb, max = ~5 Mb
      • >150 genes only represented on alt loci
  10. 10. Outline
      • Past: Reference assembly 101
      • Present: Curating GRCh38
      • Future: What’s next for the reference?
  11. 11. Curating GRCh38
      GRCh38 (Dec 2013):
      • >1000 reported issues resolved
      • Closed gaps
      • Targeted base fixes
      • Corrected path errors
      • Addition of missing paralogs
      • Better representation of variation
      • Better annotation substrate
      • Modeled centromeres
      Genome Research 27(5):849-864 (2017)
  12. 12. Curating GRCh38
      Patch release: no change to chromosome coordinates
      Assembly nomenclature: GRCh38.p$
      [Figure labels: chromosome, novel patch scaffold, fix patch scaffold]
      GRCh38.p12:
      • 70 FIX, 70 NOVEL
      • Added >2.2 Mb novel sequence
      • >20 genes affected
      Since ASHG 2017: 113 resolved
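The patch nomenclature on slide 12 can be illustrated with a small parser. This is a minimal sketch, not a GRC or NCBI tool: the regular expression is an assumption inferred only from the two names shown on the slide (GRCh38 and GRCh38.p12), and the function names are hypothetical. It encodes the slide's point that a patch release leaves chromosome coordinates unchanged, so two names with the same major release share one coordinate system.

```python
import re
from typing import NamedTuple, Optional


class AssemblyName(NamedTuple):
    major: str            # major release, e.g. "GRCh38"
    patch: Optional[int]  # patch number, e.g. 12; None for the initial release


# Assumed pattern, inferred from the names "GRCh38" and "GRCh38.p12" only.
_NAME = re.compile(r"^(GRCh\d+)(?:\.p(\d+))?$")


def parse_assembly_name(name: str) -> AssemblyName:
    """Split an assembly name into its major release and patch number."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized assembly name: {name!r}")
    major, patch = match.groups()
    return AssemblyName(major, int(patch) if patch is not None else None)


def same_chromosome_coordinates(a: str, b: str) -> bool:
    """Patch releases do not change chromosome coordinates, so names that
    share a major release are treated as sharing a coordinate system."""
    return parse_assembly_name(a).major == parse_assembly_name(b).major
```

For example, `same_chromosome_coordinates("GRCh38", "GRCh38.p12")` is true, while comparing a GRCh37 name with a GRCh38 name is false.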
  13. 13. Curating GRCh38
      [Chart: Resolution Odds (n=215/385) by issue type (Gap, Clone, Variation, Localization, Path, Missing Seq, GRC Housekeeping, Unknown*), each classed as likely, potential or unlikely]
      *Unknown: typically a bp discrepancy for which there is currently insufficient info to distinguish clone error vs. variation
      Poster 444F (3:00-4:00): Latest improvements in the human genome reference assembly (GRCh38), Tayebeh Rezaie
  14. 14. Outline
      • Past: Reference assembly 101
      • Present: Curating GRCh38
      • Future: What’s next for the reference?
  15. 15. What’s next? Defining “Done”
      • Ideals:
        • Provides chromosome context for any common human sequence >500 bp
        • Supports unambiguous data interpretation at all clinically relevant loci
        • Imparts no systematic error/bias in genome-wide analyses
      • Real-World:
        • Community interest
        • Resources for curation
      [Figure labels: HGP, GRC]
  16. 16. What’s next?
  17. 17. What’s next?
      • Initial Falcon Assembly
      • Collection of 40-50 Falcon Assemblies w/ varied parameters
      • Select “Best” Assembly: combo of N50/length
      • Error Correction: Quiver/Pilon
      • Identify chimeric contigs from BioNano alignment
      • Submit to GenBank
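Slide 17 says the "best" of the 40-50 Falcon runs is chosen by a combination of N50 and length, but does not give the rule. The sketch below shows one plausible way to combine the two, purely as an assumption: discard runs whose total length is far from an expected genome size, then take the highest contig N50 among the rest. The class, function, parameter names and the 5% tolerance are all hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CandidateAssembly:
    name: str          # label for one assembler run (hypothetical)
    contig_n50: int    # contig N50 in bp
    total_length: int  # total assembled length in bp


def select_best(
    candidates: Sequence[CandidateAssembly],
    expected_length: int,
    tolerance: float = 0.05,  # assumed cutoff, not from the slides
) -> CandidateAssembly:
    """Pick the candidate with the highest N50 among those whose total
    length is within `tolerance` of the expected genome size."""
    plausible = [
        c for c in candidates
        if abs(c.total_length - expected_length) <= tolerance * expected_length
    ]
    if not plausible:
        raise ValueError("no candidate has a plausible total length")
    return max(plausible, key=lambda c: c.contig_n50)
```

The length filter guards against a run that reaches a high N50 only by over-collapsing or duplicating sequence; other weightings of the two metrics would be equally defensible.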
  18. 18. What’s next?
      Data Source | Origin | Assembly Accession | Status
      CHM1 | NA (haploid) | GCA_001297185.2 | Contig Assembly Submitted
      CHM13 | NA (haploid) | GCA_002884485.1 | Contig Assembly Submitted
      NA19240 | Yoruban | GCA_001524155.4 | Chr-level Assembly Submitted
      HG00514 | Han Chinese | GCA_002180035.2 | Chr-level Assembly Submitted
      NA12878 | European | GCA_002077035.3 | Chr-level Assembly Submitted
      HG00733 | Puerto Rican | GCA_002208065.1 | Contig Assembly Submitted
      HG01352 | Colombian | GCA_002209525.1 | Contig Assembly Submitted
      NA19434 | Luhya | GCA_002872155.1 | Contig Assembly Submitted
      HG02059 | Kinh-Vietnamese | GCA_003070785.1 | Contig Assembly Submitted
      HG03486 | Mende | GCA_003086635.1 | Contig Assembly Submitted
      HG02818 | Gambian | GCA_003574075.1 | Contig Assembly Submitted
      HG03807 | Bengali | GCA_003601015.1 | Contig Assembly Submitted
      HG04217 | Telugu | | Assembly Assessment
      HG02106 | Peruvian | | Assembly Assessment
      HG00268 | Finnish | | Assembly Assessment
      NA19836 | African American | | Assembly Underway
      HG03125 | Esan | | Data Generation Underway
  19. 19. What’s next?
      Sample | Population | Ungapped Size | # Contigs | Contig N50 | Sequencer
      NA19240 | Yoruban | 2.87 Gb | 2521 | 29.1 Mb | RSII
      HG00733 | Puerto Rican | 2.88 Gb | 3580 | 22.2 Mb | RSII
      NA12878 | European | 2.85 Gb | 3220 | 16.8 Mb | RSII
      HG01352 | Colombian | 2.88 Gb | 3120 | 22.8 Mb | RSII
      HG00514 | Han Chinese | 2.87 Gb | 3190 | 25.3 Mb | RSII
      NA19434 | Luhya | 2.86 Gb | 3123 | 21.5 Mb | RSII
      HG02059 | Kinh-Vietnamese | 2.90 Gb | 3180 | 25.3 Mb | RSII
      HG02818 | Gambian | 2.88 Gb | 3267 | 22.5 Mb | RSII
      HG03486 | Mende | 2.87 Gb | 3465 | 5.3 Mb* | Sequel
      HG003087 | Bengali | 2.86 Gb | 3103 | 8.4 Mb** | Sequel + RSII
      Poster 442W (3:00-4:00): New methods for discovery and interpretation of allelic diversity in human genomes, Bob Fulton
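The Contig N50 column on slide 19 uses the standard contiguity statistic: the length L such that contigs of length L or longer contain at least half of the assembled bases. The following is a minimal sketch of that standard definition; the function name is hypothetical, and the slides do not say which tool produced the reported values.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (in bp).

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    if ordered[-1] <= 0:
        raise ValueError("contig lengths must be positive")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for positive lengths")
```

With contigs of 2, 3, 4, 5 and 6 bp (20 bp in total), the two longest contigs already cover 11 bp, so the N50 is 5.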
  20. 20. What’s next?
      GRC curation challenge: which insertion(s) to represent?
      [Figure: indel polymorphism at GRCh38 gap; optical map confirmation of WGS contigs; labels: GAP, indel region]
  21. 21. What’s next?
      • Add representation for acrocentric chromosome short-arm sequences (McStay)
      • Improved centromere representations (Miga)
      • New clone paths for immune regions (improve existing paths and add diversity) (Watson)
      • Community outreach
        • Workshops
        • Website: Help Desk/FAQs
      • Your Data?
      [Chart: Growth of accessioned (full) human genome assemblies in NCBI Assembly database, 2003-2018, n=91, with the GRCh38 release marked. For updated assemblies, only date of initial submission is counted.]
  22. 22. What’s next? GRCh39?
      • Remain committed to the mission of providing the best representation of the human genome to meet basic and clinical research needs
      • Make GRCh38 updates publicly available at regular intervals in the form of patch releases
      • Indefinitely postpone GRCh39 while evaluating new models and sequence content for the human reference assembly currently in development
  23. 23. MGI Assemblies Acknowledgements
      • The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Ira Hall, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg
      • University of Washington: Evan Eichler
      • NCBI: Valerie Schneider
      • BioNano Genomics: Alex Hastie
      • Pacific Biosciences: Nick Sisneros, Sarah Kingan, Luke Hickey, Greg Concepcion
      • UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
      • 10X Genomics: Deanna Church
      • Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath
      • UCSC: Ed Green