2. The human reference assembly: past, present and future
Valerie Schneider, Ph.D.
NCBI
16 October 2018
https://genomereference.org
3. Credits
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• Karyn Meltz Steinberg
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
• Carol Bult
• Derek Stemple
GRC
Tina Graves-Lindsay
Tayebeh Rezaie
Kerstin Howe
Paul Flicek
Monte Westerfield
Curators
Developers
Deanna Church
Richard Durbin
Laura Clarke
Twitter: @GenomeRef
Announcements: grc-announce@ncbi.nlm.nih.gov
4. • Past: Reference assembly 101
• Present: Curating GRCh38
• Future: What’s next for the
reference?
Outline
5. The reference is a Sanger-seq’d, clone-based assembly
BAC insert
BAC vector
Shotgun sequence clone
Assemble clone
GAPS
Finish (via PCR)
Minimal Clone Tiling Path
Define consensus from switch points of adjacent clones
Ordering the Path
Fingerprint maps
Genetic linkage maps
Radiation hybrid maps
Reference Assembly 101
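The tiling-path step above can be pictured as an interval-cover problem. The following is a minimal sketch, not the procedure the HGP or GRC actually used: it assumes each clone is already placed on a map as a (name, start, end) interval, and it uses a standard greedy cover to pick few clones. The function name, the tuple layout and the example coordinates are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical clone record: (name, start, end) on a physical map,
# half-open coordinates. Real placements come from fingerprint,
# linkage and radiation hybrid maps, which are not modeled here.
Clone = Tuple[str, int, int]


def minimal_tiling_path(clones: List[Clone], region_start: int,
                        region_end: int) -> List[Clone]:
    """Greedy interval cover: among clones starting at or before the
    position covered so far, take the one that reaches furthest."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Consider every not-yet-seen clone that overlaps or abuts
        # the covered position.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            # No clone extends the path: this is a gap.
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path


if __name__ == "__main__":
    example = [("A", 0, 100), ("B", 50, 120), ("C", 90, 250),
               ("D", 240, 300), ("E", 110, 260)]
    print([name for name, _, _ in minimal_tiling_path(example, 0, 300)])
    # prints ['A', 'C', 'D']
```

Where no clone extends the path, the sketch raises an error rather than guessing, which loosely mirrors the GAPS label on the slide.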
6. Today’s reference assembly does not represent:
1. The most common allele/haplotype
2. The longest allele/haplotype
3. The ancestral allele/haplotype
It represents the clone-based sequence available from the HGP
Reference Assembly 101
• Highly contiguous
• High sequence accuracy (finished: error rate <10^-5)
• Haploid mosaic
7. The reference is composed of sequences from multiple individuals
Reference Assembly 101
9. Reference Assembly 101
Original assembly model: compress into a consensus
[Figure: sequences from haplotype 1 and sequences from haplotype 2 compressed into one chromosome; labels: Ref Assembly, Gene1, false gap]
Current assembly model: represent both haplotypes
[Figure labels: Reference, Sample, chromosome, alt loci scaffold, alt scaffold, many, Gene1, Gene2]
GRCh38 (Dec. 2013)
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 alt loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb
• >150 genes only represented on alt loci
10. • Past: Reference assembly 101
• Present: Curating GRCh38
• Future: What’s next for the
reference?
Outline
13. Curating GRCh38
[Figure: bar chart "Resolution Odds (n=215/385)", axis 0 to 70; categories: Gap, Clone, Variation, Localization, Path, Missing Seq, GRC Housekeeping, Unknown*; legend: likely, potential, unlikely]
*Unknown: typically a bp discrepancy for which there is currently insufficient info to distinguish clone error vs. variation
Poster 444F (3:00-4:00)
Latest improvements in the
human genome reference
assembly (GRCh38)
Tayebeh Rezaie
14. • Past: Reference assembly 101
• Present: Curating GRCh38
• Future: What’s next for the
reference?
Outline
15. • Ideals:
• Provides chromosome context for
any common human sequence
>500 bp
• Supports unambiguous data
interpretation at all clinically
relevant loci
• Imparts no systematic error/bias in
genome-wide analyses
• Real-World:
• Community interest
• Resources for curation
HGP GRC
What’s next?
Defining “Done”
17. Initial Falcon Assembly
Collection of 40-50 Falcon Assemblies w/ varied parameters
Select “Best” Assembly: combo of N50/length
Error Correction Quiver/Pilon
Identify chimeric contigs from BioNano alignment
Submit to GenBank
What’s next?
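The "combo of N50/length" selection step can be sketched as follows. The slide does not say how N50 and length were combined, so the ranking here (highest contig N50 first, total length as tie-breaker) is an assumption for illustration only, and the function and assembly names are hypothetical.

```python
from typing import Dict, List


def n50(contig_lengths: List[int]) -> int:
    """Contig N50: the length L such that contigs of length >= L
    together hold at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs")


def pick_best(assemblies: Dict[str, List[int]]) -> str:
    """Assumed ranking, not the pipeline's documented rule:
    prefer higher N50, then larger total length."""
    return max(
        assemblies,
        key=lambda name: (n50(assemblies[name]), sum(assemblies[name])),
    )


if __name__ == "__main__":
    candidates = {
        "params_a": [10, 5, 5],  # N50 = 10
        "params_b": [8, 8, 4],   # N50 = 8
    }
    print(pick_best(candidates))  # prints params_a
```

In practice the candidate lengths would come from the 40-50 Falcon runs; a weighted score or a minimum-length cutoff would be equally plausible readings of the slide.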
18. Data Source | Origin | Assembly Accession | Status
CHM1 | NA (haploid) | GCA_001297185.2 | Contig Assembly Submitted
CHM13 | NA (haploid) | GCA_002884485.1 | Contig Assembly Submitted
NA19240 | Yoruban | GCA_001524155.4 | Chr-level Assembly Submitted
HG00514 | Han Chinese | GCA_002180035.2 | Chr-level Assembly Submitted
NA12878 | European | GCA_002077035.3 | Chr-level Assembly Submitted
HG00733 | Puerto Rican | GCA_002208065.1 | Contig Assembly Submitted
HG01352 | Colombian | GCA_002209525.1 | Contig Assembly Submitted
NA19434 | Luhya | GCA_002872155.1 | Contig Assembly Submitted
HG02059 | Kinh-Vietnamese | GCA_003070785.1 | Contig Assembly Submitted
HG03486 | Mende | GCA_003086635.1 | Contig Assembly Submitted
HG02818 | Gambian | GCA_003574075.1 | Contig Assembly Submitted
HG03807 | Bengali | GCA_003601015.1 | Contig Assembly Submitted
HG04217 | Telugu | | Assembly Assessment
HG02106 | Peruvian | | Assembly Assessment
HG00268 | Finnish | | Assembly Assessment
NA19836 | African American | | Assembly Underway
HG03125 | Esan | | Data Generation Underway
What’s next?
19. Sample | Population | Ungapped Size | # Contigs | Contig N50 | Sequencer
NA19240 | Yoruban | 2.87 Gb | 2521 | 29.1 Mb | RSII
HG00733 | Puerto Rican | 2.88 Gb | 3580 | 22.2 Mb | RSII
NA12878 | European | 2.85 Gb | 3220 | 16.8 Mb | RSII
HG01352 | Colombian | 2.88 Gb | 3120 | 22.8 Mb | RSII
HG00514 | Han Chinese | 2.87 Gb | 3190 | 25.3 Mb | RSII
NA19434 | Luhya | 2.86 Gb | 3123 | 21.5 Mb | RSII
HG02059 | Kinh-Vietnamese | 2.90 Gb | 3180 | 25.3 Mb | RSII
HG02818 | Gambian | 2.88 Gb | 3267 | 22.5 Mb | RSII
HG03486 | Mende | 2.87 Gb | 3465 | 5.3 Mb* | Sequel
HG03807 | Bengali | 2.86 Gb | 3103 | 8.4 Mb** | Sequel + RSII
Poster 442W (3:00-4:00)
New methods for discovery and
interpretation of allelic diversity
in human genomes
Bob Fulton
What’s next?
20. GRC curation challenge: which insertion(s) to represent?
Indel polymorphism at GRCh38 gap
[Figure: optical map confirmation of WGS contigs; labels: GAP, Indel region (x3)]
What’s next?
21. • Add representation for acrocentric chromosome short-arm sequences (McStay)
• Improved centromere representations (Miga)
• New clone paths for immune regions (improve existing paths and add diversity) (Watson)
• Community outreach
– Workshops
– Website: Help Desk/FAQs
• Your Data?
What’s next?
[Figure: Growth of accessioned (full) human genome assemblies in NCBI Assembly database, 2003 to 2018, y-axis 0 to 100; GRCh38 release marked; n=91. For updated assemblies, only date of initial submission is counted.]
22. GRCh39?
• Remain committed to mission to provide the
best representation of the human genome to
meet basic and clinical research needs
• Make GRCh38 updates publicly available at
regular intervals in the form of patch releases
• Indefinitely postpone GRCh39 while evaluating
new models and sequence content for the
human reference assembly currently in
development
What’s next?
23. MGI Assemblies Acknowledgements
The McDonnell Genome Institute at
Washington University in St. Louis
Susan Dutcher
Bob Fulton
Wes Warren
Ira Hall
Karyn Meltz Steinberg
Derek Albracht
Milinn Kremitzki
Susan Rock
Chad Tomlinson
Patrick Minx
Chris Markovic
Eddie Belter
Lee Trani
Sara Kohlberg
University of Washington
Evan Eichler
NCBI
Valerie Schneider
BioNano Genomics
Alex Hastie
Pacific Biosciences
Nick Sisneros
Sarah Kingan
Luke Hickey
Greg Concepcion
UCSF
Pui-Yan Kwok
Yvonne Lai
Chin Lin
Catherine Chu
10X Genomics
Deanna Church
Nationwide Children’s Hospital
Richard Wilson
Vince Magrini
Sean McGrath
UCSC
Ed Green