GRC/GIAB Workshop:
Getting the Most from the Reference
Assembly and Reference Materials
Oct 17: 1-4 pm
Valerie Schneider (NCBI): GRCh38 assembly basics and updates
Tina Lindsay (MGI): Reference-grade human assemblies
Karen Miga (UCSC): Centromere assemblies
BREAK (15 min)
Benedict Paten (UCSC): Building human variation graphs
Fritz Sedlazeck (BCM): Structural Variation Characterization Across the Human Genome and Populations
Justin Zook (NIST): GIAB benchmarks for difficult variants
GRCh38 assembly basics and updates
Valerie Schneider, Ph.D.
NCBI
17 October 2017
https://genomereference.org
Twitter: @GenomeRef
Announcements: grc-announce@ncbi.nlm.nih.gov
Outline
• Assembly basics
• GRCh38 updates
• Taking advantage of the data
Assembly Basics
Reference Assembly Basics
(For updated assemblies, only date of initial submission is counted)
Legend: Other assemblies; GRCh38 (reference)
Sanger-seq’d, clone-based assembly
BAC insert
BAC vector
Shotgun sequence clone
Assemble clone
GAPS
Finish (via PCR)
Minimal Clone Tiling Path
Define consensus from switch points of adjacent clones
Consequences:
• Highly contiguous
• High sequence accuracy (error rate < 10^-5)
• Haploid mosaic
Ordering the Path
Fingerprint maps
Genetic linkage maps
Radiation hybrid maps
Reference Assembly Basics
HuRef
SOAPdenovo
NA12878
ALLPATHS
NA12878
The likelihood a base is seq’d, by coverage
Lander and Waterman (1988) Genomics

Coverage   Not sequenced   Sequenced
1X         37%             63%
5X         0.6%            99.4%
10X        0.005%          99.995%
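The coverage figures on this slide follow from the Poisson model of Lander and Waterman (1988): if reads land uniformly at random with average coverage c, the chance that a given base is never sequenced is about e^-c. The sketch below is illustrative only. The function name is hypothetical, and uniform random sampling is an idealization that real libraries do not fully meet.

```python
import math


def fraction_unsequenced(coverage: float) -> float:
    """Expected fraction of bases with zero reads at the given average coverage.

    Assumes read starts are uniformly random (Poisson model), so P(0 reads) = e^-c.
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return math.exp(-coverage)


if __name__ == "__main__":
    for c in (1, 5, 10):
        missed = fraction_unsequenced(c)
        print(f"{c}X coverage: not sequenced {missed:.4%}, sequenced {1 - missed:.4%}")
```

Under this model the values come out close to the slide's figures at 1X, 5X, and 10X.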
Contig N50
MHAP
CHM1
Chaisson and Eichler (2015), with modification
Measure of contiguity. Half of the assembly
is in contigs this length or greater.
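The N50 definition above can be sketched in a few lines of Python. The function name and the toy contig lengths are made up for illustration and are not taken from any of the assemblies on the slide.

```python
def contig_n50(lengths):
    """Return the contig N50: the length L such that contigs of length >= L
    together contain at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    toy_contigs = [80, 70, 50, 40, 30, 20, 10]  # hypothetical lengths, total 300
    print(contig_n50(toy_contigs))
```

In the toy example the two longest contigs (80 + 70) already hold half of the 300 total bases, so the N50 is 70.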
Reference Assembly Basics
AK1
HX1
NA12878_prelim
Why all this matters:
Longer haplotype blocks
Fewer collapsed repeats & segmental duplications
Better annotation
More robust mapping target
Reference Assembly Basics
Today’s reference assembly does not represent:
1. The most common allele/haplotype
2. The longest allele/haplotype
3. The ancestral allele/haplotype
It represents the sequence available from the HGP
Reference Assembly Basics
Gene1 Gene2
Gene1
Sample
Ref
Assembly
Reference assembly influence
Slide Credit: Deanna Church
Reference Assembly Basics
75% off-target alignments
25% no alignment
chromosome
variant
PLoS Biology (Jul 5, 2011)
Sequences from haplotype 1
Sequences from haplotype 2
Reference Assembly Basics
Original assembly model:
compress into a consensus
false
gap
chromosome
Current assembly model:
represent both haplotypes
alt loci scaffold
chromosome
many
Gene1 Gene2
Sample
Gene2
Gene1
chromosome
alt scaffold
Reference
GRCh38 (Dec. 2013)
• 178 regions with alt loci: 2% of chromosome
sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to
chromosomes
• Average alt length = 400 kb, max = ~5 Mb
• >150 genes only represented on alt loci
Reference Assembly Basics
Reference Assembly Basics
• Closed gaps
• Targeted base fixes
• Corrected path errors
• Addition of missing paralogs
• Better representation of variation
• Better annotation
• Modeled centromeres
• Genome Research 27(5):849-864
(2017)
• PubMed: 28396521
GRCh38
• Changed coordinates
• Remapping challenges
• Alt Loci Usability
• Allelic duplication/Aligners
• Reporting multiple locations
• Variant analysis
• Clinical validation
Chart: 2016 growth in SRA submission over prior year (GRCh38, GRCh37)
Outline
• Assembly basics
• GRCh38 updates
• Taking advantage of the data
GRCh38 Updates
GRCh38: Dec. 2013
(n=1797)
(n=1396) (n=401)
GRCh38 Updates
(rare allele analysis)
GRCh38 Updates
chromosome
novel patch scaffold
alt loci scaffold
chromosome
fix patch scaffold
Patch release: No change to chromosome coordinates
Assembly nomenclature: GRCh38.p$
GRCh38.p11
• 64 FIX, 59 NOVEL
• Added >1.5 Mb novel
sequence
• >20 genes affected
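Because a patch release leaves chromosome coordinates unchanged, a pipeline can treat two assembly names as coordinate-compatible when they share the same major version. A minimal sketch of that check follows, assuming names of the form seen on this slide (for example GRCh38.p11). The regular expression, the function names, and the choice to treat a name without a suffix as patch 0 are assumptions made for illustration, not a GRC-defined API.

```python
import re

# Assumed name shape: "GRCh<major>" optionally followed by ".p<patch>".
ASSEMBLY_NAME = re.compile(r"^GRCh(\d+)(?:\.p(\d+))?$")


def parse_assembly_name(name):
    """Split an assembly name into (major version, patch number)."""
    match = ASSEMBLY_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized assembly name: {name!r}")
    major = int(match.group(1))
    # Assumption: an unpatched name such as "GRCh38" is treated as patch 0.
    patch = int(match.group(2)) if match.group(2) else 0
    return major, patch


def same_chromosome_coordinates(name_a, name_b):
    """True when both names share a major version, so patches do not shift coordinates."""
    return parse_assembly_name(name_a)[0] == parse_assembly_name(name_b)[0]


if __name__ == "__main__":
    print(parse_assembly_name("GRCh38.p11"))
    print(same_chromosome_coordinates("GRCh38", "GRCh38.p11"))
    print(same_chromosome_coordinates("GRCh37", "GRCh38"))
```

The check only covers chromosome coordinates. Fix and novel patch scaffolds are added sequences with their own coordinates, so they still need to be handled separately.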
GRCh38 Updates
GRCh38: 5S rRNA cluster under-represented (19 copies)
GRCh38 patch: 5S rRNA cluster valid representation (35 copies)
Poster 423F (11:30-12:30)
Updates to the human
reference genome assembly
Tayebeh Rezaie
GRCh38 Updates
• Ideals:
• Chromosome context for any
common human sequence >500 bp
• Unambiguous data interpretation at
all clinically relevant loci
• No systematic error/bias in
genome-wide analyses
• Real-World:
• Community interest
• Resources for curation
• GRCh39
• Substantial added value
• User must-haves
Outline
• Assembly basics
• GRCh38 & updates
• Taking advantage of the data
Accessing the Data
Assembly Stats
https://genomereference.org
Accessing the Data
https://www.ncbi.nlm.nih.gov/genome/gdv/
Learn more about GDV:
Data CoLab #159
Weds 10:30-11:00
Poster 1531W
Weds 2:00-3:00
Accessing the Data
Assembly Support Track Set
Accessing the Data
http://www.ensembl.org/GRC Tracks
Accessing the Data
ftp://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt
Outline
• Assembly basics
• GRCh38 updates
• Taking advantage of the data
Credits
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• Karyn Meltz Steinberg
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
GRC
Tina Graves-Lindsay
Tayebeh Rezaie
Kerstin Howe
Richard Durbin
Paul Flicek
Laura Clarke
Deanna Church
Curators!
Developers!
