Getting the Most from the Reference Assembly
Valerie Schneider, Ph.D.
NCBI
6 October 2015
http://genomereference.org
Twitter: @GenomeRef
grc-announce@ncbi.nlm.nih.gov
Outline
• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data
Reference Assembly Basics
Sims et al. (2014) Nat Rev Genet. 15(2):121-32
[Figure: 30x vs. 1x coverage]
Reference Assembly Basics
Lander and Waterman (1988) Genomics
Coverage        Not sequenced   Sequenced
1X Coverage     37%             63%
5X Coverage     0.6%            99.4%
10X Coverage    0.005%          99.995%
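The percentages above follow from the Lander-Waterman (Poisson) approximation, in which the expected fraction of bases covered by no read at fold coverage c is e^-c. A minimal sketch, assuming uniformly random read placement; the function name is illustrative and not from the slides:

```python
import math


def fraction_unsequenced(coverage: float) -> float:
    """Expected fraction of bases hit by no read at the given fold coverage.

    Poisson approximation: P(zero reads over a base) = e^(-coverage).
    """
    return math.exp(-coverage)


for c in (1, 5, 10):
    missed = fraction_unsequenced(c)
    print(f"{c}X: {missed:.3%} not sequenced, {1 - missed:.3%} sequenced")
```

Real libraries are not uniformly random (GC bias, repeats, cloning bias), so observed gaps are typically larger than this idealized estimate.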
Reference Assembly Basics
FINISHED?
BAC insert
BAC vector
Shotgun sequence
Assemble
Fold sequence
Gaps
Deeper sequence coverage rarely resolves all gaps
GAPS
“Finishers” go in to manually fill the gaps, often by PCR
Clone-based assemblies
Reference Assembly Basics
Minimal Tiling Path
Human assemblies available in the NCBI assembly database
http://www.ncbi.nlm.nih.gov/assembly
Reference Assembly Basics
Oct. 2014: 13 assemblies
Oct. 2015: 25 assemblies
[Figure labels: YRI, CEU, CEU, CHB]
Reference Assembly Basics
Reads:   Sanger   Sanger   Illumina   Illumina   PacBio (older)   PacBio (newer)
Method:  clone    WGS      WGS        WGS        WGS              WGS
N50: measure of continuity. Half of the bases in the assembly are in contigs of this length or greater.
Why all this matters:
Longer haplotype blocks
Fewer collapsed repeats & segmental duplications
Improved annotation
More robust mapping target
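For reference, N50 can be computed by sorting contigs from longest to shortest and walking down the list until the running total reaches half of the assembly's bases. A minimal sketch, with made-up contig lengths used only for illustration:

```python
def n50(lengths):
    """Return the contig length at which the cumulative sum of contigs,
    taken longest first, reaches at least half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (total 290): 80 + 70 = 150 passes the halfway mark of 145.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))  # prints 70
```

Because N50 is weighted by bases rather than by contig count, a handful of very long contigs can dominate it, which is why it is usually read alongside total assembly length and gap counts.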
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
Current Assembly model: represent both haplotypes
GRC Assembly Model
many
Assembly (e.g. GRCh38)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
GRC Assembly Model
ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
GRC Assembly Model
Alt loci alignments are an integral part of the assembly model
alignment to chr + scaffold sequence = Alt
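One way to picture the model is as a small data structure: each alt locus is a scaffold plus its alignment to a chromosome, grouped under a named genomic region. The sketch below is hypothetical throughout; the class names, fields, region names, and coordinates are placeholders for illustration and do not reflect GRC file formats or real GRCh38 values.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AltLocus:
    """An alt scaffold plus its alignment to the chromosome (hypothetical fields)."""
    name: str
    scaffold_length: int
    chrom: str
    chrom_start: int  # placeholder coordinate, inclusive
    chrom_end: int    # placeholder coordinate, inclusive


@dataclass
class GenomicRegion:
    """A named region of the primary assembly that can hold many alt loci."""
    name: str
    alt_loci: List[AltLocus] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    regions: Dict[str, GenomicRegion] = field(default_factory=dict)

    def add_alt(self, region_name: str, alt: AltLocus) -> None:
        region = self.regions.setdefault(region_name, GenomicRegion(region_name))
        region.alt_loci.append(alt)

    def alts_at(self, chrom: str, position: int) -> List[str]:
        """Names of alt loci whose chromosome alignment spans the position."""
        return [
            alt.name
            for region in self.regions.values()
            for alt in region.alt_loci
            if alt.chrom == chrom and alt.chrom_start <= position <= alt.chrom_end
        ]


# Placeholder names and coordinates, for illustration only.
assembly = Assembly("example")
assembly.add_alt("REGION_A", AltLocus("ALT_1", 400_000, "chrA", 1_000_000, 1_400_000))
assembly.add_alt("REGION_A", AltLocus("ALT_2", 350_000, "chrA", 1_200_000, 1_550_000))
print(assembly.alts_at("chrA", 1_300_000))  # prints ['ALT_1', 'ALT_2']
```

Keeping the alignment on the alt locus itself reflects the point that the alignment is part of the assembly, not something each user has to recompute.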
GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb
GRCh38: Alt Loci
Alignment Legend: no alignment, mismatch, deletion
Tracks: chromosome, alt/patch, reads
On-target alignment
Off-target alignments (n=122,922)
GRCh38: Assembly Stats
http://genomereference.org
GRCh38 vs. GRCh37
GRCh38: Annotation Stats
GRCh38 Base Updates
Targeted PCR/WGS: n=91
GRCh38 Centromeres
Miga et al., Genome Res. 2014 Apr;24(4):697-707
GRCh38 Novel Sequence
Assembly (e.g. GRCh38.p1)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
• Patches
• Genomic Region (ABO)
• Genomic Region (FOXO6)
• Genomic Region (FCGBP)
Assembly Updates
Patches
                                                  FIX               NOVEL
Scaffold status at next major assembly release    -- (integrated)   ALT LOCI
Treat as:                                         Preferred         Allelic
GRCh38.p4
• 55 Patches: >400 kb novel sequence
• 37 FIX
• 18 NOVEL
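The patch handling rules can be expressed as a simple lookup. A minimal sketch, assuming the patch table reads as FIX = integrated at the next major release and treated as preferred, NOVEL = becomes an alt locus and treated as allelic; the dictionary and function names are hypothetical:

```python
# Handling rules keyed by patch type (assumed reading of the patch table).
PATCH_RULES = {
    "FIX": {
        "treat_as": "preferred",
        "next_major_release": "integrated into the chromosome",
    },
    "NOVEL": {
        "treat_as": "allelic",
        "next_major_release": "becomes an alt locus",
    },
}


def describe_patch(patch_type: str) -> str:
    """Summarize how a patch of the given type should be handled."""
    kind = patch_type.upper()
    rule = PATCH_RULES[kind]
    return (
        f"{kind} patch: treat as {rule['treat_as']}; "
        f"at next major release, {rule['next_major_release']}"
    )


# Patch counts quoted for GRCh38.p4.
p4_counts = {"FIX": 37, "NOVEL": 18}
print(sum(p4_counts.values()), "patches in total")
for kind in p4_counts:
    print(describe_patch(kind))
```

In practice this means an analysis pipeline would favor a FIX patch over the chromosome sequence it corrects, while treating a NOVEL patch like any other alternate allele.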
Assembly Updates
Learn more about assembly updates at the GRC poster: 1834W (6-7 pm)
Accessing the Data
http://genomereference.org
GRC Assembly Management
Accessing the Data
http://www.ensembl.org/
ftp://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt
Accessing the Data
NCBI Variation Viewer
http://www.ncbi.nlm.nih.gov/variation/view/
Accessing the Data
Learn more about viewing GRCh38 at NCBI: 1748T (12-1 pm)
http://www.ncbi.nlm.nih.gov/genome/tools/remap
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
GRC Credits
http://genomereference.org

Ashg2015 schneider final