Understanding the reference assembly
Valerie Schneider
NCBI
26 October 2016
http://www.biorxiv.org/content/early/2016/08/30/072116
Dilthey et al.; Paten et al.
Scientific Models
• Distinguishing features of the human reference
assembly
• Implications for genomic analyses and tools
• Where do you get assembly-relevant data?
Outline
Assembly Basics
Sanger-seq’d, clone-based assembly
BAC insert
BAC vector
Shotgun sequence clone
Assemble
GAPS
Finish
Minimal Tiling Path
Define switch points for adjacent components
(haploid mosaic)
Most contiguous
Highest sequence quality
Today’s reference assembly does not represent:
1. The most common allele
2. The longest allele
3. The ancestral allele
Assembly Basics
It represents the sequence available from the HGP
GRC Assembly Model
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
Current Assembly model: represent both (many) haplotypes
Figure: Assembly (e.g. GRCh38): Primary Assembly Unit; Non-nuclear assembly unit (e.g. MT); PAR; Genomic Regions (MHC, UGT2B17, MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
GRC Assembly Model
Figure labels: alternate loci assembly units ALT 1 to ALT 7
The alignments of the alternate loci scaffolds to the
chromosomes are an integral part of the assembly
and can be downloaded from GenBank with the
assembly sequences
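Alignments of this kind are commonly distributed as GFF3 records, where a `Gap` attribute lists the alignment as space-separated operations such as `M` (match), `I` (insertion) and `D` (deletion). The following is a minimal sketch of summarizing such a value, assuming that encoding. The function name and the example string are made up for illustration and are not taken from a real GRCh38 alignment file.

```python
import re
from collections import Counter


def summarize_gap(gap: str) -> Counter:
    """Sum the lengths of each operation in a GFF3-style Gap attribute value."""
    totals = Counter()
    # Each operation is a single letter followed by its length, e.g. "M1200".
    for op, length in re.findall(r"([MIDFR])(\d+)", gap):
        totals[op] += int(length)
    return totals


# Hypothetical Gap value (assumption), not a real alt-locus alignment.
example_gap = "M1200 I35 M800 D12 M400"
summary = summarize_gap(example_gap)
print(dict(summary))
```

Totals like these give a quick sense of how much of an alternate locus scaffold aligns to the chromosome and how much falls in insertions or deletions.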
Figure: Assembly (e.g. GRCh38.p1): Primary Assembly Unit; Non-nuclear assembly unit (e.g. MT); alternate loci assembly units ALT 1 to ALT 7; PAR; Genomic Regions (MHC, UGT2B17, MAPT); Patches, with Genomic Regions (ABO, FOXO6, FCGBP)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
GRC Assembly Model
Patches
                                    FIX               NOVEL
Scaffold status at next
major assembly release              -- (integrated)   ALT LOCI
Treat as:                           Preferred         Allelic
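The FIX/NOVEL distinction above can be written down as a small lookup, which is sometimes convenient when a pipeline has to decide how to handle a patch scaffold. This is only a sketch: the class, dictionary and function names are hypothetical, and the mapping simply restates the slide (fix patches are integrated at the next major release and treated as preferred; novel patches become alt loci and are treated as allelic).

```python
from enum import Enum


class PatchType(Enum):
    FIX = "FIX"
    NOVEL = "NOVEL"


# Restates the slide; the wording of the values is illustrative.
STATUS_AT_NEXT_MAJOR_RELEASE = {
    PatchType.FIX: "integrated into the chromosome (no scaffold)",
    PatchType.NOVEL: "alt locus",
}
TREAT_AS = {
    PatchType.FIX: "preferred",
    PatchType.NOVEL: "allelic",
}


def describe_patch(patch_type: PatchType) -> str:
    """Return a one-line summary of how a patch of this type is handled."""
    return (
        f"{patch_type.value} patch: treat as {TREAT_AS[patch_type]}; "
        f"at next major release: {STATUS_AT_NEXT_MAJOR_RELEASE[patch_type]}"
    )


for patch_type in PatchType:
    print(describe_patch(patch_type))
```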
1q32 1q21 1p21
Dennis et al., 2012
GRC Assembly Model
GRC: Assembly Model
GRCh38
• 178 regions with alt loci: 2% of chromosome
sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to
chromosomes
GRCh38.p9
• 96 Patches: >1 Mb novel sequence
• 48 FIX
• 48 NOVEL
GRCh38: Alt Loci
Alignment legend: no alignment, mismatch, deletion
Tracks: chromosome, alt/patch, reads
On-target alignment
Off-target alignments (n=122,922)
GRCh38: Alt Loci
PLoS Biol. 2011 Jul;9(7):e1001091
Anatomy of an alt
AC012314.8
CU151838.1
ALT LOCI
AC012314.8
AC245052.3 CHR. 19
Due to anchor components, alternate loci contain some sequence
that is redundant to the primary assembly unit
GRCh38 Model Centromeres
Karen Miga (Kent Lab, UCSC)
WGS WGS WGS
GRCh38 Centromeres
Miga et al., Genome Res. 2014 Apr;24(4):697-707
GRCh38: Where’s the data?
GRCh38 Sequences for alignment pipelines
Assembly Sequence and Statistics Reports
Assembly Regions Report: Alts, Patches and Centromeres
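A regions report of this kind is a tab-delimited text file, so counting alts and patches per region takes only a few lines. The sketch below assumes a header line beginning with `#` that names the columns, including a `Scaffold-Role` column. The sample rows, region names and accessions are placeholders invented for illustration, and the exact column layout should be checked against the downloaded report.

```python
import io
from collections import Counter

# Made-up sample in the assumed layout; not real GRCh38 data.
SAMPLE_REPORT = """\
# Region-Name\tChromosome\tChromosome-Start\tChromosome-Stop\tScaffold-Role\tScaffold-GenBank-Accn\tScaffold-RefSeq-Accn\tAssembly-Unit
REGION_A\t1\t1000\t5000\talt-scaffold\tXX000001.1\tNA\tALT_REF_LOCI_1
REGION_B\t2\t2000\t9000\tfix-patch\tXX000002.1\tNA\tPATCHES
REGION_B\t2\t2000\t9000\tnovel-patch\tXX000003.1\tNA\tPATCHES
"""


def read_regions(handle):
    """Parse tab-delimited rows into dicts keyed by the '#' header columns."""
    header = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # The most recent comment line before the data is taken as the header.
            header = line.lstrip("# ").split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


rows = read_regions(io.StringIO(SAMPLE_REPORT))
role_counts = Counter(row["Scaffold-Role"] for row in rows)
print(dict(role_counts))
```

Reading the column names from the header rather than hard-coding positions keeps the sketch tolerant of small layout differences between report versions.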
Accessing the Data: https://genomereference.org
Dumped daily
Frozen mappings to
prior assembly
versions in GFF3
Mapped to latest GRCh38 and GRCh37.p13
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
GRC Credits: https://genomereference.org
Alt Loci: Informatics Challenges
Masks and alt-aware aligners reduce the incidence of
ambiguous alignments observed when aligning reads to
the full assembly
Mask1: mask the chromosome for fix patches and the scaffold for novel patches/alts. Mask2: mask only on scaffolds
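The slides do not say how the masks are built. One common way to mask a region is to replace its bases with `N` so that reads cannot align there. The sketch below shows that general idea on a toy sequence. The function name and the interval convention (0-based, half-open) are assumptions, and which sequence gets masked (chromosome or scaffold) follows the Mask1/Mask2 descriptions above.

```python
def mask_intervals(sequence: str, intervals) -> str:
    """Replace each 0-based, half-open [start, end) interval with N."""
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy sequence and intervals (assumptions for illustration only).
toy_sequence = "ACGTACGTACGT"
masked_sequence = mask_intervals(toy_sequence, [(2, 5), (8, 10)])
print(masked_sequence)
```

In practice the intervals would come from the alignments of the patch or alt scaffolds to the chromosomes, so that only the redundant sequence is hidden from the aligner.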
Simulated Reads
GRCh38: Alt Loci
The Changing Reference
Dilthey et al.; Paten et al.

Understanding the reference assembly: CSHL Hackathon