3. Outline
• Differences between the reference genome assembly and other assemblies
• Features of the current reference assembly model and their relationship to genomic analyses and tools
• The changing reference genome assembly
4.
5. GRC Assembly Model
[Figure: sequences from haplotype 1 and sequences from haplotype 2]
Old assembly model: compress into a consensus
New assembly model: represent both haplotypes
6. GRC Assembly Model
The human reference genome assembly is not a haploid model
Alternate loci are not synonymous with haplotypes
[Figure: Assembly (e.g. GRCh38) made up of a Primary Assembly Unit, a Non-nuclear assembly unit (e.g. MT), and alternate loci assembly units ALT 1 to ALT 7. Regions shown: PAR, Genomic Region (MHC), Genomic Region (UGT2B17), Genomic Region (MAPT).]
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
7. GRC Assembly Model
[Figure: Assembly (e.g. GRCh38.p1) made up of a Primary Assembly Unit, a Non-nuclear assembly unit (e.g. MT), alternate loci assembly units ALT 1 to ALT 7, and a Patches assembly unit. Regions shown: PAR, Genomic Region (MHC), Genomic Region (UGT2B17), Genomic Region (MAPT), Genomic Region (ABO), Genomic Region (FOXO6), Genomic Region (FCGBP).]
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
Patches: scaffold status at next major assembly release
• FIX: -- (integrated)
• NOVEL: ALT LOCI
8. GRC Assembly Model
Fix patches are different from novel patches
[Figure: 1q32, 1q21, 1p21. Dennis et al., 2012]
9. The alignments of the alternate loci scaffolds to the chromosomes are part of the assembly
10. Anatomy of an alt
Alignment Legend
no alignment, mismatch, deletion
11. Anatomy of an alt
[Figure: component sequences. ALT LOCI: AC012314.8, CU151838.1. CHR. 19: AC012314.8, AC245052.3]
Alternate loci contain some sequence that is redundant to the primary assembly unit
13. Masks and alt aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly
Mask1: mask the chromosome for fix patches and the scaffold for novel patches/alts. Mask2: mask only on scaffolds
Simulated Reads
GRCh38: Alt Loci
14. GRC: Assembly Model
GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
19. Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Matthew Hurles
• Richard Gibbs
GRC Credits
20. Source/Recruitment of DNA Donors for Library Construction
Another implication of the fact that 99.9% of the human DNA sequence
is shared by any two individuals is that the backgrounds of the
individuals who donate DNA for the first human sequence will make no
scientific difference in terms of the usefulness and applicability of the
information that results from sequencing the human genome. At the
same time, there will undoubtedly be some sensitivity about the
choice of DNA sources. There are no scientific reasons why DNA donors
should not be selected from diverse pools of potential donors.
http://www.genome.gov/10000921 (August 17, 1996)
Reference Composition
21. Today’s reference assembly does not represent:
1. The most common allele
2. The longest allele
3. The ancestral allele
22. Roles for the reference
• Getting the sequence
• Cataloging genes (and other features)
• Establishing a coordinate system
• Humans vs. other organisms
Editor's Notes
I’d like to begin this talk by reminding everyone of the difference between a genome and an assembly. A human genome is a physical object. An assembly is our representation of that object. It is a model. And as shown here, genome models can take many forms.
And as these atomic models illustrate, scientific models evolve over time to reflect our growing knowledge base. And so it is with the human assembly model, the reference genome.
I’m going to cover three main areas in today’s talk:
What makes the reference assembly different from other genome assemblies
Features of the current reference assembly model and their relationship to genomic analyses and tools
The changing reference genome assembly
The human reference assembly is a special kind of genome model. In today’s era of personal genome sequencing, most assemblies model only a single diploid genome.
But the reference assembly is a model of many diploid genomes, meant to represent the “human” genome. This slide shows the assembly composition of the GRCh38 primary assembly. While 70% of the genome comes from one donor, sequence from >70 individuals is represented.
Even when assembling the genome of a single individual, there may be divergent haplotypes that confound genome assembly. In the original reference assembly model, which was essentially a stick model of linear chromosomes, there really wasn’t a good way to represent highly variant or complex genomic regions. Different haplotypes were simply compressed into a consensus. The insertion of different haplotypes in such regions, however, often led to non-existent allele combinations and artificial gaps, as illustrated here.
In the assembly model we’re using now, there’s a mechanism to cleanly represent multiple haplotypes: alternate loci. The current model allows the reference assembly to contain alternate representations for regions where haplotype compression isn’t appropriate or a single sequence path is considered insufficient. At the same time, it retains the linear chromosome models with which most users are comfortable.
GRCh37 was the first genome assembly to use this new model, which is illustrated in this cartoon. The first thing to know about the model is that the “assembly” is composed of multiple assembly units.
The primary assembly unit is the collection of chromosomes and unlocalized and unplaced scaffolds. This is essentially the original assembly model.
Non-nuclear genomes are assigned to their own assembly unit.
Regions are defined for those areas of the genome for which alternate sequence representation is desired.
Those alternate sequence representations go into alternate loci assembly units.
The first alternate sequence representation for each region goes into one assembly unit.
Each additional sequence representation for a region goes into its own assembly unit.
Alt loci are stand-alone scaffold sequences that are given chromosome context via their alignment to the primary assembly. I’ll return to this point shortly.
The assembly model also includes regions for the pseudo-autosomal regions of the primary assembly unit.
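The relationship between assembly units, regions and alt loci described above can be sketched as a small data structure. This is only an illustration of the model as I read it from these notes, not GRC code: the class names, the `ALT_n` unit naming and the scaffold names (`mhc_alt_a` and so on) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssemblyUnit:
    name: str
    kind: str  # "primary", "non-nuclear", "alt-loci" or "patches"
    scaffolds: List[str] = field(default_factory=list)


@dataclass
class Region:
    name: str
    # assembly-unit name -> scaffold that represents this region in that unit
    alts: Dict[str, str] = field(default_factory=dict)


@dataclass
class Assembly:
    name: str
    units: Dict[str, AssemblyUnit] = field(default_factory=dict)
    regions: Dict[str, Region] = field(default_factory=dict)

    def add_alt(self, region_name: str, scaffold: str) -> str:
        """Place an alternate representation of a region in an alt loci unit.

        Assumption: the n-th representation of a region goes into the n-th
        alt loci unit, so each unit holds at most one scaffold per region.
        """
        region = self.regions.setdefault(region_name, Region(region_name))
        unit_name = f"ALT_{len(region.alts) + 1}"  # illustrative naming only
        unit = self.units.setdefault(unit_name, AssemblyUnit(unit_name, "alt-loci"))
        unit.scaffolds.append(scaffold)
        region.alts[unit_name] = scaffold
        return unit_name


assembly = Assembly("example")
assembly.units["Primary"] = AssemblyUnit("Primary", "primary", ["chr6", "chr17"])
assembly.units["MT"] = AssemblyUnit("MT", "non-nuclear", ["chrM"])
print(assembly.add_alt("MHC", "mhc_alt_a"))    # first MHC alt
print(assembly.add_alt("MHC", "mhc_alt_b"))    # second MHC alt
print(assembly.add_alt("MAPT", "mapt_alt_a"))  # first MAPT alt
```

In this sketch the first alt for every region shares one unit, while a region with several alts spreads them across several units, which is why an alt loci unit is not the same thing as a haplotype.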
Another aspect of the assembly model I’d like to discuss is the patches. Patches enable the reference assembly to be updated without changing chromosome coordinates. This feature of the model allows the GRC to make assembly updates available in a timely fashion to those users whose work is adversely affected by errors in the current genome, while not disrupting the coordinates upon which other users rely.
Regions are defined for the genomic locations to be updated, and the sequences representing those updates are put into the “Patches” assembly unit. Like the alt loci, the patches are stand-alone scaffold sequences.
It’s important to distinguish the two types of patches:
(1) FIX patches correct problems in the assembly; they are integrated into the chromosomes and deprecated in the next major assembly release.
(2) NOVEL patches add new alternate sequence representations to the assembly; they become alternate loci in the next major assembly release.
It’s important to recognize that the way fix and novel patches should be used for analysis is different. The novel patches can be treated just like alt loci, and should be considered allelic to the chromosomes. In contrast, because the FIX patches represent assembly corrections, read alignments to the fix patches should take precedence over alignments to the chromosomes.
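As a rough sketch of that rule, the function below picks one alignment for a read from a set of candidate hits. The `Hit` type, the `kind` labels and the sequence names are hypothetical, and a real pipeline would also need to check that the chromosome hit actually falls in the region the fix patch corrects.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    target: str  # name of the sequence the read aligned to
    kind: str    # "chromosome", "fix_patch", "novel_patch" or "alt_locus"
    score: int   # alignment score, higher is better


def choose_hit(hits: List[Hit]) -> Optional[Hit]:
    """Pick one alignment, giving fix patches precedence over chromosomes."""
    if not hits:
        return None
    # A fix patch is a correction, so a hit to it wins regardless of score.
    fix_hits = [h for h in hits if h.kind == "fix_patch"]
    # Novel patches and alt loci are allelic to the chromosome, so they
    # compete with it on score alone.
    pool = fix_hits if fix_hits else hits
    return max(pool, key=lambda h: h.score)


corrected = choose_hit([Hit("chr1", "chromosome", 60), Hit("fix_1", "fix_patch", 55)])
allelic = choose_hit([Hit("chr6", "chromosome", 60), Hit("novel_1", "novel_patch", 55)])
print(corrected.target, allelic.target)
```

Here the fix patch wins even with the lower score, while the novel patch loses to the better-scoring chromosome hit.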
Fix patch updates can sometimes be quite dramatic. This slide shows a FIX patch that corrects a mis-assembly of the 1q21 region of GRCh37 involving the SRGAP gene family. In GRCh37 the 1q21 region was composed of a mix of sequences from the various SRGAP-associated duplications. The bottom panel shows the alignment of chr. 1 to the fix patch, and this dot matrix view really highlights how much things changed. From this, it’s easy to see how a fix patch could improve your analyses, and why you would want to exclude alignments from the corresponding chromosome region.
As I mentioned before, the alternate loci and patches are stand-alone scaffold sequences given chromosome context by virtue of their alignment to the chromosomes, as illustrated in this image of GRCh38 chr. 19, with its aligned alts and their corresponding regions. In the next slide, we’ll take a closer look at an alt in order to better appreciate its relationship to the chromosome.
Thus, another important thing to understand about the assembly model is that the alignments of the alt loci to the chromosomes are an integral part of the assembly. The alignment, in conjunction with the sequence, is what defines the alt.
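To make concrete how an alignment gives an alt its chromosome context, here is a minimal sketch that projects a position on an alt scaffold onto the chromosome. It assumes the alignment is available as ungapped blocks in the same orientation; the block layout and coordinates are invented for illustration and do not reflect any particular GRC file format.

```python
from typing import List, Optional, Tuple

# (alt_start, chrom_start, length), all 0-based; hypothetical representation
Block = Tuple[int, int, int]


def alt_to_chrom(pos: int, blocks: List[Block]) -> Optional[int]:
    """Return the chromosome position aligned to an alt position, if any."""
    for alt_start, chrom_start, length in blocks:
        if alt_start <= pos < alt_start + length:
            return chrom_start + (pos - alt_start)
    # The position lies in sequence with no alignment to the chromosome,
    # for example an insertion that exists only on the alt.
    return None


# Two aligned blocks separated by 150 bases found only on the alt.
blocks = [(0, 1000, 50), (200, 1050, 100)]
print(alt_to_chrom(10, blocks))
print(alt_to_chrom(100, blocks))
print(alt_to_chrom(250, blocks))
```

Without the blocks, the alt sequence on its own says nothing about where it belongs on the chromosome.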
This image shows one of the alt loci from the LRC/KIR region on chr. 19. The blue bars represent the component sequences of the alt. The alignment of the chromosome to the alt is shown beneath. The legend below explains the alignment graphic. We can see that there is a region at either end of the alt that aligns perfectly to the chromosome. To understand why this is the case, we will zoom in on one of these regions and compare to the corresponding region of chr. 19.
At this level, we see that the first component in the alternate locus is also a component in the chromosome. This is what is known as an anchor component. Anchor components are present in all human alt loci and are included to ensure a robust alignment to the chromosome. However, the extent to which the anchor component contributes sequence to the alternate locus and to the chromosome may differ, because the adjacent components are not the same and the switch points with them fall at different positions. The identical region of the alignment corresponds to the sub-region of the anchor that is common to both the alt and the chromosome.
As a result of the anchor sequences, all human alt loci contain some sequence that is redundant to the chromosomes. This is the reason that most aligners are not compatible with the alternate loci. Reads corresponding to anchor sequence will map identically to the alt and chromosome, resulting in depressed mapping scores and exclusion from downstream analyses. Reads mapping to other regions of the alternate loci that are similar to the chromosome, even if not derived from anchor sequences, will have the same issue. Alternate aware aligners must recognize the relationship of the alt to the chromosome and not treat reads that map in those regions as ambiguously placed.
In the interim, the GRC has been looking at approaches that may help users make use of existing tool chains. For example, we’ve tested use of a mask that hides the duplication in the alts. In this slide, you can see the mask we’ve generated for this GRCh38 alt locus, which has an insertion relative to the chromosome, but is identical for much of the remaining length.
Prior to the release of GRCh38, we began looking at the effect of masking on the alt-unaware BWA aligner and compared the results to those obtained with an NCBI-developed alternate aware aligner called srprism. In this analysis, simulated reads were aligned to the GRCh37.p9 primary or full assembly. For BWA, we tested masking of the alts/patches only, or masking a combination of sequences on the alts/patches and the chromosome. We then looked at the incidence of reads with ambiguous alignments.
As shown in the first two columns of the figure, there is an expected increase in multiple alignments when reads are aligned to the full assembly with BWA and no mask (expanded red). In the next two columns, you can see how use of either masking approach suppresses the increase in multiple alignments. The last two columns show that srprism, the alt aware aligner, does not need a mask to prevent ambiguous mappings.
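The masking idea can be sketched in a few lines. This is not the GRC's masking code: the function names are hypothetical, intervals are assumed to be 0-based and half-open, and the choice of which copy to mask simply follows the Mask1 and Mask2 definitions on the slide.

```python
from typing import Iterable, Tuple


def hard_mask(sequence: str, intervals: Iterable[Tuple[int, int]]) -> str:
    """Replace each 0-based, half-open interval of the sequence with N."""
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(masked):
            raise ValueError(f"interval ({start}, {end}) is out of range")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


def sequence_to_mask(scaffold_kind: str, strategy: str) -> str:
    """Return which copy of a duplicated stretch gets masked.

    Mask1: the chromosome for fix patches, the scaffold for novel patches
    and alts. Mask2: always the scaffold.
    """
    if strategy == "Mask1" and scaffold_kind == "fix_patch":
        return "chromosome"
    if strategy in ("Mask1", "Mask2"):
        return "scaffold"
    raise ValueError(f"unknown strategy: {strategy}")


# Hide the two ends of a toy alt scaffold that duplicate the chromosome.
print(hard_mask("ACGTACGTAC", [(0, 3), (8, 10)]))  # NNNTACGTNN
```

Once one copy of each duplicated stretch is replaced with N, an alt-unaware aligner has only one place left to put reads from that stretch.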
We will continue this analysis on GRCh38, with both simulated and actual reads, but I hope that even this preliminary data makes the point that it is possible to develop tools that can handle the alternate loci and may allow users to reap the benefits of using the full assembly in analyses.
GRCh38 has 178 regions associated with 261 alternate loci scaffolds. There is more than 3 Mb of sequence whose only representation in the assembly occurs in the alternate loci. We can now look at the value of the alternate loci and their implications for analysis.
One reason the alt loci add value to the assembly is their gene content. In GRCh38, there are 64 protein coding and 112 non-protein coding genes that are found only on the alternate loci.
An example is shown in this slide, where you can see several genes annotated in the regions of this alternate representation of the chr. 19 KIR region that have no alignment to the chromosome.
Thus, if you’re not using the alt loci in your analyses, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.
We’ve also been doing some analyses to investigate the severity of mapping errors that can occur when alternate loci aren’t used in alignment target sets. Since our analyses of GRCh38 are ongoing, I’ll describe an earlier study we did with the GRCh37.p9 assembly. In that study, we looked at the behavior of simulated reads sourced from sequence unique to patches or alt loci. We asked what happened to them when aligned to GRCh37 without the alt loci, where their true target is missing. We aligned the reads either as singletons or pairs, using two different aligners (BWA and srprism).
As shown in this graph, regardless of read pairing or the aligner, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters had an off-target alignment on the GRCh37 primary assembly (in blue). These off-target alignments are likely to result in errors in variation analyses.
This analysis demonstrates the value of including alternate loci in alignment target sets and again highlights the need for the development of alt aware aligners and downstream components of variant calling tool chains.
At this point, I’d like to shift gears for a little bit and conclude this talk by discussing the changing reference genome. New technologies and resources are one driving force for change. For example, a single haplotype hydatidiform mole resource is helping the GRC resolve highly complex regions. A PacBio long read was used in GRCh38 to provide sequence that had been impossible to resolve by other means. Optical map data is helping us resolve misassemblies, and will also be used to find regions that are missing sequence.
The assembly model itself may also change with time. The GRC is currently curating the CHM1 hydatidiform mole assembly in addition to the reference, and using sequence from it to improve the reference. In the future, the GRC expects to curate selected genomic regions from additional individuals representing diverse populations, to provide a more comprehensive representation of complex genomic regions. These may contribute new alt loci. At the moment, such additional genomes are considered distinct from the reference, as this slide illustrates. However, as new data becomes available both within and beyond the GRC, the GRC will continue to assess the assembly model and work towards its goal of providing a reference assembly that can be used to put any common human sequence in its chromosome context.
When the HGP was envisioned, we knew much less about human variation than we do today and the implications that using multiple donors might have for the reference assembly. That’s illustrated in the final sentence of this quote from the NHGRI/DOE guidelines for selecting DNA donors for the reference.
There’s still no doubt we want and need a reference assembly that represents diverse samples. But we now know, thanks in part to projects like 1000G, that DNA assemblies, our models of the genome, are affected by sample diversity.
Before I go any further, I want to point out what today’s reference is not. It does not represent:
The most common allele
The longest allele
The ancestral allele
It represents the sequence available from the HGP.
The role of the reference genome has also changed with time. At the start of the HGP, major goals for the reference included (1) getting the sequence, (2) cataloging genes and other features, and (3) establishing a coordinate system. We were also interested in (4) the differences between humans and other organisms.
Today, we’re just as interested in the differences between individual humans. This has led to calls that we need a reference that, if not a pan-genome, can provide representation for complex and/or diverse genomic regions.
In order to understand how the reference genome can represent the diversity we’re interested in, we’ll take a look at the reference assembly model and how it has changed since the HGP.